打印

哪种mcu指令结构处理以下情况最快?

[复制链接]
21825|64
手机看帖
扫描二维码
随时随地手机跟帖
跳转到指定楼层
楼主
teddeng|  楼主 | 2012-4-4 23:58 | 只看该作者 回帖奖励 |倒序浏览 |阅读模式
来自 2楼
李富贵| | 2012-4-5 20:04 | 只看该作者
本帖最后由 李富贵 于 2012-4-5 20:10 编辑

楼猪我给你抄本书吧。

7-3 Transposing a Bit Matrix
The transpose of a matrix A is a matrix whose columns are the rows of A and whose rows are the columns of A.
Here we consider the problem of computing the transpose of a bit matrix whose elements are single bits that are
packed eight per byte, with rows and columns beginning on byte boundaries. This seemingly simple transformation is
surprisingly costly in instructions executed.
On most machines it would be very slow to load and store individual bits, mainly due to the code that would be
required to extract and (worse yet) to store individual bits. A better method is to partition the matrix into 8x8
submatrices, load each 8x8 submatrix into registers, compute the transpose of the submatrix in registers, and then
store the 8x8 result in the appropriate place in the target matrix. This section first discusses the problem of computing
the transpose of the 8x8 submatrix.
It doesn't matter whether the matrix is stored in row-major or column-major order; computing the transpose consists
of the same operations in either event. Assuming for discussion that it's in row-major order, an 8x8 submatrix is
loaded into eight registers with eight load byte instructions, addressing a column of the source matrix. That is, the
addresses referenced by the load byte instructions are separated by multiples of the source matrix width in bytes.
After the transpose of the 8x8 submatrix is computed, it is stored in a column of the target matrix—that is, it is stored
with eight store byte instructions into locations separated by multiples of the width of the target matrix in bytes (which
is different from the width of the source matrix if the matrices are not square). Thus, we are given eight 8-bit
quantities right-justified in registers a0, a1, …, a7, and we wish to compute eight 8-bit quantities right-justified in
registers b0, b1, …, b7, for use in the store byte instructions. This is illustrated below, where each digit and letter
represents a single bit. Notice that we consider the main diagonal to run from bit 7 of byte 0 to bit 0 of byte 7. Some
readers with a little-endian background may be accustomed to thinking of the main diagonal as running from bit 0 of
byte 0 to bit 7 of byte 7.
 a0 = 0123 4567 b0 = 08go wEMU a1 = 89ab cdef b1 = 19hp xFNV a2 =
ghij klmn b2 = 2aiq yGOW a3 = opqr stuv ==> b3 = 3bjr zHPX a4 = wxyz
ABCD b4 = 4cks AIQY a5 = EFGH IJKL b5 = 5dlt BJRZ a6 = MNOP
QRST b6 = 6emu CKS$ a7 = UVWX YZ$. b7 = 7fnv DLT.

The straightforward code for this problem is to select and place each result bit individually, as follows. The
multiplications and divisions represent left and right shifts, respectively.
b0 = (a0 & 128) | (a1 & 128)/2 | (a2 & 128)/4 | (a3 & 128)/8 | (a4 & 128)/16 | (a5 & 128)/32 | (a6 & 128)/64 | (a7 )/128; 
b1 = (a0 & 64)*2 | (a1 & 64) | (a2 & 64)/2 | (a3 & 64)/4 | (a4 & 64)/8 | (a5 & 64)/16 | (a6 & 64)/32 | (a7 & 64)/64;
b2 = (a0 & 32)*4 | (a1 & 32)*2 | (a2 & 32) | (a3 & 32)/2 | (a4 & 32)/4 | (a5 & 32)/8 | (a6 & 32)/16 | (a7 & 32)/32;
b3 = (a0 & 16)*8 | (a1 & 16)*4 | (a2 & 16)*2 | (a3 & 16) | (a4 & 16)/2 | (a5 & 16)/4 | (a6 & 16)/8 | (a7 & 16)/16;
b4 = (a0 & 8)*16 | (a1 & 8)*8 | (a2 & 8)*4 | (a3 & 8)*2 | (a4 & 8) | (a5 & 8)/2 | (a6 & 8)/4 | (a7 & 8)/8;
b5 = (a0 & 4)*32 | (a1 & 4)*16 | (a2 & 4)*8 | (a3 & 4)*4 | (a4 & 4)*2 | (a5 & 4) | (a6 & 4)/2 | (a7 & 4)/4;
b6 = (a0 & 2)*64 | (a1 & 2)*32 | (a2 & 2)*16 | (a3 & 2)*8 | (a4 & 2)*4 | (a5 & 2)*2 | (a6 & 2) | (a7 & 2)/2;
b7 = (a0 )*128| (a1 & 1)*64 | (a2 & 1)*32 | (a3 & 1)*16| (a4 & 1)*8 | (a5 & 1)*4 | (a6 & 1)*2 | (a7 & 1);


This executes in 174 instructions on most machines (62 and's, 56 shift's, and 56 or's). The or's can of course be
add's. On PowerPC it can be done, perhaps surprisingly, in 63 instructions (seven move register's and 56 rotate left
word immediate then mask insert's). We are not counting the load byte and store byte instructions, nor their
addressing code.

Although there does not seem to be a really great algorithm for this problem, the method to be described beats the
straightforward method by more than a factor of 2 on a basic RISC machine.
First, treat the 8x8-bit matrix as 16 2x2-bit matrices, and transpose each of the 16 2x2-bit matrices. Second, treat
the matrix as four 2x2 submatrices whose elements are 2x2-bit matrices and transpose each of the four 2x2
submatrices. Finally, treat the matrix as a 2x2 matrix whose elements are 4x4-bit matrices, and transpose the 2x2
matrix. These transformations are illustrated below.
0123 4567 082a 4c6e 08go 4cks 08go wEMU 89ab cdef 193b 5d7f 19hp
5dlt 19hp xFNV ghij klmn goiq ksmu 2aiq 6emu 2aiq yGOW opqr stuv
hpjr ltnv 3bjr 7fnv 3bjr zHPX ==> ==> ==> wxyz
ABCD wEyG AICK wEMU AIQY 4cks AIQY EFGH IJKL xFzH BJDL xFNV
BJRZ 5dlt BJRZ MNOP QRST MUOW QYS$ yGOW CKS$ 6emu CKS$ UVWX YZ$. NVPX RZT. zHPX DLT. 7fnv DLT.

Rather than carrying out these steps on the eight individual bytes in eight registers, a net improvement results from first
packing the bytes four to a register, performing the bit-swaps on the two registers, and then unpacking. A complete
procedure is shown in Figure 7-2. Parameter A is the address of the first byte of an 8x8 submatrix of the source
matrix, which is of size 8mx8n bits. Similarly, parameter B is the address of the first byte of an 8x8 submatrix in the
target matrix, which is of size 8nx8m bits. That is, the full source matrix is 8mxn bytes, and the full target matrix is
8nxm bytes.
Figure 7-2 Transposing an 8x8-bit matrix.

void transpose8(unsigned char A[8], int m, int n, unsigned char B[8]) {
unsigned x, y, t; // Load the array and pack it into x and y.
x = (A[0]<<24) | (A[m]<<16) | (A[2*m]<<8) | A[3*m];
y = (A[4*m]<<24) | (A[5*m]<<16) | (A[6*m]<<8) | A[7*m];
t = (x ^ (x >> 7)) & 0x00AA00AA; x = x ^ t ^ (t << 7);
t = (y ^ (y >> 7)) & 0x00AA00AA; y = y ^ t ^ (t << 7);
t = (x ^ (x >>14)) & 0x0000CCCC; x = x ^ t ^ (t <<14);
t = (y ^ (y >>14)) & 0x0000CCCC; y = y ^ t ^ (t <<14);
t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F);
y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F); x = t;
B[0]=x>>24; B[n]=x>>16;B[2*n]=x>>8; B[3*n]=x; B[4*n]=y>>24; B[5*n]=y>>16; B[6*n]=y>>8; B[7*n]=y; }
The line
t = (x ^ (x >> 7)) & 0x00AA00AA; x = x ^ t ^ (t << 7);
is quite cryptic, for sure. It is swapping bits 1 and 8 (counting from the right), 3 and 10, 5 and 12, and so on, in word
x, while not moving bits 0, 2, 4, and so on. The swaps are done with the exclusive or method of bit swapping,
described on page 40. Word x, before and after the first round of swaps, is
0123 4567 89ab cdef ghij klmn opqr stuv 082a 4c6e 193b 5d7f goiq ksmu hpjr ltnv
To get a realistic comparison of these methods, the naive method described on page 109 was filled out into a
complete program similar to that of Figure 7-2. Both were compiled with the GNU C compiler to a target machine
that is very similar to the basic RISC. The resulting number of instructions, counting all load's, store's , addressing
code, prologs, and epilogs, is 219 for the naive code and 101 for Figure 7-2. (The prologs and epilogs were null,
except for a return branch instruction.) A version of the code of Figure 7-2 adapted to a 64-bit basic RISC (in which
x and y would be held in the same register) would be about 85 instructions.
The algorithm of Figure 7-2 runs from fine to coarse granularity, based on the lengths of the groups of bits that are
swapped. The method can also be run from coarse to fine granularity. To do this, first treat the 8x8-bit matrix as a
2x2 matrix whose elements are 4x4-bit matrices, and transpose the 2x2 matrix. Then treat each the four 4x4
submatrices as a 2x2 matrix whose elements are 2x2-bit matrices, and transpose each of the four 2x2 submatrices,

使用特权

评论回复
评分
参与人数 1威望 +1 收起 理由
Cortex-M0 + 1
板凳
ejack| | 2012-4-5 07:51 | 只看该作者
你是单单说速度最快,还是说指令效率最高?

使用特权

评论回复
地板
ayb_ice| | 2012-4-5 08:27 | 只看该作者
这种情况下51是比较快的

使用特权

评论回复
5
coody| | 2012-4-5 09:14 | 只看该作者
经常用到类似的处理,比如将8个位高低互换。一般我查表,最快。

使用特权

评论回复
6
teddeng|  楼主 | 2012-4-5 10:20 | 只看该作者
不是单字节旋转180度,而是把8个bit0取出来组成一个新的字节,依次处理8个位。实现很简单,不知道有没有哪种CPU有高效的指令。像楼上说的旋转180度,CM3一条指令就搞定了。印象里论坛提起过这问题,记得有人说某DSP有针对性的指令。我这个应用是LED矩阵的灰度控制,计算出每个LED的灰度值之后,所有LED需要从低到高逐位控制驱动电路,速度要求比较高,常规处理刷新率上不去,不想用CPLD。

使用特权

评论回复
7
liang7143| | 2012-4-5 12:41 | 只看该作者
如果空间够的话  查表最快

使用特权

评论回复
8
airwill| | 2012-4-5 13:05 | 只看该作者
51 的位指令的确可以体现不错的处理能力. 但是程序量很大(估计200多字节),看起来很多(因为不能用循环).
CM3 的移位指令很强悍, 单周期可以移多位, 完成 128 次移位操作, 就完成了这个任务. 全部在寄存器里捣鼓, 另外可以使用循环, 估计代码长度不会达到 200 字节. 当然有没有更聪明的操作方式, 暂时不好说.

使用特权

评论回复
9
lyjian| | 2012-4-5 13:20 | 只看该作者
51的位处理指令足够强悍了

使用特权

评论回复
10
lxyppc| | 2012-4-5 13:42 | 只看该作者
如果是32位处理器,可以把8x8bits转化成两个32bit数据,用位运算来做变换,可以快一点

使用特权

评论回复
11
coody| | 2012-4-5 14:12 | 只看该作者
看明白了,LZ是类似要将汉字字库旋转90度变成竖排在黑白LCD上显示那样的,我会先转换好字库,用51也可以转,然后存在SPI接口的FLASH中,或者传回PC保存成文件。

使用特权

评论回复
12
ayb_ice| | 2012-4-5 14:35 | 只看该作者
51的一位处理
moc c,acc.7
mov b.5,c

使用特权

评论回复
13
lgnativs| | 2012-4-5 14:53 | 只看该作者
0.j = 0;
1.i=0
2.取水平字节Hi;
3.算术右移一位;
4.垂直字节Vj算术左移;
5.i < 8? to->2 : to->6;
6.j < 8? to->1 : to->7;
7;over

相当于要执行一次8*8的循环,所需指令周期500-600左右

使用特权

评论回复
14
yhn1973| | 2012-4-5 15:32 | 只看该作者
把8个字节放在20H~27H中
MOV    C, 20H.0
MOV    ACC.0, C
MOV    C, 21H.1
MOV    ACC.1, C
MOV    C, 22H.2
MOV    ACC.2, C
MOV    C, 23H.3
MOV    ACC.3, C
MOV    C, 24H.4
MOV    ACC.4, C
MOV    C, 25H.5
MOV    ACC.5, C
MOV    C, 26H.6
MOV    ACC.6, C
MOV    C, 27H.7
MOV    ACC.7, C
结果在A中。
若用C8051,需要32个时钟,程序量为32字节。

使用特权

评论回复
15
lyjian| | 2012-4-5 16:08 | 只看该作者
楼上
你这只是处理一个字节而已

使用特权

评论回复
16
yhn1973| | 2012-4-5 16:23 | 只看该作者
呵呵,只是举了1个例子,方法都类似。

使用特权

评论回复
17
zhaofy521| | 2012-4-5 16:31 | 只看该作者
:L我表示还没看明白题

使用特权

评论回复
18
joyme| | 2012-4-5 16:48 | 只看该作者
LED需要从低到高逐位控制驱动电路

你跟驱动电路是怎么通信的,速度是多少?建议你自己模拟通信直接按位输出,不要重新组合

使用特权

评论回复
19
沈老| | 2012-4-5 19:13 | 只看该作者
用CPLD(FPGA)做一个硬件并串转换。(8字节并行写入,串行输出)

使用特权

评论回复
20
李富贵| | 2012-4-5 20:11 | 只看该作者
以上是《Hackers Delight》的部分章节,全书用电驴搜索并下载到。

使用特权

评论回复
发新帖 我要提问
您需要登录后才可以回帖 登录 | 注册

本版积分规则

56

主题

1215

帖子

9

粉丝