哪种mcu指令结构处理以下情况最快？

只看该作者 · 2012-4-4 23:58

内存中8个8位字节，依次取这8个字节的bit0，bit1，，，，bit7组成新的8个字节。形象点说相当于把这8个字节旋转了90度。有针对性的CPU最好，有好的方法也可以指点一下，谢谢！

只看该作者 · 2012-4-5 20:04

本帖最后由李富贵于 2012-4-5 20:10 编辑

楼猪我给你抄本书吧。

7-3 Transposing a Bit Matrix
The transpose of a matrix A is a matrix whose columns are the rows of A and whose rows are the columns of A.
Here we consider the problem of computing the transpose of a bit matrix whose elements are single bits that are
packed eight per byte, with rows and columns beginning on byte boundaries. This seemingly simple transformation is
surprisingly costly in instructions executed.
On most machines it would be very slow to load and store individual bits, mainly due to the code that would be
required to extract and (worse yet) to store individual bits. A better method is to partition the matrix into 8x8
submatrices, load each 8x8 submatrix into registers, compute the transpose of the submatrix in registers, and then
store the 8x8 result in the appropriate place in the target matrix. This section first discusses the problem of computing
the transpose of the 8x8 submatrix.
It doesn't matter whether the matrix is stored in row-major or column-major order; computing the transpose consists
of the same operations in either event. Assuming for discussion that it's in row-major order, an 8x8 submatrix is
loaded into eight registers with eight load byte instructions, addressing a column of the source matrix. That is, the
addresses referenced by the load byte instructions are separated by multiples of the source matrix width in bytes.
After the transpose of the 8x8 submatrix is computed, it is stored in a column of the target matrix—that is, it is stored
with eight store byte instructions into locations separated by multiples of the width of the target matrix in bytes (which
is different from the width of the source matrix if the matrices are not square). Thus, we are given eight 8-bit
quantities right-justified in registers a0, a1, …, a7, and we wish to compute eight 8-bit quantities right-justified in
registers b0, b1, …, b7, for use in the store byte instructions. This is illustrated below, where each digit and letter
represents a single bit. Notice that we consider the main diagonal to run from bit 7 of byte 0 to bit 0 of byte 7. Some
readers with a little-endian background may be accustomed to thinking of the main diagonal as running from bit 0 of
byte 0 to bit 7 of byte 7.

复制

 a0 = 0123 4567 b0 = 08go wEMU a1 = 89ab cdef b1 = 19hp xFNV a2 =

ghij klmn b2 = 2aiq yGOW a3 = opqr stuv ==> b3 = 3bjr zHPX a4 = wxyz

ABCD b4 = 4cks AIQY a5 = EFGH IJKL b5 = 5dlt BJRZ a6 = MNOP

QRST b6 = 6emu CKS$ a7 = UVWX YZ$. b7 = 7fnv DLT.

The straightforward code for this problem is to select and place each result bit individually, as follows. The
multiplications and divisions represent left and right shifts, respectively.

复制

b0 = (a0 & 128) | (a1 & 128)/2 | (a2 & 128)/4 | (a3 & 128)/8 | (a4 & 128)/16 | (a5 & 128)/32 | (a6 & 128)/64 | (a7 )/128; 

b1 = (a0 & 64)*2 | (a1 & 64) | (a2 & 64)/2 | (a3 & 64)/4 | (a4 & 64)/8 | (a5 & 64)/16 | (a6 & 64)/32 | (a7 & 64)/64; 

b2 = (a0 & 32)*4 | (a1 & 32)*2 | (a2 & 32) | (a3 & 32)/2 | (a4 & 32)/4 | (a5 & 32)/8 | (a6 & 32)/16 | (a7 & 32)/32; 

b3 = (a0 & 16)*8 | (a1 & 16)*4 | (a2 & 16)*2 | (a3 & 16) | (a4 & 16)/2 | (a5 & 16)/4 | (a6 & 16)/8 | (a7 & 16)/16; 

b4 = (a0 & 8)*16 | (a1 & 8)*8 | (a2 & 8)*4 | (a3 & 8)*2 | (a4 & 8) | (a5 & 8)/2 | (a6 & 8)/4 | (a7 & 8)/8; 

b5 = (a0 & 4)*32 | (a1 & 4)*16 | (a2 & 4)*8 | (a3 & 4)*4 | (a4 & 4)*2 | (a5 & 4) | (a6 & 4)/2 | (a7 & 4)/4; 

b6 = (a0 & 2)*64 | (a1 & 2)*32 | (a2 & 2)*16 | (a3 & 2)*8 | (a4 & 2)*4 | (a5 & 2)*2 | (a6 & 2) | (a7 & 2)/2; 

b7 = (a0 )*128| (a1 & 1)*64 | (a2 & 1)*32 | (a3 & 1)*16| (a4 & 1)*8 | (a5 & 1)*4 | (a6 & 1)*2 | (a7 & 1);

This executes in 174 instructions on most machines (62 and's, 56 shift's, and 56 or's). The or's can of course be
add's. On PowerPC it can be done, perhaps surprisingly, in 63 instructions (seven move register's and 56 rotate left
word immediate then mask insert's). We are not counting the load byte and store byte instructions, nor their
addressing code.

Although there does not seem to be a really great algorithm for this problem, the method to be described beats the
straightforward method by more than a factor of 2 on a basic RISC machine.
First, treat the 8x8-bit matrix as 16 2x2-bit matrices, and transpose each of the 16 2x2-bit matrices. Second, treat
the matrix as four 2x2 submatrices whose elements are 2x2-bit matrices and transpose each of the four 2x2
submatrices. Finally, treat the matrix as a 2x2 matrix whose elements are 4x4-bit matrices, and transpose the 2x2
matrix. These transformations are illustrated below.

复制

0123 4567 082a 4c6e 08go 4cks 08go wEMU 89ab cdef 193b 5d7f 19hp

5dlt 19hp xFNV ghij klmn goiq ksmu 2aiq 6emu 2aiq yGOW opqr stuv

hpjr ltnv 3bjr 7fnv 3bjr zHPX ==> ==> ==> wxyz

ABCD wEyG AICK wEMU AIQY 4cks AIQY EFGH IJKL xFzH BJDL xFNV

BJRZ 5dlt BJRZ MNOP QRST MUOW QYS$ yGOW CKS$ 6emu CKS$ UVWX YZ$. NVPX RZT. zHPX DLT. 7fnv DLT.

Rather than carrying out these steps on the eight individual bytes in eight registers, a net improvement results from first
packing the bytes four to a register, performing the bit-swaps on the two registers, and then unpacking. A complete
procedure is shown in Figure 7-2. Parameter A is the address of the first byte of an 8x8 submatrix of the source
matrix, which is of size 8mx8n bits. Similarly, parameter B is the address of the first byte of an 8x8 submatrix in the
target matrix, which is of size 8nx8m bits. That is, the full source matrix is 8mxn bytes, and the full target matrix is
8nxm bytes.
Figure 7-2 Transposing an 8x8-bit matrix.

复制



void transpose8(unsigned char A[8], int m, int n, unsigned char B[8]) {

unsigned x, y, t; // Load the array and pack it into x and y. 

x = (A[0]<<24) | (A[m]<<16) | (A[2*m]<<8) | A[3*m]; 

y = (A[4*m]<<24) | (A[5*m]<<16) | (A[6*m]<<8) | A[7*m]; 

t = (x ^ (x >> 7)) & 0x00AA00AA; x = x ^ t ^ (t << 7); 

t = (y ^ (y >> 7)) & 0x00AA00AA; y = y ^ t ^ (t << 7); 

t = (x ^ (x >>14)) & 0x0000CCCC; x = x ^ t ^ (t <<14); 

t = (y ^ (y >>14)) & 0x0000CCCC; y = y ^ t ^ (t <<14); 

t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F); 

y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F); x = t; 

B[0]=x>>24; B[n]=x>>16;B[2*n]=x>>8; B[3*n]=x; B[4*n]=y>>24; B[5*n]=y>>16; B[6*n]=y>>8; B[7*n]=y; }

The line
t = (x ^ (x >> 7)) & 0x00AA00AA; x = x ^ t ^ (t << 7);
is quite cryptic, for sure. It is swapping bits 1 and 8 (counting from the right), 3 and 10, 5 and 12, and so on, in word
x, while not moving bits 0, 2, 4, and so on. The swaps are done with the exclusive or method of bit swapping,
described on page 40. Word x, before and after the first round of swaps, is
0123 4567 89ab cdef ghij klmn opqr stuv 082a 4c6e 193b 5d7f goiq ksmu hpjr ltnv
To get a realistic comparison of these methods, the naive method described on page 109 was filled out into a
complete program similar to that of Figure 7-2. Both were compiled with the GNU C compiler to a target machine
that is very similar to the basic RISC. The resulting number of instructions, counting all load's, store's , addressing
code, prologs, and epilogs, is 219 for the naive code and 101 for Figure 7-2. (The prologs and epilogs were null,
except for a return branch instruction.) A version of the code of Figure 7-2 adapted to a 64-bit basic RISC (in which
x and y would be held in the same register) would be about 85 instructions.
The algorithm of Figure 7-2 runs from fine to coarse granularity, based on the lengths of the groups of bits that are
swapped. The method can also be run from coarse to fine granularity. To do this, first treat the 8x8-bit matrix as a
2x2 matrix whose elements are 4x4-bit matrices, and transpose the 2x2 matrix. Then treat each the four 4x4
submatrices as a 2x2 matrix whose elements are 2x2-bit matrices, and transpose each of the four 2x2 submatrices,

只看该作者 · 2012-4-5 07:51

你是单单说速度最快，还是说指令效率最高？

1万 · 2012-4-5 08:27

这种情况下51是比较快的

2万 · 2012-4-5 09:14

经常用到类似的处理，比如将8个位高低互换。一般我查表，最快。

只看该作者 · 2012-4-5 10:20

不是单字节旋转180度，而是把8个bit0取出来组成一个新的字节，依次处理8个位。实现很简单，不知道有没有哪种CPU有高效的指令。像楼上说的旋转180度，CM3一条指令就搞定了。印象里论坛提起过这问题，记得有人说某DSP有针对性的指令。我这个应用是LED矩阵的灰度控制，计算出每个LED的灰度值之后，所有LED需要从低到高逐位控制驱动电路，速度要求比较高，常规处理刷新率上不去，不想用CPLD。

只看该作者 · 2012-4-5 12:41

如果空间够的话查表最快

1万 · 2012-4-5 13:05

51 的位指令的确可以体现不错的处理能力. 但是程序量很大(估计200多字节),看起来很多(因为不能用循环).
CM3 的移位指令很强悍, 单周期可以移多位, 完成 128 次移位操作, 就完成了这个任务. 全部在寄存器里捣鼓, 另外可以使用循环, 估计代码长度不会达到 200 字节. 当然有没有更聪明的操作方式, 暂时不好说.

2万 · 2012-4-5 13:20

51的位处理指令足够强悍了

只看该作者 · 2012-4-5 13:42

如果是32位处理器，可以把8x8bits转化成两个32bit数据，用位运算来做变换，可以快一点

2万 · 2012-4-5 14:12

看明白了，LZ是类似要将汉字字库旋转90度变成竖排在黑白LCD上显示那样的，我会先转换好字库，用51也可以转，然后存在SPI接口的FLASH中，或者传回PC保存成文件。

1万 · 2012-4-5 14:35

51的一位处理
moc c,acc.7
mov b.5,c

只看该作者 · 2012-4-5 14:53

0.j = 0;
1.i=0
2.取水平字节Hi;
3.算术右移一位;
4.垂直字节Vj算术左移;
5.i < 8? to->2 : to->6;
6.j < 8? to->1 : to->7;
7;over

相当于要执行一次8*8的循环,所需指令周期500-600左右

只看该作者 · 2012-4-5 15:32

把8个字节放在20H～27H中
MOV C, 20H.0
MOV ACC.0, C
MOV C, 21H.1
MOV ACC.1, C
MOV C, 22H.2
MOV ACC.2, C
MOV C, 23H.3
MOV ACC.3, C
MOV C, 24H.4
MOV ACC.4, C
MOV C, 25H.5
MOV ACC.5, C
MOV C, 26H.6
MOV ACC.6, C
MOV C, 27H.7
MOV ACC.7, C
结果在A中。
若用C8051，需要32个时钟，程序量为32字节。

2万 · 2012-4-5 16:08

楼上
你这只是处理一个字节而已

只看该作者 · 2012-4-5 16:23

呵呵，只是举了1个例子，方法都类似。

只看该作者 · 2012-4-5 16:31

:L我表示还没看明白题

只看该作者 · 2012-4-5 16:48

LED需要从低到高逐位控制驱动电路

你跟驱动电路是怎么通信的，速度是多少？建议你自己模拟通信直接按位输出，不要重新组合

只看该作者 · 2012-4-5 19:13

用CPLD（FPGA）做一个硬件并串转换。（8字节并行写入，串行输出）

只看该作者 · 2012-4-5 20:11

以上是《Hackers Delight》的部分章节，全书用电驴搜索并下载到。

哪种mcu指令结构处理以下情况最快？

相关下载

相关帖子

无冕之王奖章

湍急之河流

七世轮回

技术领袖奖章

技术导师奖章

突出贡献奖章

沉静之湖泊

十世金身

伴坛终老

技术奇才奖章

荣誉元老奖章

坚毅之洋流

精英会员奖章

奔腾之江水

缘定三生

核心会员奖章

社区建设奖章