cached data; otherwise the line frame is valid. Note that L1P has no LRU bit.
The figure below shows how the cache controller interprets a program address:
Figure 8 C64x Memory Address from Cache Controller
Figure 9 C621x/C671x L1P Address Allocation
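As Figure 8 shows, when the CPU fetches an instruction from an address, the cache controller splits that address into three fields. Bits 5 to 13 select the set the address maps to (in a direct-mapped cache the set number equals the line frame number); the remaining low bits 0 to 4 are the offset within the line. The controller then checks the Valid bit and compares the tag (bits 14 to 31), which is needed because multiple lines in memory are mapped to the same set in the cache; see Figure 1-6 for the exact operation. If the result is 0, the access is a read miss. A read miss also means that a line frame will be allocated for the line containing the requested address. Because L1P is read-allocate, the L1P controller fetches the fetch packet from L2 or memory and places it in the corresponding L1P line frame. The tag is set, V is set to 1 to mark the set as holding valid data, and the fetch packet is also forwarded to the CPU, which completes the access.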
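The address split and the read-miss allocation described above can be sketched in C. This is only an illustrative model, not TI code: the names `l1p_split`, `l1p_fetch`, `l1p_addr_fields` and `l1p_frame` are hypothetical, and the field widths simply follow the bit ranges given in the text (offset 0-4, set 5-13, tag 14-31).

```c
#include <stdint.h>

/* Field widths taken from the text: bits 0-4 offset, bits 5-13 set, bits 14-31 tag. */
#define L1P_OFFSET_BITS 5u
#define L1P_SET_BITS    9u

/* Hypothetical helper type holding the three address fields. */
typedef struct {
    uint32_t tag;
    uint32_t set;
    uint32_t offset;
} l1p_addr_fields;

/* Hypothetical model of one direct-mapped line frame (data omitted). */
typedef struct {
    uint32_t tag;
    int      valid;
} l1p_frame;

static l1p_frame l1p[1u << L1P_SET_BITS];   /* one line frame per set */

/* Split a 32-bit program address into tag, set and offset. */
static l1p_addr_fields l1p_split(uint32_t addr)
{
    l1p_addr_fields f;
    f.offset = addr & ((1u << L1P_OFFSET_BITS) - 1u);
    f.set    = (addr >> L1P_OFFSET_BITS) & ((1u << L1P_SET_BITS) - 1u);
    f.tag    = addr >> (L1P_OFFSET_BITS + L1P_SET_BITS);
    return f;
}

/* Returns 1 on a hit. On a read miss it allocates the line frame
   (sets the tag and V=1, as in the text) and returns 0. */
static int l1p_fetch(uint32_t addr)
{
    l1p_addr_fields f = l1p_split(addr);
    l1p_frame *lf = &l1p[f.set];

    if (lf->valid && lf->tag == f.tag) {
        return 1;
    }
    lf->tag = f.tag;
    lf->valid = 1;
    return 0;
}
```

In this model two addresses that are 16 KB apart, such as 0x80000000 and 0x80004000, share set 0 and evict each other on every alternate access, which is the conflict-miss behaviour that motivates the set-associative design discussed next.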
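Note: the most important rule when using a cache is never to replace a line frame while its contents are still useful, that is, to maximize cache line reuse.
One remedy for cache misses (conflict misses in particular) is to build sets that contain several line frames, which is the principle behind the 2-way set-associative L1D cache. Several memory lines with the same set index can then reside in the cache at the same time without conflicting, which raises the hit rate.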
L1D Cache (2-Way Set-Associative Cache)
The L1D cache is a 2-way set-associative cache of 16 KB (4 KB on C621x/C671x) with a line size of 64 bytes (32 bytes on C621x/C671x). Two-way set-associative means that every memory block can be placed in either of two line frames, so compared with a direct-mapped cache L1D greatly reduces the chance of conflict misses. As with L1P, an L1D miss disrupts the instruction pipeline. Unlike L1P, L1D is writable, which can introduce write latency; this is discussed in detail below.
The C64x L1D is a 2-way set-associative cache; the table below lists its basic characteristics:
Table 5 L1D Characteristics
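Each set of a 2-way set-associative cache contains two line frames, one in way 0 and one in way 1. A line in memory still maps to exactly one set in the cache, but it can be stored in either the way 0 or the way 1 line frame of that set. A direct-mapped cache can be regarded as a 1-way cache.
As the table shows, each C64x L1D line frame holds 64 bytes (32 bytes on C621x/C671x), and the cache is read-allocate and write-back; both terms are explained later. The table also shows that on a write miss L1D sends the data to the next memory level through a write buffer of 4 x 64-bit entries (4 x 32-bit on C621x/C671x).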
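Figure 10 C621x/C671x L1D Address Allocation
The figure below shows the structure of a set-associative cache and how hits and misses are determined. The procedure is much the same as for a direct-mapped cache: the address is first decoded to find its set. The difference is that there are two tag comparisons (to determine which way holds the data), each qualified by a Valid bit check. The LRU bit determines which way receives the requested data on a miss, and the Dirty bit (D=1) determines whether the replaced line must be written back.
Figure 11 C64x L1D Architecture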
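When neither way holds the requested data (a read miss), the line must be fetched from memory, and the controller has to decide which way receives it. This is the job of the LRU (least-recently-used) bit, which records which way of a set was used least recently. Every set in the cache has one LRU bit. If LRU=0, the new line is allocated in the way 0 line frame; if LRU=1, it is allocated in the way 1 line frame. When a way is accessed, the LRU bit always switches to the opposite way, which protects the most-recently-used line frame from being replaced. On a read miss, the LRU line frame of the set is allocated to the new line, replacing its current contents. This policy relies on locality of reference: a line that was used recently is likely to be used again soon. Note that the LRU bit is consulted only when a miss occurs, but its state is updated every time a line frame is accessed, whether the access is a hit or a miss, a read or a write.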
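The LRU rule above can be modelled for a single set. The sketch below is a simplified illustration under stated assumptions, not the hardware implementation: `set2way` and `set_read` are hypothetical names, only tags are tracked (no data), and only reads are modelled.

```c
#include <stdint.h>

/* Hypothetical model of one set in a 2-way set-associative cache. */
typedef struct {
    uint32_t tag[2];
    int      valid[2];
    int      lru;       /* way that is least recently used (next victim) */
} set2way;

/* Read the line with the given tag from the set.
   Returns the way that holds the line afterwards; *hit reports hit or miss. */
static int set_read(set2way *s, uint32_t tag, int *hit)
{
    int way;

    for (way = 0; way < 2; way++) {
        if (s->valid[way] && s->tag[way] == tag) {
            *hit = 1;
            s->lru = 1 - way;   /* accessed way becomes MRU, the other way LRU */
            return way;
        }
    }

    /* Read miss: the LRU bit is consulted only here. */
    *hit = 0;
    way = s->lru;
    s->tag[way] = tag;
    s->valid[way] = 1;
    s->lru = 1 - way;           /* the freshly filled way is now MRU */
    return way;
}
```

Starting from an empty set with LRU=0, reading tags A, B, A, C in that order places A in way 0 and B in way 1; the second read of A hits and points the LRU bit at way 1, so C replaces B rather than A.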
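As mentioned above, L1D is a read-allocate cache: a line frame is remapped only when a read miss occurs. On a write miss the data does not pass through the L1D cache but is sent through the write buffer to the next memory level (L2 SRAM or external memory), because L1D is not write-allocate. The write buffer has four 64-bit entries (32-bit on C621x/C671x).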
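L1D is also a write-back cache, so on a write hit the data is written only to the cache and not immediately to the corresponding lower-level memory. To be able to write the modified data back correctly later, the cache must know which line frames the CPU has modified, so every line frame has an associated Dirty bit (D). The Dirty bit is initially 0 and is set to 1 once the CPU has modified the line. When a line frame has to be replaced because of a read miss, its Dirty bit is checked. If the line is dirty (D=1), its contents are first moved to the victim buffer and written back to the lower-level memory by a write-back, and only then is the new data stored in the line frame; otherwise (D=0) its contents are simply discarded. The program can also trigger this write-back explicitly by sending a write-back command to the cache controller.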
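The interplay of read-allocate, write-back and the Dirty bit can be illustrated with a single line frame. This is a minimal sketch under simplifying assumptions: the names `frame`, `l1d_stats`, `cpu_read` and `cpu_write` are hypothetical, the write buffer and victim buffer are reduced to counters, and no data is stored.

```c
#include <stdint.h>

/* Hypothetical model of one L1D line frame (data omitted). */
typedef struct {
    uint32_t tag;
    int      valid;
    int      dirty;
} frame;

/* Counters standing in for the victim buffer and the write buffer. */
typedef struct {
    unsigned writebacks;       /* dirty victims written back to the next level */
    unsigned buffered_writes;  /* write misses sent through the write buffer */
} l1d_stats;

static int frame_hit(const frame *f, uint32_t tag)
{
    return f->valid && f->tag == tag;
}

/* Write: a hit only marks the line dirty; a miss bypasses L1D (no allocation). */
static void cpu_write(frame *f, uint32_t tag, l1d_stats *st)
{
    if (frame_hit(f, tag)) {
        f->dirty = 1;
    } else {
        st->buffered_writes++;
    }
}

/* Read: a miss allocates the frame, writing back a dirty victim first. */
static void cpu_read(frame *f, uint32_t tag, l1d_stats *st)
{
    if (frame_hit(f, tag)) {
        return;
    }
    if (f->valid && f->dirty) {
        st->writebacks++;
    }
    f->tag = tag;
    f->valid = 1;
    f->dirty = 0;
}
```

A write to an uncached line only increments `buffered_writes` and leaves the frame untouched, while a read miss that evicts a line modified by an earlier write hit increments `writebacks` exactly once.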
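L2: L2 Cache and L2 SRAM
Level 2 is the on-chip memory of the C64x. The on-chip L2 memory is 1024 KB and can be configured entirely as L2 SRAM or as a mix of L2 SRAM and L2 cache. The L2 line size is 128 bytes. On C64x the L2 cache is 4-way set-associative and can be configured as 0 KB, 32 KB, 64 KB, 128 KB or 256 KB. On C621x/C671x the associativity depends on the configured cache size: 16 KB (1-way), 32 KB (2-way), 48 KB (3-way) or 64 KB (4-way). The L2 cache takes at most 256 KB (64 KB on C621x/C671x) of the L2 memory, so on this C64x device at least 1024 - 256 = 768 KB of L2 SRAM remains available to the program. Making good use of L2 SRAM is where the programmer should invest the most effort.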
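Table 6 L2 Cache Characteristics
The L1 and L2 caches cooperate as follows: when an address misses in L1, it is looked up in L2, which uses the same mechanism to determine whether the requested address is present. If L1 hits, the access is served entirely by L1 and L2 is not involved.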
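The table above lists the L2 cache characteristics of the C6000 DSPs. The L2 memory space can be split into addressable on-chip memory (L2 SRAM) and cache (L2 cache). Unlike the read-allocate L1 caches, the L2 cache is read- and write-allocate. The L2 cache caches only external memory addresses, whereas the L1 caches cache both L2 SRAM and external memory addresses. L2 uses the Valid bit and the tag comparison to determine whether a requested address is in the L2 cache; the LRU bits determine which way the line frame is allocated in (on a read miss); the Dirty bit indicates that the line frame must first be written back to external memory before the new line is fetched (if the line is also held in L1D, it is first written back to L2 to maintain cache coherence).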
Figure 12 C621x/C671x L2 Address Allocation (All L2 Cache Modes)
The following example, in which the CPU reads a cacheable external memory address, illustrates how the L2 cache works.
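1. The access misses in L1 (L1P or L1D) and also misses in the L2 cache. The corresponding line is brought from external memory into the L2 cache, and the LRU bits decide which line frame receives it. If that line frame holds dirty data, the data is written back to its location in external memory when the new line replaces it. (If the line is also held in L1D, L1D must write it back to L2 before the L2 line is written back to external memory; this maintains cache coherence.) The new line is then reformatted to the L1 line size and handed to the L1 cache, which stores it and passes the requested data to the CPU. Note that if the L1 line frame that receives the line holds dirty data, that data must likewise be written back to the L2 cache first.
2. The address hits in the L2 cache. The corresponding line is passed directly to L1 and on to the CPU.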
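As stated earlier, the L2 cache is read- and write-allocate. This means that when the CPU writes to external memory and the access misses in both L1 and L2, the corresponding line is brought from external memory into an L2 line frame just as for a read. The steps are also similar to a read: if the replaced line frame holds dirty data, that data is first written back to external memory. Note, however, that this line does not appear in L1D, because the L1D cache is read-allocate only and not write-allocate. On an L2 write hit, the L2 line frame is updated directly with the data written by the CPU.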
L1P Cache: 4 KB, 64 bytes/line (C621x/C671x)
L1D Cache:
L2 Memory: comprises L2 SRAM (addressable on-chip memory) and L2 cache (caching external memory locations)
Types of Cache Misses
Capacity miss:
Remedies: 1) reduce the amount of data that is operated on at a time; 2) increase the capacity of the cache.
Conflict miss:
Remedies: 1) change the memory layout; 2) create sets that can hold two or more lines.
Compulsory miss (first-reference miss): occurs the first time data is accessed and cannot be avoided unless the system has prefetched the data.
Cache Miss (read miss, write miss):
Cache Hit (read hit, write hit):
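For every kind of miss, the cache controller incurs a delay while it brings the data from memory into the cache. For best performance, the contents of each line should be reused as much as possible before the line is replaced. Reusing a line while accessing different locations within it improves the spatial locality of accesses, whereas reusing the same locations of a line improves the temporal locality of accesses. This is the most basic rule for optimizing cache performance: maximize cache line reuse.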
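Model of a cache-based memory system: Execution Time = Cycle Count / CPU Clock Rate. Execution time drops when the cycle count is reduced or the CPU clock rate is raised.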
Optimizing Cache Performance: be familiar with the cache memory architecture, in particular with characteristics such as line size, associativity, capacity, replacement scheme, read/write allocation, miss pipelining, and write buffer.
Application-Level Optimizations: use L2 cache and DMA
Procedural-Level Optimizations: choose data types that reduce memory bandwidth; chaining; avoid L1P and L1D conflict misses; avoid L1D thrashing; avoid capacity misses; avoid write buffer stalls
L2 Access Conflict: L2 can service only one request at a time. The priority order is: L1P read miss; L1D read or write miss; EDMA read or write; internal cache operations (victim writebacks, line fills, snoops). Simultaneous accesses to the same bank also count as an access conflict.
L2 Bank Conflict: Since an L2 access requires 2 cycles to complete, accesses to the same bank on consecutive cycles cause a stall.
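Optimizing for L1P is comparatively simple. The main rule is to implement algorithms with loops and iteration wherever possible, which keeps the code size small.
Cache Flush performs two actions: it writes the contents of cache memory back to external memory and then cleans the cache memory. Cache Clean performs only one action: it cleans the cache memory.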
Execute packet: may contain between 1 and 8 instructions.
Fetch packet: a block of 8 instructions; one fetch packet may contain multiple execute packets.
Miss pipelining (pipelined misses): the process of servicing a single cache miss is pipelined over several cycles, so the processing of several misses can overlap.
Associativity: the number of line frames in each set, i.e. the number of ways.
Long-distance access: a CPU access to external noncacheable memory.
Line: a cache line is the smallest block of data that the cache operates on.
Set: a collection of line frames in a cache. A direct-mapped cache contains one line frame per set, and an N-way set-associative cache contains N line frames per set. A fully-associative cache has only one set that contains all of the line frames in the cache.
Way: each set in the cache contains multiple line frames; the number of line frames in each set is referred to as the number of ways in the cache.
Victim Buffer: a special buffer that holds victims until they are written back.
Write Buffer:
Write merging: combining multiple independent writes into a single, larger write. It applies, for example, to DMA writes; the L1D write buffer or the victim buffer can also merge multiple writes.
Memory System Coherence
1. Cache Coherence Problem
Coherence between CPU and EDMA (peripheral) or host accesses: a memory system is coherent if any read of a data item returns the most recently written value of that data item. A coherent memory system ensures that all writes to a given memory location are visible to future reads.
For example: a peripheral writes and a subsequent CPU read hits, so memory has been updated but the cache has not; or a CPU write hits and a peripheral then reads, so the cache has been updated but memory has not. In both cases cache and memory are incoherent.
Consequently, if a memory location is shared, cached, and has been modified, there is a cache coherence problem. It arises under the following conditions:
Multiple devices (CPUs, peripherals, DMA controllers) share a region of memory for the purpose of data exchange;
This memory region is cacheable by at least one device;
A memory location in this region has been cached;
And this memory location is modified (by any device)
2. Snoop Commands: a lower-level memory checks whether the requested address is cached (valid) in a higher-level memory
L1D Snoop Command (C64x devices only):
Writes back a line from L1D to L2 SRAM/cache
Used for DMA reads of L2 SRAM
L1D Snoop-Invalidate Command:
Writes back a line from L1D to L2 SRAM/cache and invalidates it in L1D
Used for DMA writes to L2 SRAM and user-controlled cache operations
L1P Invalidate Command:
Invalidates a line in L1P
Used for DMA write of L2 SRAM and user-controlled cache operations
Note: DMA is not allowed to access addresses that map to L2 cache.
3. Cache Coherence Protocol: DMA Accesses to L2 SRAM
Figure 13 DMA Write to L2 SRAM
*)If line is dirty it is first written back to L2 SRAM and merged with the new data written by the DMA.
DMA write: the snoop commands are an L1D snoop-invalidate (write back and invalidate) and an L1P invalidate.
DMA read: write back only.
*) A snoop command is sent on C64x DSP, the line is written back and kept valid.
On C621x/C671x DSP, a snoop-invalidate command is sent which additionally invalidates the line in L1D.
Figure 14 DMA Read of L2 SRAM
4. Ways to resolve cache incoherence: 1) clean or flush cache memory; 2) double buffering (ping-pong buffering); 3) disabling external memory caching
Table 7 Coherence Assurances in the Two-Level Memory System
Double buffering (ping-pong buffering) uses two sets of input and output buffers: one for the data the CPU is processing and one for EDMA transfers in progress. That gives four buffers, InBuffA and OutBuffA plus InBuffB and OutBuffB, while L1D and L2 SRAM are kept coherent.
Double-buffering example program: D:\CCStudio_v3.3\Cache Examples\DSKC6711\L2_double_buf
Double-buffering example program for external memory: D:\CCStudio_v3.3\Cache Examples\DSKC6711\ext_double_buf
In addition to the coherence operations, it is important that all DMA buffers are aligned at an L2 cache line and are an integral multiple of cache lines large. The cache controller always operates on whole cache lines, so each block should be a whole multiple of the cache line size and aligned on a line boundary.
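The alignment and size rule can be checked with a few lines of C. This is a sketch only: `L2_LINE_SIZE` uses the 128-byte L2 line size stated earlier in this document, and the helper names `round_up_to_line`, `is_line_aligned` and `dma_buffer_ok` are hypothetical, not part of any TI library.

```c
#include <stddef.h>
#include <stdint.h>

#define L2_LINE_SIZE 128u   /* L2 line size in bytes, as stated in this document */

/* Round a byte count up to a whole number of L2 cache lines. */
static size_t round_up_to_line(size_t nbytes)
{
    return (nbytes + L2_LINE_SIZE - 1u) / L2_LINE_SIZE * L2_LINE_SIZE;
}

/* True if the pointer sits on an L2 cache line boundary. */
static int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % L2_LINE_SIZE) == 0u;
}

/* True if a DMA buffer satisfies both conditions from the text:
   line-aligned start and a size that is a whole number of lines. */
static int dma_buffer_ok(const void *p, size_t nbytes)
{
    return is_line_aligned(p) && (nbytes % L2_LINE_SIZE) == 0u;
}
```

A buffer declared with a size of `round_up_to_line(n)` bytes meets the size condition; the start alignment has to come from the toolchain, for instance an alignment pragma such as `DATA_ALIGN` in TI's compiler, which is mentioned here as an assumption about the build environment rather than something the original notes specify.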
Figure 15 Double Buffering in L2 SRAM
Note: a DMA write to L2 SRAM triggers an L1D snoop-invalidate; a DMA read of L2 SRAM triggers no snoop here, because the data is not cached in L1D.
C621x/C671x and C64x DSPs automatically maintain cache coherence for accesses by the CPU and EDMA to L2 SRAM through a hardware cache coherence protocol based on snoop commands, which keeps L2 SRAM and L1D coherent.
Whenever external memory caching is enabled and the EDMA is used to transfer to/from external memory, it is your responsibility to maintain cache coherence. That means manually keeping external memory coherent with the L2 cache, and the L2 cache coherent with L1D (when the L2 cache is enabled).
Write back and invalidate before the DMA write: CACHE_wbInvL2(InBuffB, BUFSIZE, CACHE_WAIT);
Write back before the DMA read transfer: CACHE_wbL2(OutBuffB, BUFSIZE, CACHE_WAIT);
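To show where these two coherence operations sit in a ping-pong loop, here is a host-runnable sketch. Everything in it is an assumption made for illustration: the `stub_*` functions are hypothetical stand-ins (on the target, the coherence stubs would correspond to the `CACHE_wbInvL2` and `CACHE_wbL2` calls above and the DMA stubs to EDMA transfers), the DMA is simulated synchronously with `memset`/`memcpy`, and the exact ordering in TI's example programs may differ.

```c
#include <string.h>

#define BUFSIZE 128u   /* one L2 cache line; hypothetical block size */

static unsigned char in_buf[2][BUFSIZE];    /* InBuffA / InBuffB */
static unsigned char out_buf[2][BUFSIZE];   /* OutBuffA / OutBuffB */
static unsigned wbinv_calls, wb_calls;      /* count coherence operations */

/* Stand-in for CACHE_wbInvL2(ptr, BUFSIZE, CACHE_WAIT) on the target. */
static void stub_wb_inv(void *p, unsigned n) { (void)p; (void)n; wbinv_calls++; }

/* Stand-in for CACHE_wbL2(ptr, BUFSIZE, CACHE_WAIT) on the target. */
static void stub_wb(void *p, unsigned n) { (void)p; (void)n; wb_calls++; }

/* Simulated DMA write into an input buffer: fills it with the block number. */
static void stub_dma_in(unsigned char *dst, unsigned block)
{
    memset(dst, (int)block, BUFSIZE);
}

/* Simulated DMA read of an output buffer into external memory (sink). */
static void stub_dma_out(const unsigned char *src, unsigned char *sink)
{
    memcpy(sink, src, BUFSIZE);
}

/* Placeholder CPU processing: out = in + 1 for every byte. */
static void process(const unsigned char *in, unsigned char *out)
{
    unsigned i;
    for (i = 0; i < BUFSIZE; i++) {
        out[i] = (unsigned char)(in[i] + 1u);
    }
}

/* Process nblocks blocks, alternating between the two buffer pairs. */
static void run_blocks(unsigned nblocks, unsigned char *sink)
{
    unsigned b;

    if (nblocks == 0u) {
        return;
    }
    stub_wb_inv(in_buf[0], BUFSIZE);        /* before the DMA writes the buffer */
    stub_dma_in(in_buf[0], 0u);             /* prime the first input buffer */

    for (b = 0; b < nblocks; b++) {
        unsigned cur = b & 1u;
        unsigned nxt = cur ^ 1u;

        if (b + 1u < nblocks) {
            stub_wb_inv(in_buf[nxt], BUFSIZE);
            stub_dma_in(in_buf[nxt], b + 1u);   /* next block arrives meanwhile */
        }
        process(in_buf[cur], out_buf[cur]);
        stub_wb(out_buf[cur], BUFSIZE);     /* before the DMA reads the buffer */
        stub_dma_out(out_buf[cur], sink + b * BUFSIZE);
    }
}
```

On real hardware the EDMA runs concurrently with `process`, so the loop would also have to wait for transfer completion before reusing a buffer; that synchronization is deliberately left out of this sketch.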
Table 8 DMA Scenarios With Coherence Operation Required
Memory Access Ordering
The C6000 DSP cores may initiate up to two parallel memory operations per cycle.
Table 9 Program Order for Memory Operations Issued From a Single Execute Packet
Rules: loads take priority over stores; DA1 takes priority over DA2; L2 memory is accessed in the order L1P, L1D, EDMA