Are there any modern CPUs where a cached byte store is actually slower than a word store?

一整个雨季 2020-11-27 23:14

It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency, compared to storing a full register.

2 Answers
  •  攒了一身酷
    2020-11-28 00:08

Cortex-M7 TRM, Cache RAM section of the manual:

    In an error-free system, the major performance impact is the cost of the read-modify-write scheme for non-full stores in the data side. If a store buffer slot does not contain at least a full 32-bit word, it must read the word to be able to compute the check bits. This can occur because software only writes to an area of memory with byte or halfword store instructions. The data can then be written in the RAM. This additional read can have a negative impact on performance because it prevents the slot from being used for another write.


    The buffering and outstanding capabilities of the memory system mask part of the additional read, and it is negligible for most codes. However, ARM recommends that you use as few cacheable STRB and STRH instructions as possible to reduce the performance impact.
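    The store-buffer behavior the TRM describes can be modeled in a few lines. This is a toy sketch, not ARM's implementation: the real slot format is not public, and every name here is made up. The point it illustrates is the TRM's rule that a slot holding less than a full 32-bit word must read the word before it can be drained.

```c
#include <stdint.h>

/* Toy model of one store-buffer slot (all names hypothetical). */
struct slot {
    uint32_t data;   /* pending write data for one aligned word */
    uint8_t  valid;  /* bitmask: which of the 4 byte lanes have been written */
};

/* Merge a byte store into the slot (lane 0..3, little-endian). */
static void slot_write_byte(struct slot *s, unsigned lane, uint8_t b) {
    s->data = (s->data & ~(0xFFu << (lane * 8))) | ((uint32_t)b << (lane * 8));
    s->valid |= (uint8_t)(1u << lane);
}

/* Per the TRM: a slot can be drained to the cache SRAM without the
   extra read only when it holds a full 32-bit word (all lanes valid). */
static int slot_needs_read(const struct slot *s) {
    return s->valid != 0x0F;
}
```

    Note that four byte stores to the same word merge into a full slot and avoid the read; a lone STRB to some other word does not.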

    I have Cortex-M7s, but until the edit below I had not performed a test to demonstrate this.

    What is meant by "read the word" is a read of one storage location in an SRAM that is part of the data cache. It is not a high-level system-memory thing.

    The guts of the cache are built of and around SRAM blocks: the fast SRAM that makes a cache what it is, faster than system memory, fast to return answers back to the processor, and so on. This read-modify-write (RMW) is not a high-level write-policy thing. What they are saying is that if there is a hit and the write policy says to save the write in the cache, then the byte or halfword needs to be written into one of these SRAMs. The width of the data cache's data SRAM with ECC, as shown in this document, is 32+7 bits: 32 bits of data, 7 bits of ECC check bits. You have to keep all 39 bits together for ECC to work; by definition you can't modify only some of the bits, as that would result in an ECC fault.

    Whenever any number of bits in that 32-bit word stored in the data cache's data SRAM has to change (8, 16, or 32 of them), the 7 check bits have to be recomputed and all 39 bits written at once. For an 8- or 16-bit (STRB or STRH) write, the 32 data bits need to be read, the 8 or 16 bits modified with the remaining data bits in that word unchanged, the 7 ECC check bits computed, and the 39 bits written to the SRAM.

    The computation of the check bits is ideally/likely within the same clock cycle that sets up the write, but the read and the write are not in the same clock cycle, so it should take at least two separate cycles to write data that arrived at the cache in one clock cycle. There are tricks to delay the write, which can sometimes also hurt, but usually that moves it to a cycle that would have been unused and makes it free, if you will. Either way, it won't be the same clock cycle as the read.
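    The RMW can be sketched in C. ARM's actual ECC code is not documented, so the check-bit masks below are arbitrary placeholders; what the sketch models is the shape of the problem: every check bit is a function of many data bits, so none of the 7 survives a partial write, and a byte store must read the old word first.

```c
#include <stdint.h>

/* Parity of a 32-bit value (1 if an odd number of set bits). */
static unsigned parity32(uint32_t v) {
    v ^= v >> 16; v ^= v >> 8; v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return v & 1u;
}

/* 7 check bits over 32 data bits. Masks are arbitrary stand-ins,
   NOT ARM's real ECC code. */
static uint8_t check_bits(uint32_t w) {
    static const uint32_t mask[7] = {
        0x56AAAD5Bu, 0x9B33366Du, 0xE3C3C78Eu, 0x03FC07F0u,
        0x03FFF800u, 0xFC000000u, 0xFFFFFFFFu
    };
    uint8_t c = 0;
    for (int i = 0; i < 7; i++)
        c |= (uint8_t)(parity32(w & mask[i]) << i);
    return c;
}

struct ecc_word { uint32_t data; uint8_t check; };  /* the 32+7 bit SRAM entry */

/* Full-word store: all 39 bits written in one shot, no read needed. */
static void word_store(struct ecc_word *w, uint32_t v) {
    w->data  = v;
    w->check = check_bits(v);
}

/* Byte store: READ the word, merge the byte, recompute all 7 check
   bits, WRITE all 39 bits -- the RMW the TRM describes. */
static void byte_store(struct ecc_word *w, unsigned lane, uint8_t b) {
    uint32_t v = w->data;  /* the extra read that occupies the slot */
    v = (v & ~(0xFFu << (lane * 8))) | ((uint32_t)b << (lane * 8));
    w->data  = v;
    w->check = check_bits(v);
}
```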

    They are saying that if you hold your mouth right and manage to get enough of these smaller stores hitting the cache fast enough, they will stall the processor until they can catch up.

    The document also describes the SRAM without ECC as being 32 bits wide, which implies this is also true when you compile the core without ECC support. I don't have access to the signals for this memory interface nor documentation, so I can't say for sure, but if it is implemented as a 32-bit-wide interface without byte-lane controls then you have the same issue: it can only write a whole 32-bit item to this SRAM, not fractions of it, so to change 8 or 16 bits you have to RMW, down in the bowels of the cache.

    The short answer to why not use a narrower memory is the size of the chip. With ECC the overhead balloons as the width shrinks, because there is a limit on how few check bits you can use even as the width gets smaller (roughly 5 check bits for every 8 data bits is proportionally a lot more bits to store than 7 check bits for every 32). The narrower the memory, the more signals you also have to route, and you can't pack the memory as densely. An apartment vs a bunch of individual houses to hold the same number of people: roads and sidewalks to every front door instead of hallways.
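    The TRM's 7 check bits for 32 data bits matches standard SECDED Hamming math, and the same formula shows why narrow ECC memories are expensive. Nothing ARM-specific here, just the textbook bound:

```c
/* Minimum SECDED check bits for k data bits: the smallest r with
   2^r >= k + r + 1 (single-error correction), plus one extra parity
   bit for double-error detection. Standard Hamming-code arithmetic. */
static int secded_bits(int k) {
    int r = 1;
    while ((1 << r) < k + r + 1) r++;
    return r + 1;
}
/* 8 data bits need 5 check bits (~62% overhead);
   32 data bits need 7 (~22%), matching the 32+7 SRAM in the TRM;
   so a byte-wide ECC SRAM burns far more of its area on check bits. */
```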

    And especially with a single-core processor like this, unless you intentionally try (which I will, below), it is unlikely you will hit this accidentally, so why drive the cost of the product up for an it-probably-won't-happen?

    Note even with a multi-core processor you will see the memories built like this.

    EDIT.

    Okay got around to a test.

    0800007c :
     800007c:   b430        push    {r4, r5}
     800007e:   6814        ldr r4, [r2, #0]
    
    08000080 :
     8000080:   6803        ldr r3, [r0, #0]
     8000082:   6803        ldr r3, [r0, #0]
     8000084:   6803        ldr r3, [r0, #0]
     8000086:   6803        ldr r3, [r0, #0]
     8000088:   6803        ldr r3, [r0, #0]
     800008a:   6803        ldr r3, [r0, #0]
     800008c:   6803        ldr r3, [r0, #0]
     800008e:   6803        ldr r3, [r0, #0]
     8000090:   6803        ldr r3, [r0, #0]
     8000092:   6803        ldr r3, [r0, #0]
     8000094:   6803        ldr r3, [r0, #0]
     8000096:   6803        ldr r3, [r0, #0]
     8000098:   6803        ldr r3, [r0, #0]
     800009a:   6803        ldr r3, [r0, #0]
     800009c:   6803        ldr r3, [r0, #0]
     800009e:   6803        ldr r3, [r0, #0]
     80000a0:   3901        subs    r1, #1
     80000a2:   d1ed        bne.n   8000080 
     80000a4:   6815        ldr r5, [r2, #0]
     80000a6:   1b60        subs    r0, r4, r5
     80000a8:   bc30        pop {r4, r5}
     80000aa:   4770        bx  lr
    

    There are load word (ldr), load byte (ldrb), store word (str), and store byte (strb) versions of this loop; each is aligned on at least a 16-byte boundary as far as the top-of-loop address goes.

    With the icache and dcache enabled:

        ra=lwtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
        ra=lwtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
        ra=lbtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
        ra=lbtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
        ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
        ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
        ra=sbtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
        ra=sbtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
    
    
    0001000B                                                                        
    00010007                                                                        
    0001000B                                                                        
    00010007                                                                        
    0001000C                                                                        
    00010007                                                                        
    0002FFFD                                                                        
    0002FFFD  
    

    The loads are on par with each other, as expected. The stores, though, when you bunch them up like this: a byte write takes 3 times as long as a word write.
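    A quick sanity check on those counts (back-of-envelope arithmetic on the printed numbers only; it assumes the systick counter ticks at the core clock, and the helper name here is made up):

```c
#include <stdint.h>

/* counts per store = timer delta / number of stores executed
   (hypothetical helper, just arithmetic on the results above) */
static double counts_per_store(uint32_t delta, uint32_t stores) {
    return (double)delta / (double)stores;
}

/* 16 stores per loop pass, 0x1000 passes = 65536 stores:
   word: counts_per_store(0x00010007, 65536) is ~1.0 count per store
   byte: counts_per_store(0x0002FFFD, 65536) is ~3.0 counts per store */
```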

    But if you don't hit the cache that hard:

    0800019c :
     800019c:   b430        push    {r4, r5}
     800019e:   6814        ldr r4, [r2, #0]
    
    080001a0 :
     80001a0:   7003        strb    r3, [r0, #0]
     80001a2:   46c0        nop         ; (mov r8, r8)
     80001a4:   46c0        nop         ; (mov r8, r8)
     80001a6:   46c0        nop         ; (mov r8, r8)
     80001a8:   7003        strb    r3, [r0, #0]
     80001aa:   46c0        nop         ; (mov r8, r8)
     80001ac:   46c0        nop         ; (mov r8, r8)
     80001ae:   46c0        nop         ; (mov r8, r8)
     80001b0:   7003        strb    r3, [r0, #0]
     80001b2:   46c0        nop         ; (mov r8, r8)
     80001b4:   46c0        nop         ; (mov r8, r8)
     80001b6:   46c0        nop         ; (mov r8, r8)
     80001b8:   7003        strb    r3, [r0, #0]
     80001ba:   46c0        nop         ; (mov r8, r8)
     80001bc:   46c0        nop         ; (mov r8, r8)
     80001be:   46c0        nop         ; (mov r8, r8)
     80001c0:   3901        subs    r1, #1
     80001c2:   d1ed        bne.n   80001a0 
     80001c4:   6815        ldr r5, [r2, #0]
     80001c6:   1b60        subs    r0, r4, r5
     80001c8:   bc30        pop {r4, r5}
     80001ca:   4770        bx  lr
    

    then the word and byte stores take the same amount of time:

        ra=nwtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
        ra=nwtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
        ra=nbtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
        ra=nbtest(0x20002000,0x1000,STK_CVR);  hexstring(ra&0x00FFFFFF);
    
    0000C00B                                                                        
    0000C007                                                                        
    0000C00B                                                                        
    0000C007
    

    Back-to-back, the byte stores still take about 4 times as long as these spaced-out ones, all other factors held constant; the challenge was to have the bytes take more than 4 times as long.
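    Putting the two experiments side by side, using only the counts printed above (hypothetical helper name, plain arithmetic):

```c
#include <stdint.h>

/* ratio of two timer deltas (hypothetical helper) */
static double delta_ratio(uint32_t a, uint32_t b) {
    return (double)a / (double)b;
}

/* packed, stores back-to-back:
     delta_ratio(0x0002FFFD, 0x00010007) -> ~3.0, bytes 3x slower
   spaced, one store per four instruction slots:
     delta_ratio(0x0000C007, 0x0000C007) -> 1.0, the RMW penalty hidden */
```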

    So, as I was describing before this question: you will see the SRAMs being an optimal width in the cache, as well as in other places, and byte writes are going to suffer a read-modify-write. Whether or not that is visible, due to other overhead or optimizations, is another story. ARM clearly stated it may be visible, and I feel that I have demonstrated it.

    This is not a negative on ARM's design in any way; if anything, the other way around. RISC moves overhead to the instruction/execution side in general: it takes more instructions to do the same task. Efficiencies in the design allow things like this to become visible. There are whole books written on how to make your x86 go faster: don't do 8-bit operations for this or that, other instructions are preferred, and so on. Which means you should be able to write a benchmark to demonstrate those performance hits, just like this one. Even if you are computing each byte of a string as you move it to memory, this should be hidden; you need to write code like the above to see it. And if you were going to do something like this, you might consider burning the instructions to combine the bytes into a word before doing the write. May or may not be faster... depends.
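    The "combine the bytes into a word before doing the write" idea looks like this in C. A minimal sketch: the function name is made up, and whether it wins depends on the surrounding code, as noted above.

```c
#include <stdint.h>
#include <string.h>

/* Assemble four bytes in a register and issue one full-word store,
   so the cache SRAM never needs the read half of a read-modify-write.
   Little-endian byte order, as on Cortex-M7 by default. */
static void store4(uint8_t *dst, uint8_t b0, uint8_t b1,
                   uint8_t b2, uint8_t b3) {
    uint32_t w = (uint32_t)b0        | ((uint32_t)b1 << 8)
               | ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
    memcpy(dst, &w, sizeof w);  /* compiles to a single str when dst is word-aligned */
}
```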

    Halfword (strh) stores are no surprise: they suffer the same read-modify-write, as the RAM is 32 bits wide (plus any ECC bits, if any):

    0001000C   str                                                                      
    00010007   str                                                                      
    0002FFFD   strh                                                                     
    0002FFFD   strh                                                                     
    0002FFFD   strb                                                                     
    0002FFFD   strb
    

    The loads take the same amount of time because the SRAM width is read as a whole and put on the bus; the processor extracts the byte lanes of interest from that, so there is no time/clock cost in doing it.
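    In other words, a byte load is just the word read plus a shift and mask inside the core, which is free. A trivial sketch of that lane extraction (function name hypothetical):

```c
#include <stdint.h>

/* The SRAM returns the full 32-bit word; picking out one byte lane
   is a shift and mask in the core, which is why ldrb costs the same
   as ldr in the tests above. Little-endian lane numbering. */
static uint8_t extract_lane(uint32_t word, unsigned lane) {
    return (uint8_t)(word >> (lane * 8));
}
```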
