Linux slab allocator and cache performance


Question


From the book Understanding the Linux Kernel, 3rd Edition, chapter 8.2.10, "Slab coloring":

We know from Chapter 2 that the same hardware cache line maps many different blocks of RAM. In this chapter, we have also seen that objects of the same size end up being stored at the same offset within a cache. Objects that have the same offset within different slabs will, with a relatively high probability, end up mapped in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM locations, while other cache lines go underutilized. The slab allocator tries to reduce this unpleasant cache behavior by a policy called slab coloring: different arbitrary values called colors are assigned to the slabs.

(1) I am unable to understand the issue that slab coloring tries to solve. When a normal process accesses data that is not in the cache, a cache miss occurs and the data is fetched into the cache along with data from the addresses surrounding the one being accessed, to boost performance. How can a situation occur in which the same specific cache lines keep getting swapped? The probability that a process keeps accessing two different addresses at the same offset within two different memory areas is very low. And even if it does happen, cache replacement policies usually choose which lines to evict according to some scheme such as LRU or random. No policy exists that evicts lines based on a match in the least-significant bits of the addresses being accessed.

(2) I am unable to understand how slab coloring, which moves free bytes from the end of the slab to the beginning so that different slabs place their first object at a different offset, solves the cache-swapping issue.

[SOLVED] After a small investigation, I believe I found an answer to my question; it has been posted below.


Answer 1:


I think I got it: the answer is related to associativity.

A cache is divided into sets, and a given memory block can be cached in only one particular set. For example, with eight sets, set 0 holds memory blocks whose block address is a multiple of 8, set 1 holds blocks whose block address leaves a remainder of 1 when divided by 8, and so on. The reason for this is to boost cache performance: it avoids searching the whole cache for every address, since only one set of the cache needs to be searched.

Now, from the link Understanding CPU Caching and Performance:

From page 377 of Hennessy and Patterson, the cache placement formula is as follows: (Block address) MOD (Number of sets in cache)

Let's take memory block address 0x10000008 (from slab X with colour C) and memory block address 0x20000048 (from slab Y with colour Z, one cache line further in). For most values of N (the number of sets in the cache), <block address> MOD <N> yields different values for the two, hence different sets cache the data. If the addresses sat at the same offset (for example 0x10000008 and 0x20000008), then for most values of N the calculation would yield the same value, so the blocks would collide in the same cache set. Note that the formula operates on block addresses (the address divided by the cache line size), which is why a colour offset must span at least one cache line (exactly what the kernel enforces with colour_off) to move an object into a different set.

So, by keeping a different offset (colour) for the objects in different slabs, objects from different slabs will tend to land in different cache sets instead of colliding in the same one, and overall cache performance increases.
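To make the modulo placement concrete, here is a minimal standalone C sketch (the 64-byte line size and 64-set count are assumed example values, not taken from any particular CPU) that computes the set index for the addresses above:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed example geometry: 64-byte cache lines, 64 sets. */
    #define LINE_SIZE 64
    #define NUM_SETS  64

    /* Placement formula: (block address) MOD (number of sets in cache). */
    static unsigned set_index(uintptr_t addr)
    {
        uintptr_t block = addr / LINE_SIZE;   /* block address */
        return (unsigned)(block % NUM_SETS);  /* which set caches it */
    }

    int main(void)
    {
        /* Same offset in two slabs -> same set: they collide. */
        printf("0x10000008 -> set %u\n", set_index(0x10000008));
        printf("0x20000008 -> set %u\n", set_index(0x20000008));
        /* One cache line of colour offset -> a different set. */
        printf("0x20000048 -> set %u\n", set_index(0x20000048));
        return 0;
    }

The first two addresses land in set 0 and collide; shifting the second slab's objects by one cache line moves them to set 1.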

EDIT: Furthermore, if the cache is direct-mapped, then according to the Wikipedia article on CPU caches there is no cache replacement policy at all, and the modulo calculation alone determines the cache block in which a memory block will be stored:

Direct-mapped cache: In this cache organization, each location in main memory can go in only one entry in the cache. Therefore, a direct-mapped cache can also be called a "one-way set associative" cache. It does not have a replacement policy as such, since there is no choice of which cache entry's contents to evict. This means that if two locations map to the same entry, they may continually knock each other out. Although simpler, a direct-mapped cache needs to be much larger than an associative one to give comparable performance, and it is more unpredictable. Let x be the block number in the cache, y be the block number of memory, and n be the number of blocks in the cache; then mapping is done with the help of the equation x = y mod n.
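As an illustration of the "knock each other out" behaviour, here is a toy direct-mapped cache simulation in C (a sketch with an assumed 8-entry, 64-byte-line cache; not a model of any real hardware):

    #include <stdio.h>
    #include <stdint.h>

    /* Toy direct-mapped cache: 8 entries of 64-byte lines (assumed sizes). */
    #define NBLOCKS 8
    #define LINE    64

    static uintptr_t entry[NBLOCKS];  /* memory block number held by each entry */

    /* Returns 1 on hit, 0 on miss; x = y mod n picks the entry. */
    static int cache_access(uintptr_t addr)
    {
        uintptr_t y = addr / LINE;    /* memory block number */
        unsigned  x = y % NBLOCKS;    /* cache block number  */
        if (entry[x] == y)
            return 1;                 /* hit */
        entry[x] = y;                 /* miss: evict whatever was there */
        return 0;
    }

    int main(void)
    {
        /* Two addresses exactly NBLOCKS * LINE bytes apart map to the same
         * entry, so alternating accesses miss every single time. */
        for (int i = 0; i < 4; i++) {
            printf("A: %s  ", cache_access(0x1000) ? "hit " : "miss");
            printf("B: %s\n", cache_access(0x1000 + NBLOCKS * LINE) ? "hit" : "miss");
        }
        return 0;
    }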




Answer 2:


After much study and thought, I have arrived at an explanation that seems more reasonable, and not only in terms of specific address examples. First, you should learn the basics: caches, tags, sets, and line allocation.

From the Linux kernel code it is clear that colour_off's unit is cache_line_size(). colour_off is the basic offset unit, and colour is the number of available colours; both live in struct kmem_cache.

int __kmem_cache_create(struct kmem_cache *cachep, unsigned long flags)
{
    ...
    cachep->align = ralign;
    cachep->colour_off = cache_line_size();  /* colour_off's unit is cache_line_size */
    /* Offset must be a multiple of the alignment. */
    if (cachep->colour_off < cachep->align)
        cachep->colour_off = cachep->align;
    ...
    err = setup_cpu_cache(cachep, gfp);
    ...
}

https://elixir.bootlin.com/linux/v4.6/source/mm/slab.c#L2056
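For completeness, here is how the allocator turns a colour into a byte offset for each new slab; this is a simplified paraphrase of __kmem_cache_create() and cache_grow() in mm/slab.c (v4.6), not verbatim kernel code:

    /* At cache creation: left_over is the slack left in a slab after the
     * objects are packed in; it limits how many colours are usable. */
    cachep->colour = left_over / cachep->colour_off;

    /* At slab allocation (cf. cache_grow()): pick this slab's colour and
     * advance to the next one, wrapping around. */
    offset = n->colour_next;
    n->colour_next++;
    if (n->colour_next >= cachep->colour)
        n->colour_next = 0;
    offset *= cachep->colour_off;   /* colour number -> byte offset */
    /* The first object in the new slab is placed 'offset' bytes in. */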

So we can analyse it in two cases. The first is cache size > slab size. Slab 1, slab 2, slab 3, ... mostly have no chance of colliding with one another, because the cache is big enough; the exception is a pair like slab 1 vs slab 5, which wrap around to the same part of the cache and so can collide. In this case, then, the benefit of the colouring mechanism is not so clear. We leave the slab 1 vs slab 5 pair unexplained for now; I am sure you will work it out after reading the following.

The second is slab size > cache size. (In the original answer's diagram, each blank line stands for one colour_off, i.e. one cache line, and ticks mark the non-colliding lines.) Clearly, slab 1 and slab 2 have no chance of colliding on the ticked lines, and the same holds for slab 2 and slab 3. The colouring mechanism thus de-conflicts two lines between each pair of adjacent slabs, and even more between slab 1 and slab 3: 2 + 2 = 4 lines, as you can count.
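To see the de-conflicting concretely, here is a small standalone C sketch (assumed geometry: 4 KB slabs, a direct-mapped cache of 64 sets of 64-byte lines, colour_off of one line; not kernel code) that prints which set caches each slab's first object, with and without colouring:

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed geometry: 64 sets x 64-byte lines = 4 KB of cache,
     * slabs one page (4 KB) apart, colour_off = one cache line. */
    #define LINE      64
    #define NSETS     64
    #define SLAB_SIZE 4096

    static unsigned set_of(uintptr_t addr)
    {
        return (unsigned)((addr / LINE) % NSETS);
    }

    int main(void)
    {
        for (unsigned slab = 0; slab < 4; slab++) {
            uintptr_t base = 0x100000 + (uintptr_t)slab * SLAB_SIZE;
            /* colour = slab number, byte offset = colour * LINE */
            printf("slab %u: first object -> set %2u plain, set %2u coloured\n",
                   slab, set_of(base), set_of(base + slab * LINE));
        }
        return 0;
    }

Without colouring, every slab's first line fights over set 0; with one-line colour steps, the first lines of adjacent slabs spread across different sets, which is the ticked-lines effect described above.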

To summarize, the colouring mechanism improves cache performance by putting otherwise-wasted memory to use. Strictly speaking, it only de-conflicts a few colour_off-sized lines at the beginning and end of each slab; the other lines can still collide.




Answer 3:


Say you have a 256 KB direct-mapped cache that uses a super-simple algorithm: cache line = (real address AND 0x3FFFF), since 256 KB is 0x40000 bytes.

Now, if you have slabs starting on each megabyte boundary, then item 20 in slab 1 will kick item 20 of slab 2 out of the cache, because a megabyte is a multiple of the cache size and so they use the same cache line tag.

By offsetting the slabs, it becomes less likely that different slabs will share the same cache line tag. If slab 1 and slab 2 both hold 32-byte objects and slab 2 is offset by 8 bytes, its cache tags will never be exactly equal to slab 1's.
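Plugging numbers into this simplified model (the 1 MB slab bases and item number 20 are just the example values from above) confirms that the 8-byte shift keeps the tags distinct:

    #include <stdio.h>
    #include <stdint.h>

    /* Super-simple model from this answer: a 256 KB direct-mapped cache,
     * "cache line tag" = (real address AND 0x3FFFF). */
    #define CACHE_MASK 0x3FFFFu

    int main(void)
    {
        unsigned  item  = 20;              /* 32-byte objects */
        uintptr_t slab1 = 0x100000;        /* 1 MB boundary */
        uintptr_t slab2 = 0x200000;        /* 2 MB boundary */

        uintptr_t a = (slab1 + 32 * item) & CACHE_MASK;      /* item 20, slab 1 */
        uintptr_t b = (slab2 + 32 * item) & CACHE_MASK;      /* item 20, slab 2 */
        uintptr_t c = (slab2 + 8 + 32 * item) & CACHE_MASK;  /* slab 2, 8-byte colour */

        printf("no offset:     %#lx vs %#lx -> %s\n", (unsigned long)a,
               (unsigned long)b, a == b ? "collide" : "distinct");
        printf("8-byte offset: %#lx vs %#lx -> %s\n", (unsigned long)a,
               (unsigned long)c, a == c ? "collide" : "distinct");
        return 0;
    }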

I'm sure I have some details wrong, but take it for what it's worth.



Source: https://stackoverflow.com/questions/46731933/linux-slab-allocator-and-cache-performance
