Is accessing statically or dynamically allocated memory faster?

无人及你 2021-01-07 23:27

There are two ways of allocating a global array in C:

  1. statically

    char data[65536];
    
  2. dynamically

    char *data = malloc(65536);
    
    
            
5 Answers
  •  温柔的废话
    2021-01-07 23:58

    One should always benchmark to be sure. But, ignoring the effects of cache for the moment, the efficiency can depend on how sporadically you access the two. In what follows, consider char data_s[65536] and char *data_p = malloc(65536).
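    As a starting point, a crude timing harness along these lines can compare the two (a sketch only: the constants, the clock()-based timing, and the final printf that keeps the stores observable are my additions, and tick counts will vary by machine and optimization level):

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    65536
    #define REPS 10000

    static char data_s[N];   /* statically allocated */
    static char *data_p;     /* dynamically allocated */

    int main(void)
    {
        data_p = malloc(N);

        /* time direct stores into the static array */
        clock_t t0 = clock();
        for (int r = 0; r < REPS; ++r)
            for (int i = 0; i < N; ++i)
                data_s[i] = (char) r;
        clock_t t1 = clock();

        /* time stores through the global pointer */
        for (int r = 0; r < REPS; ++r)
            for (int i = 0; i < N; ++i)
                data_p[i] = (char) r;
        clock_t t2 = clock();

        /* print a byte of each so the stores are not optimized away */
        printf("static: %ld ticks, dynamic: %ld ticks (check: %d %d)\n",
               (long) (t1 - t0), (long) (t2 - t1),
               data_s[0], data_p[0]);
        free(data_p);
        return 0;
    }
    ```
    
    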


    If the access is sporadic the static/global will be slightly faster:

    // slower because we must fetch data_p and then store
    void
    datasetp(int idx,char val)
    {
    
        data_p[idx] = val;
    }
    
    // faster because we can store directly
    void
    datasets(int idx,char val)
    {
    
        data_s[idx] = val;
    }
    

    Now, if we consider caching, datasetp and datasets will be about the same [after the first access], because the fetch of data_p will be fulfilled from cache [no guarantee, but likely], so the time difference is much less.


    However, when accessing the data in a tight loop, they will be about the same, because the compiler [optimizer] will prefetch data_p at the start of the loop and put it in a register:

    void
    datasetalls(char val)
    {
        int idx;
    
        for (idx = 0;  idx < 65536;  ++idx)
            data_s[idx] = val;
    }
    
    void
    datasetallp(char val)
    {
        int idx;
    
        for (idx = 0;  idx < 65536;  ++idx)
            data_p[idx] = val;
    }
    
    void
    datasetallp_optimized(char val)
    {
        int idx;
        register char *reg;
    
        // the optimizer will generate the equivalent code to this
        reg = data_p;
    
        for (idx = 0;  idx < 65536;  ++idx)
            reg[idx] = val;
    }
    

    If the access is so sporadic that data_p gets evicted from the cache, then, the performance difference doesn't matter so much, because access to [either] array is infrequent. Thus, not a target for code tuning.

    If such eviction occurs, the actual data array will, most likely, be evicted as well.

    A much larger array might have more of an effect (e.g. if instead of 65536 we had 100000000, then mere traversal will evict data_p, and by the time we reached the end of the array, the leftmost entries would already have been evicted).

    But, in that case, the extra fetch of data_p would be about 0.000001% overhead (one extra pointer fetch against 100000000 element accesses).

    So, it helps to either benchmark [or model] the particular use case/access pattern.


    UPDATE:

    Based on some further experimentation [triggered by a comment by Peter], the datasetallp function does not optimize to the equivalent of datasetallp_optimized for certain conditions, due to strict aliasing considerations.

    Because the arrays are char [or unsigned char], the compiler generates a data_p fetch on each loop iteration: a store through a char * may legally alias anything, including the data_p pointer object itself. Note that if the arrays are not char (e.g. int), the optimization does occur and data_p is fetched only once, because an int store cannot legally modify a char * object.

    If we change char *data_p to char *restrict data_p we get the optimized behavior. Adding restrict tells the compiler that data_p will not alias anything [even itself], so it's safe to optimize the fetch.
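    A minimal sketch of that change (same 65536-byte setup as above; the main harness is my addition):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* restrict promises the compiler that the bytes reached through
     * data_p are reached only through data_p, so the stores below
     * cannot modify the pointer itself and its fetch can be hoisted
     * out of the loop */
    static char *restrict data_p;

    void datasetallp(char val)
    {
        for (int idx = 0; idx < 65536; ++idx)
            data_p[idx] = val;   /* data_p loaded once, kept in a register */
    }

    int main(void)
    {
        data_p = malloc(65536);
        datasetallp(37);
        assert(data_p[0] == 37 && data_p[65535] == 37);
        free(data_p);
        return 0;
    }
    ```
    
    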


    Personal note: While I understand this, to me, it seems goofy that without restrict, the compiler must assume that data_p can alias back to itself. Although I'm sure there are other [equally contrived] examples, the only ones I can think of are data_p pointing to itself or that data_p is part of a struct:

    // simplest
    char *data_p = malloc(65536);
    data_p = (void *) &data_p;
    
    // closer to real world
    struct mystruct {
        ...
        char *data_p;
        ...
    };
    struct mystruct mystruct;
    mystruct.data_p = (void *) &mystruct;
    

    These would be cases where the fetch optimization would be wrong. But, IMO, these are distinguishable from the simple case we've been dealing with. At least, the struct version. And, if a programmer should do the first one, IMO, they get what they deserve [and the compiler should allow fetch optimization].


    For myself, I always hand code the equivalent of datasetallp_optimized [sans register], so I usually don't see the multifetch "problem" [if you will] too much. I've always believed in "giving the compiler a helpful hint" as to my intent, so I just do this axiomatically. It tells the compiler and another programmer that the intent is "fetch data_p only once".


    Also, the multifetch problem does not occur when using data_p for input [because we're not modifying anything, aliasing is not a consideration]:

    // does fetch of data_p once at loop start
    int
    datasumallp(void)
    {
        int idx;
        int sum;
    
        sum = 0;
        for (idx = 0;  idx < 65536;  ++idx)
            sum += data_p[idx];
    
        return sum;
    }
    

    But, while it can be fairly common, "hardwiring" a primitive array manipulation function with an explicit array [either data_s or data_p] is often less useful than passing the array address as an argument.

    Side note: clang would optimize some of the functions using data_s into memset calls, so, during experimentation, I modified the example code slightly to prevent this.

    // array_t is a typedef for the element type (e.g. typedef char array_t;)
    void
    dataincallx(array_t *data,int val)
    {
        int idx;
    
        for (idx = 0;  idx < 65536;  ++idx)
            data[idx] = val + idx;
    }
    

    This does not suffer from the multifetch problem. That is, dataincallx(data_s,17) and dataincallx(data_p,37) work about the same [with the initial extra data_p fetch]. This is more likely to be what one might use in general [better code reuse, etc].


    So, the distinction between data_s and data_p becomes a bit more of a moot point. Coupled with judicious use of restrict [or using types other than char], the data_p fetch overhead can be minimized to the point where it isn't really that much of a consideration.

    It now comes down more to architectural/design choices of choosing a fixed size array or dynamically allocating one. Others have pointed out the tradeoffs.

    This is use case dependent.

    If we had a limited number of array functions, but a large number of different arrays, passing the array address to the functions is a clear winner.

    However, if we had a large number of array manipulation functions and [say] one array (e.g. the [2D] array is a game board or grid), it might be better that each function references the global [either data_s or data_p] directly.
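    A sketch of that second style (the names board, board_clear, board_set, and board_get are my own, chosen for illustration): many small manipulation functions all reference one global grid directly, rather than each taking the grid's address as a parameter:

    ```c
    #include <assert.h>

    #define ROWS 8
    #define COLS 8

    /* the single global game board; every manipulation function
     * references it directly instead of taking a pointer argument */
    static int board[ROWS][COLS];

    void board_clear(void)
    {
        for (int r = 0; r < ROWS; ++r)
            for (int c = 0; c < COLS; ++c)
                board[r][c] = 0;
    }

    void board_set(int r, int c, int val)
    {
        board[r][c] = val;
    }

    int board_get(int r, int c)
    {
        return board[r][c];
    }

    int main(void)
    {
        board_clear();
        board_set(3, 4, 7);
        assert(board_get(3, 4) == 7);
        assert(board_get(0, 0) == 0);
        return 0;
    }
    ```

    With only one board in the program, this trades some reuse for shorter call sites and, per the discussion above, direct (static-array) addressing in every function.
    
    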
