Replacing memcpy with neon intrinsics

北战南征 提交于 2021-02-20 04:26:28

问题


I am trying to beat the "memcpy" function by writing the neon intrinsics for the same . Below is my logic :

uint8_t* m_input;  //Size as 400 x300
uint8_t* m_output; //Size as 400 x300
//not mentioning the complete code base for memory creat 

memcpy(m_output, m_input, sizeof(m_output[0]) * 300* 400);

Neon:

int32_t ht_index,wd_index;
uint8x16_t vector8x16_image;

for(int32_t htI =0;htI < m_roiHeight;htI++){
    ht_index = htI * m_roiWidth ;

    for(int32_t wdI = 0;wdI < m_roiWidth;wdI+=16){
        wd_index = ht_index + wdI;
        vector8x16_image = vld1q_u8(m_input);

        vst1q_u8(&m_output[wd_index],vector8x16_image);
    }
}

I verified multiple times these result on imx6 hardware.

Results:

Memcpy :0.039 milisec
neon memcpy: 0.02841 milisec

I READ SOMEWHERE THAT WITHOUT PRELOADED INSTRUCTIONS WE CAN NOT BEAT MEMCPY.

If it is true then how my code is giving these results . Is it right or wrong


回答1:


If correctly written, a non-NEON memcpy() should be able to saturate the L3 bandwidth on your device, but for smaller transfers (fitting entirely within L1 or L2 cache) things can be different. Your test probably fits within L2 cache.

Unfortunately memcpy has to work for any sized call, so it can't reasonably optimise for in-cache and out-of-cache cases at the same time as optimising for very short copies where the cost of detecting what kind of optimisation would be best turns out to be the dominant factor.

Even so, it's possible that your test isn't fair. You have to be sure that both implementations aren't subject to different cache preconditions or different virtual page layout.

Make sure neither test is run entirely before the other. Test some of one implementation, then test some of the other, then back to the first and back to the second a few times, to make sure they're not subject to any warm-up conditions. And use the same buffers for both to ensure that there's no characteristic of different parts of your virtual address space that harms one implementation only.

Also, there are cases your memcpy doesn't handle, but these shouldn't matter much for large transfers.



来源:https://stackoverflow.com/questions/30107385/replacing-memcpy-with-neon-intrinsics

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!