My image processing project works with grayscale images. My platform is an ARM Cortex-A8 processor, and I want to make use of NEON.
I have a grayscale image( consider
Load the 4 bytes using a single-lane load instruction (vld1), then use two move-long instructions (vmovl) to promote them first to 16 and then to 32 bits. The result should be something like this (in GNU syntax):
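vld1.32 {d0[0]}, [<address>] @ Now d0 = (*<address>, *<address+1>, *<address+2>, *<address+3>, <junk>, ... <junk>)
vmovl.u8 q0, d0 @ Now q0 = (d0, d1) = ((uint16_t)*<address>, ... (uint16_t)*<address+3>, <junk>, ... <junk>)
vmovl.u16 q0, d0 @ Now d0 = ((uint32_t)*<address>, (uint32_t)*<address+1>), d1 = ((uint32_t)*<address+2>, (uint32_t)*<address+3>)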
If you can guarantee that <address> is 4-byte aligned, write [<address>:32] instead of [<address>] in the load instruction to save a cycle or two. If you do that and the address isn't aligned, however, you'll get a fault.
Um, I just realized you want to use intrinsics, not assembly, so here's the same thing with intrinsics.
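uint32x2_t v8 = vdup_n_u32(0); // Lane 0 will actually hold 4 uint8_t
v8 = vld1_lane_u32(ptr, v8, 0); // ptr is a const uint32_t * pointing at the 4 bytes
const uint16x4_t v16 = vget_low_u16(vmovl_u8(vreinterpret_u8_u32(v8))); // 8 -> 16 bit, keep the low 4 lanes
const uint32x4_t v32 = vmovl_u16(v16); // 16 -> 32 bit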
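If it helps, here is a minimal sketch of the same steps wrapped in a function. The helper name widen_4_u8_to_u32 and the HAVE_NEON macro are made up for illustration, not part of any library. It assumes a little-endian target (the usual setup on a Cortex-A8), copies the 4 bytes into a local uint32_t first so the source pointer needs no particular alignment, and falls back to a plain loop when NEON isn't available so you can compare the results.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, only used in this sketch */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: widen 4 consecutive 8-bit pixels at src
 * into 4 unsigned 32-bit values stored at dst. */
static void widen_4_u8_to_u32(const uint8_t *src, uint32_t dst[4])
{
    assert(src != NULL && dst != NULL);
#if HAVE_NEON
    uint32_t packed;
    memcpy(&packed, src, sizeof packed);   /* alignment-safe 4-byte read */

    /* Put the 4 bytes into lane 0; lane 1 is zero and is never used. */
    uint32x2_t v8 = vld1_lane_u32(&packed, vdup_n_u32(0), 0);

    /* 8 -> 16 bit: widen all 8 bytes, keep the 4 that came from lane 0. */
    uint16x4_t v16 = vget_low_u16(vmovl_u8(vreinterpret_u8_u32(v8)));

    /* 16 -> 32 bit. */
    uint32x4_t v32 = vmovl_u16(v16);

    vst1q_u32(dst, v32);
#else
    /* Scalar fallback with the same result, for builds without NEON. */
    for (int i = 0; i < 4; ++i)
        dst[i] = src[i];
#endif
}
```

Calling it on the bytes 0, 1, 128, 255 should leave the same four values in dst, just as 32-bit integers. In a real loop you would probably keep v32 in a register instead of storing it right away.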