Question
Intel has a specific CRC32 instruction available in the SSE4.2 instruction set. How can I take advantage of this instruction to speed up CRC32 calculations?
Answer 1:
First of all, Intel's CRC32 instruction calculates CRC-32C, which uses a different polynomial than regular CRC32 (see the Wikipedia CRC32 entry).
To use Intel's hardware acceleration for CRC32C with gcc, you can:
- Use inline assembly in C code via the asm statement
- Use the intrinsics _mm_crc32_u8, _mm_crc32_u16, _mm_crc32_u32 or _mm_crc32_u64. See the Intel Intrinsics Guide for a description of these. The guide describes them for Intel's compiler icc, but gcc also implements them.
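For the inline assembly route, a minimal sketch could look like the following. It assumes gcc (or clang) on x86-64 and a CPU that supports SSE4.2. The names crc32c_asm_u8 and asm_crc32 and the seed parameter are made up for illustration and are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one CRC-32C step over a single byte, issued
 * directly as the crc32b instruction (AT&T operand order: source, dest). */
static inline uint32_t crc32c_asm_u8(uint32_t crc, uint8_t byte)
{
    __asm__("crc32b %1, %0" : "+r"(crc) : "rm"(byte));
    return crc;
}

/* Hypothetical wrapper: feeds a buffer through crc32b byte by byte,
 * starting from the caller-supplied value `seed`. */
uint32_t asm_crc32(uint32_t seed, const uint8_t *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    uint32_t hash = seed;
    for (size_t i = 0; i < len; i++) {
        hash = crc32c_asm_u8(hash, bytes[i]);
    }
    return hash;
}
```

The instruction itself applies no initial or final inversion. The usual CRC-32C convention starts from 0xFFFFFFFF and inverts the result at the end, so the standard check value for "123456789" is obtained as asm_crc32(0xFFFFFFFF, ...) ^ 0xFFFFFFFF.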
This is how you would do it with _mm_crc32_u8, which takes one byte at a time. Using _mm_crc32_u64 would improve performance further, since it takes 8 bytes at a time.
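#include <nmmintrin.h>  /* SSE4.2 intrinsics: _mm_crc32_u8 and friends */
#include <stddef.h>     /* size_t */
#include <stdint.h>     /* uint8_t, uint32_t */

uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
    uint32_t hash = 0;
    size_t i = 0;
    for (i = 0; i < len; i++) {
        hash = _mm_crc32_u8(hash, bytes[i]);
    }
    return hash;
}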
To compile this you need to pass -msse4.2 in CFLAGS, like gcc -g -msse4.2 test.c. Otherwise it will complain about an undefined reference to _mm_crc32_u8.
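The 8-bytes-at-a-time variant mentioned above could be sketched roughly as follows. This assumes an x86-64 target, because _mm_crc32_u64 is only available there. The function name sse42_crc32_u64 and the seed parameter are hypothetical. The target attribute is one way to enable the intrinsics for a single function with gcc or clang; building with -msse4.2 as described above works as well.

```c
#include <assert.h>
#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical variant of sse42_crc32(): consumes 8 bytes per crc32
 * instruction, then finishes the remaining 0..7 bytes one at a time. */
__attribute__((target("sse4.2")))
uint32_t sse42_crc32_u64(uint32_t seed, const uint8_t *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    uint64_t hash = seed;

    while (len >= 8) {
        uint64_t chunk;
        /* memcpy avoids unaligned and aliasing problems; compilers
         * typically turn it into a single load. */
        memcpy(&chunk, bytes, sizeof chunk);
        hash = _mm_crc32_u64(hash, chunk);
        bytes += 8;
        len -= 8;
    }

    /* The upper 32 bits of the 64-bit result are always zero. */
    uint32_t tail = (uint32_t)hash;
    while (len > 0) {
        tail = _mm_crc32_u8(tail, *bytes++);
        len--;
    }
    return tail;
}
```

Calling sse42_crc32_u64(0, bytes, len) should give the same value as the byte-wise sse42_crc32(bytes, len) above, since both feed the same bytes in the same order.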
If you want to fall back to a plain C implementation when the instruction is not available on the platform where the executable is running, you can use GCC's ifunc attribute, like this:
uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
    /* use _mm_crc32_u* here */
}

uint32_t default_crc32(const uint8_t *bytes, size_t len)
{
    /* pure C implementation */
}

/* this will be called at load time to decide which function to really use: */
/* sse42_crc32 if SSE 4.2 is supported */
/* default_crc32 if not */
static void * resolve_crc32(void) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2")) return sse42_crc32;
    return default_crc32;
}

/* crc32() implementation will be resolved at load time to either */
/* sse42_crc32() or default_crc32() */
uint32_t crc32(const uint8_t *bytes, size_t len) __attribute__ ((ifunc ("resolve_crc32")));
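For the "pure C implementation" placeholder, one possible sketch is a plain bit-by-bit CRC-32C. It is slow but needs no special instructions, and it is meant to return the same values as the byte-wise hardware version. The helper crc32c_sw and its seed parameter are hypothetical names, and a table-driven version would be faster in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the CRC-32C (Castagnoli) polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Hypothetical helper: software equivalent of calling _mm_crc32_u8 on
 * each byte in turn, starting from the caller-supplied value `seed`. */
uint32_t crc32c_sw(uint32_t seed, const uint8_t *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    uint32_t crc = seed;
    for (size_t i = 0; i < len; i++) {
        crc ^= bytes[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, otherwise zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & mask);
        }
    }
    return crc;
}

/* Same signature and starting value (0) as sse42_crc32() in this answer. */
uint32_t default_crc32(const uint8_t *bytes, size_t len)
{
    return crc32c_sw(0u, bytes, len);
}
```

Like the hardware instruction, this applies no initial or final inversion. The standard CRC-32C check value for "123456789" is obtained as crc32c_sw(0xFFFFFFFF, ...) ^ 0xFFFFFFFF.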
Answer 2:
See this answer for fast hardware and software implementations of CRC-32C. The hardware implementation effectively runs three crc32 instructions in parallel for speed.
Source: https://stackoverflow.com/questions/31184201/how-to-implement-crc32-taking-advantage-of-intel-specific-instructions