This paper presents a fast and efficient method of computing CRC on IA processors with generic polynomials using the carry-less multiplication instruction – PCLMULQDQ.
Instead of reducing the entire message with traditional reduction algorithms, we use a faster folding approach to reduce an arbitrary length buffer to a small fixed size to be reduced further by traditional methods such as Barrett reduction.
Parallelized folding approach is used to maximize the throughput of PCLMULQDQ instructions. We show how to do this efficiently for data buffers of arbitrary length.
The final reduction part is only slightly different for different sized polynomials (e.g., a 32-bit CRC and a 16-bit CRC). With our novel folding methods, CRC computation using PCLMULQDQ is faster than best software routines that don’t use the instruction, on a range of IA processor cores.
This paper will enable customers to code and optimize any CRC application for maximum performance on Westmere. We use real examples in the paper to illustrate the methods.