Optimizing a Reed-Solomon encoder (polynomial division)

Submitted by 坚强是说给别人听的谎言 on 2019-12-03 10:46:34
DavidW

The following is roughly 3x faster than PyPy on my machine (0.04s vs 0.15s). Using Cython:

ctypedef unsigned char uint8_t # does not work with Microsoft's C Compiler: from libc.stdint cimport uint8_t
cimport cpython.array as array

cdef uint8_t[::1] gf_exp = bytearray([1, 3, 5, 15, 17, 51, 85, 255, 26, 46, 114, 150, 161, 248, 19,
   lots of numbers omitted for space reasons
   ...])

cdef uint8_t[::1] gf_log = bytearray([0, 0, 25, 1, 50, 2, 26, 198, 75, 199, 27, 104, 
    more numbers omitted for space reasons
    ...])

import cython

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
def rsenc(msg_in_r, nsym, gen_t):
    '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''

    cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can't make a memory view from a read only object
    cdef int[::1] gen = array.array('i',gen_t) # convert list to array

    cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
    cdef int j
    cdef uint8_t[::1] lgen = bytearray(gen.shape[0])
    for j in xrange(gen.shape[0]):
        lgen[j] = gf_log[gen[j]]

    cdef uint8_t coef,lcoef

    cdef int i
    for i in xrange(msg_in.shape[0]):
        coef = msg_out[i]
        if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
            lcoef = gf_log[coef] # precaching

            for j in xrange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j]

    # Recopy the original message bytes
    msg_out[:msg_in.shape[0]] = msg_in
    return msg_out

It is just your fastest version with static types added (I kept checking the HTML output of cython -a until the loops were no longer highlighted in yellow).
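For readers who want to check the Cython function against something simpler, here is a minimal pure-Python sketch of the same division loop. The tables are elided above, so the sketch regenerates them under an assumption: generator 3 and reducing polynomial 0x11b, which is what the listed prefix (1, 3, 5, 15, 17, 51, ...) is consistent with. The helper names (build_tables, gf_mul_slow, rs_generator_poly, poly_eval) are hypothetical, and the choice of roots alpha^0 .. alpha^(nsym-1) for the generator polynomial is also an assumption, since the original does not show how gen_t is built.

```python
def gf_mul_slow(a, b, prim=0x11b):
    """Bitwise (shift-and-xor) multiplication in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= prim
        b >>= 1
    return r

def build_tables(generator=3, prim=0x11b):
    """Assumed field parameters; gf_exp is doubled so no modulo 255 is needed."""
    gf_exp = bytearray(512)
    gf_log = bytearray(256)
    x = 1
    for i in range(255):
        gf_exp[i] = x
        gf_log[x] = i
        x = gf_mul_slow(x, generator, prim)
    for i in range(255, 512):
        gf_exp[i] = gf_exp[i - 255]
    return gf_exp, gf_log

def rs_generator_poly(nsym, gf_exp, gf_log):
    """Product of (x + alpha^i) for i in 0..nsym-1, highest degree first."""
    g = [1]
    for i in range(nsym):
        root = gf_exp[i]
        nxt = g + [0]                      # g * x
        for k, c in enumerate(g):          # plus g * root
            if c:
                nxt[k + 1] ^= gf_exp[gf_log[c] + gf_log[root]]
        g = nxt
    return g

def rsenc(msg_in, nsym, gen, gf_exp, gf_log):
    """Same loop as the Cython version, without static types."""
    msg_out = bytearray(msg_in) + bytearray(len(gen) - 1)
    lgen = bytearray(gf_log[g] for g in gen)
    for i in range(len(msg_in)):
        coef = msg_out[i]
        if coef != 0:
            lcoef = gf_log[coef]
            for j in range(1, len(gen)):
                msg_out[i + j] ^= gf_exp[lcoef + lgen[j]]
    msg_out[:len(msg_in)] = msg_in         # restore the message bytes
    return msg_out

def poly_eval(poly, x, gf_exp, gf_log):
    """Horner evaluation at a non-zero x."""
    y = 0
    for c in poly:
        y = (gf_exp[gf_log[y] + gf_log[x]] if y else 0) ^ c
    return y

gf_exp, gf_log = build_tables()
gen = rs_generator_poly(10, gf_exp, gf_log)
codeword = rsenc(b"hello world", 10, gen, gf_exp, gf_log)
# A valid codeword is divisible by the generator, so it vanishes at every root.
syndromes = [poly_eval(codeword, gf_exp[i], gf_exp, gf_log) for i in range(10)]
```

If the sketch is right, every entry of syndromes is zero, which gives a cheap sanity check to run against the output of the compiled version as well.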

A few brief notes:

  • Cython prefers x.shape[0] to len(x)

  • Defining the memoryviews as [::1] promises they are contiguous in memory, which helps

  • initializedcheck(False) is necessary to avoid lots of existence checks on the globally defined gf_exp and gf_log. (You might find you can speed up your basic Python/PyPy code by creating a local variable reference for these and using that instead.)

  • I had to copy a couple of the input arguments. Cython can't make a memoryview from a read-only object (in this case msg_in, a string, although I could probably have just made it a char*). Also, gen (a list) needs to be in something with fast element access.
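The read-only restriction in the last note can be seen in plain Python too. This is only an analogy, not Cython itself: a typed memoryview declared as uint8_t[::1] asks for a writable buffer, and a bytes object cannot provide one. The helper name writable_copy is hypothetical. As far as I know, more recent Cython versions also accept a const uint8_t[::1] declaration for read-only inputs, which would avoid the copy when the buffer is never written to.

```python
def writable_copy(buf):
    """Copy any bytes-like object into a writable buffer, as rsenc does."""
    return memoryview(bytearray(buf))

data = b"hello"                 # bytes objects are immutable
view = memoryview(data)

try:
    view[0] = 0                 # refused: the underlying buffer is read-only
    write_refused = False
except TypeError:
    write_refused = True

writable = writable_copy(data)
writable[0] ^= 0xFF             # fine: the bytearray copy is writable
```

The copy costs one pass over the message, which is small next to the nested loops of the encoder.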

Other than that it's all fairly straightforward. (Having got it this fast, I haven't tried any further variations.) I'm really quite impressed at how well PyPy does.

Building on DavidW's answer, here's the implementation I am currently using, which is about 20% faster by using nogil and parallel computation:

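import cython
from cython.parallel import parallel, prange

# gf_exp_c and gf_log_c are presumably typed memoryviews over the same tables as gf_exp and gf_log above.
# Note: prange only runs in parallel when the module is compiled and linked with OpenMP (e.g. -fopenmp with GCC).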

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
cdef rsenc_cython(msg_in_r, nsym, gen_t):
    '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''

    cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can't make a memory view from a read only object
    #cdef int[::1] gen = array.array('i',gen_t) # convert list to array
    cdef uint8_t[::1] gen = gen_t

    cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
    cdef int i, j
    cdef uint8_t[::1] lgen = bytearray(gen.shape[0])
    for j in xrange(gen.shape[0]):
        lgen[j] = gf_log_c[gen[j]]

    cdef uint8_t coef,lcoef
    with nogil:
        for i in xrange(msg_in.shape[0]):
            coef = msg_out[i]
            if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
                lcoef = gf_log_c[coef] # precaching

                for j in prange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                    msg_out[i + j] ^= gf_exp_c[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j]

    # Recopy the original message bytes
    msg_out[:msg_in.shape[0]] = msg_in
    return msg_out

I would still like it to be faster (in a real implementation, data is encoded at about 6.4 MB/s with n=255, n being the size of the codeword, i.e. the message plus the error-correction symbols).

The main lead to a faster implementation that I have found is to use a LUT (LookUp Table) approach, by precomputing the multiplication and addition arrays. However, in my Python and Cython implementations, the LUT approach is slower than calculating XOR and addition operations.
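To make the comparison concrete, here is a minimal Python sketch of what a multiplication LUT could look like next to the log/exp approach used above. It assumes the same field as the tables at the top of the page (generator 3, reducing polynomial 0x11b, inferred from the listed values); build_tables, gf_mul_logexp and build_mul_table are hypothetical names, not the ones from my implementation.

```python
def build_tables(prim=0x11b):
    """Exponent/log tables for GF(2^8), assuming generator 3."""
    gf_exp = bytearray(512)
    gf_log = bytearray(256)
    x = 1
    for i in range(255):
        gf_exp[i] = x
        gf_log[x] = i
        x ^= x << 1              # multiply by 3, i.e. by (x + 1)
        if x & 0x100:
            x ^= prim
    gf_exp[255:510] = gf_exp[0:255]   # doubled table, no modulo needed
    return gf_exp, gf_log

def gf_mul_logexp(a, b, gf_exp, gf_log):
    """Two log lookups, one addition, one exp lookup, plus a zero test."""
    if a == 0 or b == 0:
        return 0
    return gf_exp[gf_log[a] + gf_log[b]]

def build_mul_table(gf_exp, gf_log):
    """Full product table: 256 rows of 256 bytes, 64 KiB in total."""
    return [bytes(gf_mul_logexp(a, b, gf_exp, gf_log) for b in range(256))
            for a in range(256)]

gf_exp, gf_log = build_tables()
mul = build_mul_table(gf_exp, gf_log)

# With the table, the inner loop of the encoder would become something like:
#     row = mul[coef]
#     msg_out[i + j] ^= row[gen[j]]
product = mul[3][255]
```

The table removes the zero test and the addition, but each multiplication still costs one memory lookup, into a table that is much larger than the two log/exp arrays.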

There are other approaches to implementing a faster RS encoder, but I have neither the ability nor the time to try them out. I will leave them as references for other interested readers:

  • "Fast software implementation of finite field operations", Cheng Huang and Lihao Xu, Washington University in St. Louis, Tech. Rep (2003). link and a correct code implementation here.
  • Luo, Jianqiang, et al. "Efficient software implementations of large finite fields GF(2^n) for secure storage applications." ACM Transactions on Storage (TOS) 8.1 (2012): 2.
  • "A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage.", Plank, J. S. and Luo, J. and Schuman, C. D. and Xu, L., and Wilcox-O'Hearn, Z, FAST. Vol. 9. 2009. link Or also the non extended version: "A Performance Comparison of Open-Source Erasure Coding Libraries for Storage Applications", Plank and Schuman.
  • Sourcecode of the ZFEC library, with multiplication LUT optimization link.
  • "Optimized Arithmetic for Reed-Solomon Encoders", Christof Paar (1997, June). In IEEE International Symposium on Information Theory (pp. 250-250). IEEE. link
  • "A Fast Algorithm for Encoding the (255,223) Reed-Solomon Code Over GF(2^8)", R. L. Miller, T. K. Truong, and I. S. Reed. link
  • "Optimizing Galois Field arithmetic for diverse processor architectures and applications", Kevin M. Greenan, Ethan L. Miller, and Thomas J. E. Schwarz, IEEE International Symposium on Modeling, Analysis and Simulation of Computers and Telecommunication Systems (MASCOTS), 2008. link
  • Anvin, H. Peter. "The mathematics of RAID-6." (2007). link and link
  • Wirehair library, one of the few implementations of Cauchy Reed-Solomon, which is said to be very fast.
  • "A logarithmic Boolean time algorithm for parallel polynomial division", Bini, D. and Pan, V. Y. (1987), Information processing letters, 24(4), 233-237. See also Bini, D., and V. Pan. "Fast parallel algorithms for polynomial division over an arbitrary field of constants." Computers & Mathematics with Applications 12.11 (1986): 1105-1118. link
  • Kung, H.T. "Fast evaluation and interpolation." (1973). link
  • Cao, Zhengjun, and Hanyue Cao. "Note on fast division algorithm for polynomials using Newton iteration." arXiv preprint arXiv:1112.4014 (2011). link
  • "An Introduction to Galois Fields and Reed-Solomon Coding", James Westall and James Martin, 2010. link
  • Mamidi, Suman, et al. "Instruction set extensions for Reed-Solomon encoding and decoding." Application-Specific Systems, Architecture Processors, 2005. ASAP 2005. 16th IEEE International Conference on. IEEE, 2005. link
  • Dumas, Jean-Guillaume, Laurent Fousse, and Bruno Salvy. "Simultaneous modular reduction and Kronecker substitution for small finite fields." Journal of Symbolic Computation 46.7 (2011): 823-840.
  • Greenan, Kevin M., Ethan L. Miller, and Thomas Schwarz. Analysis and construction of galois fields for efficient storage reliability. Vol. 9. Technical Report UCSC-SSRC-07, 2007. link

However, I think the best lead is to use an efficient polynomial modular reduction instead of polynomial division:

  • "Modular Reduction in GF(2^n) without Pre-computational Phase". Knežević, M., et al. Arithmetic of Finite Fields. Springer Berlin Heidelberg, 2008. 77-87.
  • "On computation of polynomial modular reduction". Wu, Huapeng. Technical report, Univ. of Waterloo, The Centre for applied cryptographic research, 2000.
  • "A fast software implementation for arithmetic operations in GF(2^n)". De Win, E., Bosselaers, A., Vandenberghe, S., De Gersem, P., & Vandewalle, J. (1996, January). In Advances in Cryptology, Asiacrypt'96 (pp. 65-76). Springer Berlin Heidelberg. link
  • Barrett reduction

/EDIT: in fact it seems "On computation of polynomial modular reduction" just uses the same approach as I did with the variants rsenc_alt1() and rsenc_alt2() (the main idea being that we precompute the pairs of coefficients we will need, and reduce them all at once), and unfortunately it's not faster (it's actually slower, because the precomputation cannot be done once and for all since it depends on the message input).

/EDIT: I found a library with really interesting optimizations, many of which are not even found in any academic paper (which the author stated he has read, by the way), and which is probably the fastest software implementation of Reed-Solomon: the wirehair project and the related blog for more details. It is worth noting that the author also made a Cauchy-Reed-Solomon codec called longhair with similar optimization tricks.

/FINAL EDIT: it seems the fastest implementation available is based on this paper:

Plank, James S., Kevin M. Greenan, and Ethan L. Miller. "Screaming fast Galois field arithmetic using intel SIMD instructions." FAST. 2013. link

The implementation, in pure Go, is available here and is authored by Klaus Post. It's the fastest implementation I have ever read about, both single-threaded and parallelized (it supports both). It claims over 1 GB/s in a single thread and over 4 GB/s with 8 threads. However, it relies on optimized SIMD instructions and various low-level optimizations on matrix operations (because here the RS codec is matrix-oriented, instead of the polynomial approach I have in my question).
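The core trick of the paper cited above can be sketched in a few lines of scalar Python. To multiply a whole region of bytes by a constant c, each byte is split into its low and high nibble, and the product is the XOR of two lookups in 16-entry tables. In the paper those two tables fit in a 128-bit register, so a byte-shuffle instruction performs 16 lookups at once; the Python below only illustrates the arithmetic, not the speed. The function names are hypothetical, and the field polynomial 0x11b is carried over from the tables earlier on this page; it is not necessarily the one used by the Go library.

```python
def gf_mul(a, b, prim=0x11b):
    """Bitwise (shift-and-xor) multiplication in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= prim
        b >>= 1
    return r

def nibble_tables(c):
    """Two 16-entry tables: c times each low nibble, c times each high nibble."""
    lo = bytes(gf_mul(c, x) for x in range(16))
    hi = bytes(gf_mul(c, x << 4) for x in range(16))
    return lo, hi

def mul_region(c, data):
    """Multiply every byte of data by c using the split tables.

    Works because multiplication distributes over XOR:
    c * b == c * (b & 0x0F) ^ c * (b & 0xF0).
    """
    lo, hi = nibble_tables(c)
    return bytes(lo[b & 0x0F] ^ hi[b >> 4] for b in data)

scaled = mul_region(3, bytes([1, 2, 255]))
```

A matrix-oriented encoder then builds each parity shard as the XOR of several such scaled regions, one per data shard.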

So, if you are an interested reader and want to find the fastest Reed-Solomon codec available, that's the one.

Alternatively, if you know C, I would recommend rewriting this Python function in plain C and calling it (say with CFFI). That way you know you are reaching top performance in the inner loops of your functions without needing to be aware of either PyPy or Cython tricks.

See: http://cffi.readthedocs.org/en/latest/overview.html#performance
