
Galois Field math implementation


I might come back and revise this at a later date, but for now I'll defer to the Wikiversity article for a basic understanding of characteristic 2 Galois Fields and how to construct them, as that was what I used.

Instead I'll be discussing implementation details that differ from the norm and why I did them that way.

Primitive Polynomial

We choose the primitive polynomial with the lowest Hamming weight (number of set bits/terms) and, among those, the lowest degree (not counting the highest term/bit, since that must be set by definition). This isn't out of the ordinary, as most implementations probably do the same, but there is additional reasoning behind the choice that doesn't apply to software lookup-table-based implementations and wasn't mentioned in any examples I saw: it simplifies parallel modular reduction in the case that one actually performs long field multiplication, as opposed to a log lookup, add, and exponent lookup. It's especially efficient in the case that only bits 0 and 1 are set.
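To make the benefit concrete, here is a minimal sketch in C, assuming the $GF(8)$ primitive polynomial $x^3 + x + 1$ (0b1011); the function name is hypothetical. Because only bits 0 and 1 are set below the leading bit, the entire overflow folds back with one shift-XOR pair, and the fold itself can never overflow again.

```c
#include <stdint.h>

// Reduce a raw carry-less product (degree <= 4) to a GF(8) element,
// modulo the primitive polynomial x^3 + x + 1. Hypothetical name.
static uint8_t gf8_reduce(uint8_t prod)
{
    uint8_t ov = prod >> 3;               // overflow bits: degrees 3..4
    // x^3 = x + 1, so each overflow bit contributes (x + 1), shifted up
    return (prod & 0x7) ^ (ov << 1) ^ ov;
}
```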

Primitive Element

Sometimes referred to as $α$ or the generator of the field. We choose 2, which again is standard but not typically discussed. It's the easiest element to construct the log and exponent lookup tables with, and the only one shared by all characteristic 2 fields.
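As an illustration of why $α = 2$ is the easiest choice, here is a table-construction sketch, again assuming $GF(8)$ with primitive polynomial $x^3 + x + 1$; table and function names are hypothetical. Multiplying by 2 is just a left shift followed by a conditional XOR of the primitive polynomial.

```c
#include <stdint.h>

static uint8_t gf8_exp[7];   // gf8_exp[i] = alpha^i
static uint8_t gf8_log[8];   // gf8_log[x] = i such that alpha^i = x, x != 0

// Build the log and exponent tables for GF(8) with alpha = 2.
static void gf8_init_tables(void)
{
    uint8_t x = 1;
    for (int i = 0; i < 7; ++i)   // the 7 nonzero elements of GF(8)
    {
        gf8_exp[i] = x;
        gf8_log[x] = (uint8_t)i;
        x <<= 1;                  // multiply by alpha = 2
        if (x & 0x8)              // degree-3 overflow...
            x ^= 0xB;             // ...cancel it with the polynomial 0b1011
    }
}
```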

Packed Polynomials

The polynomials in my implementation consist of multiple symbols' worth of terms packed end to end such that there are no unused bits between them. The lowest order terms are in the least significant bits, with increasing order terms in increasingly significant bits. In practice this means that if you use printf() to print a gf8_poly value in octal (%o) or a gf16_poly value in hexadecimal (%llX), each digit will correspond to a symbol, with the rightmost being the degree 0 aka constant term. This term order means that indexing an individual term is always (poly >> (term_order * symbol_size)) & symbol_mask. In practice, indices are kept pre-multiplied by the symbol size (i.e. as bit offsets) to avoid frequently multiplying by it.
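A concrete sketch of this layout and indexing for $GF(16)$, with hypothetical helper names:

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t gf16_poly;   // packed block of 4-bit symbols

#define GF16_SYM_SIZE 4
#define GF16_SYM_MASK 0xFull

// Extract the term of the given order; hypothetical helper.
static unsigned gf16_poly_get_term(gf16_poly p, unsigned term_order)
{
    return (unsigned)((p >> (term_order * GF16_SYM_SIZE)) & GF16_SYM_MASK);
}

int main(void)
{
    gf16_poly p = 0x321;                       // 3X^2 + 2X + 1
    printf("%llX\n", (unsigned long long)p);   // each hex digit is a symbol
    printf("%u\n", gf16_poly_get_term(p, 0));  // prints the constant term: 1
    return 0;
}
```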

The benefits of this are that whole Reed-Solomon polynomial blocks for $GF(8)$ and $GF(16)$ can be packed into a single uint32 or uint64 respectively and easily passed on the stack, and many of the most frequently used math operations can be applied to the whole block at once. The simplest example is that polynomial addition/subtraction is a single XOR operation on the whole block for these characteristic 2 fields. Additionally, Reed-Solomon encoding and decoding of arbitrary data is simplified because, regardless of the data being split across several symbols, it's still physically contiguous and doesn't require padding to be inserted or removed at either end.
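For example, whole-block addition could be as simple as this (hypothetical name):

```c
#include <stdint.h>

typedef uint64_t gf16_poly;   // packed block of 4-bit symbols

// Addition and subtraction coincide in characteristic 2, so one XOR
// adds every symbol of the block at once.
static gf16_poly gf16_poly_add(gf16_poly a, gf16_poly b)
{
    return a ^ b;
}
```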

Polynomial Scaling/Multiply

This is where the main benefit of packed polynomials comes into play. Without special hardware support for carry-less polynomial multiplication, such as the CLMUL instruction set extension on x86, the typical implementation uses log and exponent table look-ups to compute multiplies on a term-by-term basis. If we have polynomials of length $n$ terms and symbol size $p$ bits, a look-up based algorithm can simultaneously compute $p^2$ bit interactions but must iterate over each symbol combination, for a total complexity of $O(n)$ for scaling and $O(n^2)$ for multiplication. If we take the alphabet size of the field to be $q$, then $p = \log_2(q)$, and a full-size Reed-Solomon block has $n = q - 1$, so $p \approx \log_2(n)$ at best and typically not worse than $n$. Scaling via shift and XOR, by contrast, operates on $n \cdot p$ bits simultaneously, with a complexity of $O(p)$ operations for scaling and $O(n \cdot p)$ for multiplication if you can operate on the whole block at once. Since $n$ typically ranges from much larger than $p$ down to the same size as $p$, this is potentially a significant improvement.
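Here is a sketch of whole-block scaling under those assumptions: $GF(8)$ with primitive polynomial $x^3 + x + 1$, ten 3-bit symbols packed in a uint32, and (per the caveats below) the in-field and overflow bits kept in separate accumulators so that pre-reduction overflow can't clobber the neighboring symbol. Masks and names are illustrative, not the project's actual API.

```c
#include <stdint.h>

typedef uint32_t gf8_poly;   // up to ten 3-bit symbols, packed

#define SYM0 0x09249249u     // relative bit 0 of every symbol
#define SYM1 0x12492492u     // relative bit 1 of every symbol
#define SYM2 0x24924924u     // relative bit 2 of every symbol

// Multiply every symbol in the block by the same scalar s (0..7).
static gf8_poly gf8_poly_scale(gf8_poly p, uint8_t s)
{
    gf8_poly acc = 0;   // in-field partial products
    gf8_poly ov  = 0;   // overflow bits (degrees 3..4), aligned to each symbol's base

    if (s & 1)
        acc ^= p;                           // v * 1 never overflows
    if (s & 2)
    {
        acc ^= (p << 1) & (SYM1 | SYM2);    // v * x, in-field part
        ov  ^= (p >> 2) & SYM0;             // v * x, degree-3 bit
    }
    if (s & 4)
    {
        acc ^= (p << 2) & SYM2;             // v * x^2, in-field part
        ov  ^= (p >> 1) & (SYM0 | SYM1);    // v * x^2, degrees 3..4
    }
    // Parallel reduction: x^3 = x + 1, so the overflow polynomial o of
    // every symbol contributes o*(x + 1) = (o << 1) ^ o, and with this
    // primitive polynomial the fold can never overflow a second time.
    return acc ^ (ov << 1) ^ ov;
}
```

Note the cost: three conditional shift/XOR rounds plus one fold, regardless of how many symbols the block holds.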

There are, however, some minor caveats. If the native register size is not enough to fit an entire block, operations must be split up again, eating into the improvements; that detractor scales at some fraction of $n$. Additionally, because of the tightly packed nature of the polynomials, we must use $p$ accumulators to prevent overflow from one symbol, prior to reduction, from clobbering the next symbol. (As an aside, if you do have access to CLMUL or an equivalent on your architecture, you would need to space the symbols apart with $p-1$ bits of padding to prevent the same clobbering.) At the end of all the shifting and XORing, we can separate the overflow bits from the rest and use them to apply modular reduction to the entire polynomial simultaneously. However, the difficulty of performing that reduction in parallel is linked to the Hamming weight and degree (ignoring the highest set bit) of the primitive polynomial, since higher Hamming weights and degrees can self-interfere in the overflow region, requiring more steps to cancel out the overflow. Despite all of this, my rough calculations suggest that there should still generally be speed benefits to packed polynomials up to $GF(2^7)$ when operating on a 64-bit machine, although if you meet those requirements, it likely also has some variant of CLMUL, in which case that is always the better option.
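To make the self-interference point concrete, here is a single-symbol comparison for $GF(16)$, where raw products reach degree 6, between $x^4 + x + 1$ and $x^4 + x^3 + 1$ (both primitive; hypothetical names). The first folds once, while the second's fold products land back in the overflow region and can require up to three passes.

```c
#include <stdint.h>

// x^4 + x + 1: o*(x + 1) stays below degree 4, so one fold suffices.
static uint8_t gf16_reduce_0x13(uint8_t prod)   // degree <= 6
{
    uint8_t ov = prod >> 4;                     // degrees 4..6
    return (prod & 0xF) ^ (ov << 1) ^ ov;       // x^4 = x + 1
}

// x^4 + x^3 + 1: o*(x^3 + 1) can reach degree 5, so folds must repeat.
static uint8_t gf16_reduce_0x19(uint8_t prod)   // degree <= 6
{
    while (prod >> 4)                           // at most 3 iterations
    {
        uint8_t ov = prod >> 4;
        prod = (prod & 0xF) ^ (ov << 3) ^ ov;   // x^4 = x^3 + 1
    }
    return prod;
}
```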

In terms of what this practically means for Reed-Solomon, we see the most speed benefit with small fields and high numbers of check symbols, since the check symbol count typically defines the length of what is being multiplied. Since I intend to use this for black and white fiducial markers with occlusion resistance, that is exactly the kind of environment it would excel in.

Polynomial Division/Evaluation

Division and evaluation are implemented according to the synthetic division algorithm. Since one of the steps of the algorithm involves multiplying the divisor polynomial by a single term, that step can be computed using a polynomial scaling.
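As a sketch of the evaluation side, here is Horner's rule on a packed $GF(8)$ polynomial, which is synthetic division by a monic linear factor keeping only the remainder. gf8_mul() is a plain single-symbol multiply included to keep the example self-contained; all names are illustrative rather than the project's actual API.

```c
#include <stdint.h>

typedef uint32_t gf8_poly;   // packed block of 3-bit symbols

// Single-symbol GF(8) multiply: carry-less long multiply, then reduce
// modulo the primitive polynomial x^3 + x + 1.
static uint8_t gf8_mul(uint8_t a, uint8_t b)
{
    uint8_t prod = 0;
    for (int i = 0; i < 3; ++i)
        if (b & (1 << i))
            prod ^= a << i;
    uint8_t ov = prod >> 3;                  // overflow bits: degrees 3..4
    return (prod & 0x7) ^ (ov << 1) ^ ov;    // x^3 = x + 1
}

// Evaluate the packed polynomial p at the point x by Horner's rule.
static uint8_t gf8_poly_eval(gf8_poly p, uint8_t x, int num_terms)
{
    uint8_t acc = 0;
    for (int i = num_terms - 1; i >= 0; --i)
    {
        uint8_t term = (p >> (i * 3)) & 0x7;   // index term i
        acc = gf8_mul(acc, x) ^ term;          // acc = acc*x + term_i
    }
    return acc;
}
```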