mike's blog

faster 𝔽_q arithmetic

I've been working on a crypto math library, which I'm intending to post (in alpha form) and whitepaperize before too long, but in the meantime I'm going to mention some of the tricks I've found for math over 𝔽_q. These tricks are probably known already, so if they're yours, email me and I'll cite you. I'm mainly comparing my results to MPFQ and MIRACL; in my limited testing I was unable to coax competitive results out of MIRACL (I probably have the compilation flags wrong), so the real baseline is MPFQ.

multiplication

Techniques for Montgomery multiplication have been studied extensively; I relied most on the work of Koç, Acar and Kaliski. In their terminology, I found a variant of FIPS (Finely Integrated Product Scanning) to be fastest on x86-64 hardware. Product scanning (walking through the place values of the product, rather than those of one operand) avoids juggling carries, which is painful on x86 architectures. Furthermore, at least in theory, it plays nicely with the deeply pipelined multipliers on some Intel chips, because results that need only low-order words are not blocked behind those that need high-order ones. The result is a multiplier around twice as fast as MPFQ on my Sandy Bridge laptop, costing just over 100 cycles for a 256-bit Montgomery multiply (this includes function-call latency, though the timings are somewhat confounded by TurboBoost).
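
For concreteness, here is a minimal sketch of what a FIPS-style multiply looks like for 4×64-bit operands. This is not my library's code: the names are made up for the sketch, unsigned __int128 is a GCC/Clang extension, and the final subtraction is skipped in line with the reduction trick described below.

    #include <stdint.h>

    typedef unsigned __int128 u128;

    /* Add x*y into a 192-bit column accumulator (acc, acc_hi). */
    static inline void mac(u128 *acc, uint64_t *acc_hi, uint64_t x, uint64_t y)
    {
        u128 p = (u128)x * y;
        *acc += p;
        *acc_hi += (*acc < p);   /* carry out of the low 128 bits */
    }

    /* FIPS Montgomery multiply, one column of a*b + m*n at a time.
     * n0_inv = -n[0]^-1 mod 2^64, precomputed from the modulus n. */
    static void fips_mont_mul(uint64_t r[4],
                              const uint64_t a[4], const uint64_t b[4],
                              const uint64_t n[4], uint64_t n0_inv)
    {
        uint64_t m[4];
        u128 acc = 0;
        uint64_t acc_hi = 0;

        /* Low columns 0..3: choose m[i] so each column's low word
         * becomes zero, then shift it out of the accumulator. */
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < i; j++) {
                mac(&acc, &acc_hi, a[j], b[i - j]);
                mac(&acc, &acc_hi, m[j], n[i - j]);
            }
            mac(&acc, &acc_hi, a[i], b[0]);
            m[i] = (uint64_t)acc * n0_inv;
            mac(&acc, &acc_hi, m[i], n[0]);   /* low word is now zero */
            acc = (acc >> 64) | ((u128)acc_hi << 64);
            acc_hi = 0;
        }

        /* High columns 4..7 produce the result words directly. */
        for (int i = 4; i < 8; i++) {
            for (int j = i - 3; j < 4; j++) {
                mac(&acc, &acc_hi, a[j], b[i - j]);
                mac(&acc, &acc_hi, m[j], n[i - j]);
            }
            r[i - 4] = (uint64_t)acc;
            acc = (acc >> 64) | ((u128)acc_hi << 64);
            acc_hi = 0;
        }
        /* No final subtraction: per the reduction section below, for n
         * sufficiently below 2^256 the top carry is zero and the result,
         * though possibly >= n, stays within bounds. */
    }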

I conjecture, but have not yet verified, that an operand-scanning mechanism like FIOS (Finely Integrated Operand Scanning) would be faster on ARM. This is because carries can be handled with the UMAAL (unsigned multiply-accumulate-accumulate long) instruction, which lets you use an entire word as a carry flag. Almost the entire multiplication can be done with UMAAL, without any other form of carry tracking.
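
Here is roughly what I have in mind, written in portable C rather than assembly; umaal_step and muladd_row are made-up names, and the hope that an ARM compiler turns the inner expression into a single UMAAL is part of the conjecture. The point is the identity (2^32-1)^2 + 2·(2^32-1) = 2^64-1: a multiply with two accumulates can never overflow a double word, so the carry can live in an ordinary register.

    #include <stdint.h>

    /* (hi:lo) = a*b + acc + *carry: exactly UMAAL's semantics.  The sum
     * cannot overflow 64 bits, so the word-sized carry needs no flag. */
    static inline uint32_t umaal_step(uint32_t *carry, uint32_t acc,
                                      uint32_t a, uint32_t b)
    {
        uint64_t t = (uint64_t)a * b + acc + *carry;
        *carry = (uint32_t)(t >> 32);
        return (uint32_t)t;
    }

    /* One row of an operand-scanning multiply: acc[0..n] += a_i * b[0..n-1],
     * chaining the carry entirely through UMAAL's high word. */
    static void muladd_row(uint32_t acc[], const uint32_t b[],
                           uint32_t a_i, int n)
    {
        uint32_t carry = 0;
        for (int j = 0; j < n; j++)
            acc[j] = umaal_step(&carry, acc[j], a_i, b[j]);
        acc[n] += carry;   /* assumes the top slot was vacated, as in FIOS */
    }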

reduction

It has been noted before (e.g. by Hachez and Quisquater) that the final subtraction of a Montgomery multiply is not strictly necessary. This is, I think, more true than most cryptographers realize. Montgomery multiplication on numbers a and b computes (ab+xp)/2^w, where w is the size of the multiply in bits and x is some fudge factor less than 2^w. This means that if a and b are both at most 2^(w/2)·√p, then ab ≤ 2^w·p and the result is less than 2p. Thus if p is sufficiently far below 2^w (p ≤ 2^(w-2) already puts 2p under the operand bound) and the algorithm has all its addition and subtraction chains broken up by multiplications, the operands remain bounded. Note that this makes reductions after additions and subtractions unnecessary as well, which matters because constant-time reductions are expensive. The results will of course not be fully reduced, but they will be correct modulo p. On a 64-bit machine this works very well for p near 2^160 or 2^224, where w is 192 or 256, respectively. Signed wide multiplication is expensive, so negative numbers can be dealt with by adding a suitable multiple of p before multiplication.
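
As a concrete sketch of this discipline (the names and the BIAS constant are assumptions of the sketch, not any particular library's API): additions skip the conditional subtraction entirely, and subtractions add a precomputed multiple of p large enough to dominate the operand bound, so results never go negative.

    #include <stdint.h>

    /* BIAS = k*p for some small k, chosen so BIAS dominates the operand
     * bound 2^(w/2)*sqrt(p); an assumption of this sketch. */
    extern const uint64_t BIAS[4];

    /* Addition with no reduction: the result may exceed p, but the next
     * Montgomery multiply absorbs the excess. */
    static void fq_add(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
    {
        unsigned __int128 c = 0;
        for (int i = 0; i < 4; i++) {
            c += (unsigned __int128)a[i] + b[i];
            r[i] = (uint64_t)c;
            c >>= 64;
        }
    }

    /* Subtraction computed as (a + BIAS) - b: never negative, so all the
     * wide multiplies can remain unsigned. */
    static void fq_sub(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
    {
        unsigned __int128 c = 0;
        uint64_t borrow = 0;
        for (int i = 0; i < 4; i++) {
            c += (unsigned __int128)a[i] + BIAS[i];
            unsigned __int128 d = (unsigned __int128)(uint64_t)c - b[i] - borrow;
            r[i] = (uint64_t)d;
            borrow = (uint64_t)(d >> 64) & 1;   /* 1 iff this word borrowed */
            c >>= 64;
        }
    }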

Something more might be gained by lazy reduction, i.e. by accumulating several products and then reducing mod p. However, except in special cases, this disrupts integrated reduction algorithms (especially FIPS). I have not yet tested performance with lazy reduction, so there may be significant gains to be had, especially on ARM, where FIPS is probably unprofitable, or in quadratic extension multiplies, where such special cases can occur.

extension fields

For quadratic extension fields, complex squaring and Karatsuba multiplication seem to work best, as previously reported by Devegili, Ó hÉigeartaigh, Scott and Dahab. However, there is another optimization opportunity for cubic-over-quadratic fields. To construct the quadratic field, adjoin i = √-1 instead of √-2 (this makes some operations slightly faster anyway). Then, to multiply over the cubic field, use Toom-Cook: not at the points (0, ±1, 2, ∞) but at (0, ±1, ±i). The result looks like a Fourier multiplication algorithm, complete with butterfly gates. By my count this algorithm uses 27 add/subs, against 35 for standard Toom-Cook-3 and 15 for Karatsuba-3, so Toom-Cook's saved multiply now costs 12 extra adds instead of 20. This trick may even outperform Chung-Hasan for squaring, because it uses 5 squarings and no multiplies, and the subfield's "complex" squaring algorithm is much faster than its multiplication. What's more, when using the above reduction trick, the structure of this "Toom-Cook-3i" algorithm allows us to make all the inputs to the subfield multiplication algorithm positive by adding a single bias to the subfield terms of each operand (though this doesn't work quite as well as we'd like, because Karatsuba doubles the bias).
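
Here is a sketch of the evaluation/interpolation skeleton. To keep it self-contained and checkable I've used C's double complex as a stand-in for the quadratic subfield; in the real thing these would be Fq2 operations, with multiplication by i a cheap swap-and-negate. It returns 4 times the degree-4 product, before reduction by the cubic relation; the factor of 4 is dealt with in the next paragraph.

    #include <complex.h>

    typedef double complex fq2;   /* stand-in for an Fq2 element */

    /* c[0..4] = 4 * (a0 + a1 x + a2 x^2)(b0 + b1 x + b2 x^2),
     * by evaluation at (0, 1, -1, i, -i) and Fourier-style interpolation. */
    static void toom3i_mul(fq2 c[5], const fq2 a[3], const fq2 b[3])
    {
        /* evaluation butterflies */
        fq2 as = a[0] + a[2], ad = a[0] - a[2], ai = I * a[1];
        fq2 bs = b[0] + b[2], bd = b[0] - b[2], bi = I * b[1];

        /* the five subfield multiplications */
        fq2 p0  = a[0] * b[0];               /* P(0)  */
        fq2 p1  = (as + a[1]) * (bs + b[1]); /* P(1)  */
        fq2 pm1 = (as - a[1]) * (bs - b[1]); /* P(-1) */
        fq2 pi  = (ad + ai) * (bd + bi);     /* P(i)  */
        fq2 pmi = (ad - ai) * (bd - bi);     /* P(-i) */

        /* interpolation butterflies; each line recovers 4*c_k */
        fq2 u = p1 + pm1;        /* 2(c0 + c2 + c4) */
        fq2 v = p1 - pm1;        /* 2(c1 + c3)      */
        fq2 w = pi + pmi;        /* 2(c0 - c2 + c4) */
        fq2 z = I * (pi - pmi);  /* -2(c1 - c3)     */

        c[0] = 4 * p0;
        c[1] = v - z;
        c[2] = u - w;
        c[3] = v + z;
        c[4] = (u + w) - c[0];
    }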

The above algorithm computes 4 times the product instead of the 6 times that ordinary Toom-Cook-3 produces, so the actual product can be recovered more quickly, with two divisions by 4. What's more, as mentioned by Devegili et al., such divisions may not be necessary at all. In the case of pairings the extra factor is canceled by the final exponentiation, and even when we are not computing pairings, the factor can be compensated for by dividing all cubic-extension elements by 4: if elements are stored as x/4, then Toom-Cook-3i returns 4·(x/4)·(y/4) = (xy)/4, which is already the stored form of the product. This works exactly the same way as compensating for Montgomery multiplication's division by 2^w. It does not affect multiplication by subfield elements, but it does affect addition, subtraction and conversion of subfield elements.

Similar algorithms may be used in other cases. Multiplication in quartic-over-quadratic fields can use a 6th root of unity (if it is not in the base field), and sextic-over-quartic can use a 5th root of unity. It may also be practical to use a direct sextic-over-quadratic extension, speeding up Toom-Cook simply by taking advantage of the additional low-norm elements in the subfield.

I've also tried this trick (on paper) with cube roots of unity, the golden ratio and √±2, and so far √-1 is best except in the quartic-over-quadratic case.

I hope to soon finish implementing elliptic-curve arithmetic and pairings using these techniques. Then we'll see if I can get a significant speedup over the current state of the art.

— Mike Hamburg, August 2, 2011, 2:50 AM