I've been working on a crypto math library, which I'm intending to
post (in alpha form) and whitepaperize before too long, but in
the meantime I'm going to mention some of the tricks I've found
for math over 𝔽_{q}. These tricks are probably
known already, so if they're yours, email me and I'll cite you.
I'm mainly comparing my results to MPFQ and MIRACL. In my limited testing, I
was unable to coax competitive results out of MIRACL, but I
probably have the compilation flags wrong, so I've been comparing
more to MPFQ.

### multiplication

Techniques for Montgomery multiplication have been studied
extensively; I relied most on the work of
Koç, Acar and Kaliski. In their terminology, I found a variant of
FIPS (Finely Integrated Product Scanning) to be fastest on x86-64
hardware. Product scanning — focusing on a given place-value in
the product, instead of in one of the operands — avoids juggling
carries, which is painful on x86 architectures. Furthermore, at
least in theory it plays nicely with the deeply pipelined
multiplier on some Intel chips, because it does not block results
using low-order bits on those using high-order bits. The result
is a multiplier which is around twice as fast as MPFQ on my Sandy
Bridge laptop, costing just over 100 cycles for a 256-bit
Montgomery multiply (this includes function-call overhead, though the
timings are somewhat confounded by TurboBoost).
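For concreteness, here's a sketch of the product-scanning structure in Python. A big int stands in for the three-word column accumulator you'd keep in registers, and the word size, limb count, and names are illustrative rather than my library's actual API:

```python
# FIPS-style (finely integrated product scanning) Montgomery multiply sketch.
# Each outer iteration accumulates one column of a*b + m*n, choosing the
# quotient digit m[i] to zero that column; no carry flags are needed because
# the accumulator absorbs them.

W = 1 << 64   # word size
S = 4         # limb count: 256-bit operands

def to_limbs(x):
    return [(x >> (64 * i)) & (W - 1) for i in range(S)]

def from_limbs(xs):
    return sum(v << (64 * i) for i, v in enumerate(xs))

def mont_mul(a, b, n):
    """Compute a*b / 2^256 mod n; output lies in [0, 2n) when a*b <= 2^256 * n."""
    A, B, N = to_limbs(a), to_limbs(b), to_limbs(n)
    n0inv = pow(-n, -1, W)           # -n^{-1} mod 2^64 (n must be odd)
    m = [0] * S                      # Montgomery quotient digits
    u = [0] * (S + 1)                # result limbs
    t = 0                            # column accumulator (three words in asm)
    for i in range(S):               # low columns: pick m[i] to zero each one
        for j in range(i):
            t += A[j] * B[i - j] + m[j] * N[i - j]
        t += A[i] * B[0]
        m[i] = ((t & (W - 1)) * n0inv) & (W - 1)
        t += m[i] * N[0]
        t >>= 64                     # the column is now zero; shift it out
    for i in range(S, 2 * S):        # high columns become the result
        for j in range(i - S + 1, S):
            t += A[j] * B[i - j] + m[j] * N[i - j]
        u[i - S] = t & (W - 1)
        t >>= 64
    u[S] = t
    return from_limbs(u)
```

Note the absence of a final conditional subtraction, which is the subject of the next section.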

I conjecture, but have not yet verified, that an operand-scanning
mechanism like FIOS would be faster on ARM. This is because carries
can be handled with the UMAAL (unsigned multiply accumulate accumulate
long) instruction, which lets you use an entire word as a carry
flag. Almost the entire multiplication can be done using UMAAL,
without the need for any other form of carry tracking.

### reduction

It has been noted before (e.g. by Hachez
and Quisquater) that the final subtraction of a Montgomery
multiply is not strictly necessary. This is, I think, more true
than most cryptographers realize. Montgomery multiplication on
numbers *a* and *b* computes
(*ab+xp*)/2^{w}, where *w* is the size of
the multiply in bits and *x* is some fudge factor less than
2^{w}. This means that if *a* and *b* are
both at most 2^{w/2}√p then the result will be at
most 2*p*. Thus if *p* is sufficiently less than
2^{w} and the algorithm has all its addition and
subtraction chains broken up by multiplications, the operands will
remain bounded. Note that this makes reductions after additions and
subtractions unnecessary as well, which is important as
constant-time reductions are expensive. The results will not of
course be fully reduced, but they will be correct modulo *p*.
On a 64-bit machine this works very well for *p* near
2^{160} or 2^{224} where *w* is 192 or 256,
respectively. Signed wide multiplication is expensive, so negative
numbers can be dealt with by adding a suitable multiple of *p*
before multiplication.
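As a sanity check on the bound argument (this is an illustration, not library code), here's a Python sketch: `mont` computes (*ab+xp*)/2^{w} directly, and the assertions confirm that with *p* near 2^{224} and *w* = 256, unreduced outputs can go through short add chains and straight back into multiplies indefinitely:

```python
# Demonstrate that skipping the final Montgomery subtraction stays bounded:
# outputs remain below 2p, and sums of a few such values remain far below
# the 2^{w/2}*sqrt(p) input bound, so nothing ever needs a full reduction.
from math import isqrt

w = 256
p = 2**224 - 2**96 + 1        # e.g. the NIST P-224 prime, well below 2^w
R = 1 << w
p_inv = pow(-p, -1, R)        # -p^{-1} mod 2^w
R_inv = pow(R, -1, p)

def mont(a, b):
    """Montgomery product (a*b + x*p) / 2^w with no final subtraction."""
    x = (a * b * p_inv) % R   # fudge factor x < 2^w making the sum divisible
    return (a * b + x * p) >> w

bound = (1 << (w // 2)) * isqrt(p)   # inputs below this give outputs below 2p
a, b = p - 1, p - 2                  # start from (nearly) reduced values
for _ in range(50):
    c = mont(a, b)
    assert c % p == (a * b * R_inv) % p   # correct mod p...
    assert c < 2 * p                      # ...and still bounded
    s = c + c + c                         # a short add chain, never reduced
    assert s < bound                      # safe to feed straight back in
    a, b = s, c
```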

Something more might be gained by lazy reduction, i.e. by
accumulating several products and then reducing mod *p*.
However, except in special cases, this disrupts integrated reduction
algorithms (especially FIPS). I have not yet tested performance
with lazy reduction, so there may be significant gains to be had,
especially on ARM where FIPS is probably unprofitable, or in
quadratic extension multiplies where special cases can occur.

### extension fields

For quadratic extension fields, complex squaring and Karatsuba
multiplication seem to work best, as previously
reported by Devegili, Ó hÉigeartaigh, Scott and Dahab. However,
there is another optimization opportunity for cubic-over-quadratic
fields. To construct the quadratic field, adjoin *i*=√-1
instead of √-2 (this makes some operations slightly faster anyway).
Then to multiply over the cubic field, use Toom-Cook: not on
(0,±1,2,∞) but on (0,±1,±*i*). The result looks like a Fourier
multiplication algorithm, complete with butterfly gates. By my
count this algorithm uses 27 add/subs instead of 35 add/subs for
standard Toom-Cook-3 and 15 for Karatsuba-3, so Toom-Cook trades 12
adds for a multiply instead of 20. This trick may even outperform
Chung-Hasan
for squaring because it uses 5 squarings and no multiplies, and the
subfield's "complex" squaring algorithm is much faster than
multiplication. What's more, when using the above reduction trick,
the structure of this "Toom-Cook-3*i*" algorithm allows us to
make all the inputs to the subfield multiplication algorithm
positive by adding a single bias on the subfield term of each
operand (though this doesn't work quite as well as we'd like because
Karatsuba doubles the bias).
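Here's the "Toom-Cook-3*i*" evaluation/interpolation structure sketched in Python over a toy 𝔽_{p²} (with a schoolbook subfield multiply rather than Karatsuba, and producing the raw degree-4 product rather than reducing modulo the cubic); the names are mine, not the library's. Multiplication by *i* is just a swap and a negation, so both evaluation and interpolation are pure add/sub butterflies, and the result comes out scaled by 4:

```python
# Multiply two degree-2 polynomials over F_{p^2} = F_p[i]/(i^2+1) by
# evaluating at (0, +1, -1, +i, -i), multiplying pointwise (5 subfield
# multiplies), and interpolating with FFT-like butterflies.

p = 2**127 - 1     # any prime p = 3 (mod 4), so i^2 = -1 generates F_{p^2}

def add(a, b): return ((a[0] + b[0]) % p, (a[1] + b[1]) % p)
def sub(a, b): return ((a[0] - b[0]) % p, (a[1] - b[1]) % p)
def mul_i(a):  return ((-a[1]) % p, a[0])        # (x + yi)*i = -y + xi
def mul(a, b):                                   # schoolbook F_{p^2} multiply
    return ((a[0] * b[0] - a[1] * b[1]) % p,
            (a[0] * b[1] + a[1] * b[0]) % p)

def toom3i(a, b):
    """Return 4*(a*b) as a degree-4 coefficient list; a, b have degree 2."""
    def evaluate(f):               # f at (0, 1, -1, i, -i): pure butterflies
        s, d = add(f[0], f[2]), sub(f[0], f[2])
        fi = mul_i(f[1])
        return [f[0], add(s, f[1]), sub(s, f[1]), add(d, fi), sub(d, fi)]
    v = [mul(x, y) for x, y in zip(evaluate(a), evaluate(b))]  # 5 multiplies
    s12, d12 = add(v[1], v[2]), sub(v[1], v[2])  # inverse butterflies
    s34, d34 = add(v[3], v[4]), sub(v[3], v[4])
    t = mul_i(d34)                 # i*(v(i) - v(-i)) = 2*(c3 - c1)
    c0 = add(v[0], v[0]); c0 = add(c0, c0)       # 4*c0 via two doublings
    return [c0, sub(d12, t), sub(s12, s34), add(d12, t),
            sub(add(s12, s34), c0)]              # 4*(c0, c1, c2, c3, c4)
```

The output matches 4× the schoolbook convolution, using only the 5 pointwise subfield multiplies plus adds, subtracts, and free multiplications by *i*.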

The above algorithm computes 4 times the product instead of 6
times, so the actual product can be recovered more quickly with two
divisions by 4. What's more, as mentioned by DhÉSB such divisions
may not be necessary. In the case of pairings, the extra factor
will be canceled by the final exponentiation, but even when we are
not computing pairings, an extra factor can be compensated for by
dividing all cubic-extension elements by 4. This trick works
exactly the same way as compensating for Montgomery multiplication's
division by 2^{w}. It does not affect multiplication
by subfield elements, but it does affect addition, subtraction and
conversion of subfield elements.
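A scalar sketch of this compensation (with `scaled_mul` standing in for any multiply that returns 4× the product, and `enc`/`dec` for the conversions; all names hypothetical):

```python
# Store each cubic-extension element z as z/4. A multiply that returns
# 4*x*y then maps scaled inputs to a scaled output, exactly as Montgomery
# form absorbs the division by 2^w.

p = 2**127 - 1
inv4 = pow(4, -1, p)

def enc(z): return (z * inv4) % p     # convert into the scaled representation
def dec(z): return (4 * z) % p        # convert back out

def scaled_mul(x, y):                 # stands in for the 4x-product multiply
    return (4 * x * y) % p

z1, z2 = 123456789, 987654321
assert dec(scaled_mul(enc(z1), enc(z2))) == (z1 * z2) % p  # stays in form
assert dec((enc(z1) + enc(z2)) % p) == (z1 + z2) % p       # internal add/sub: fine
k = 42                                                     # unscaled subfield element
assert dec((k * enc(z1)) % p) == (k * z1) % p              # subfield multiply: no change
assert dec((enc(z1) + enc(k)) % p) == (z1 + k) % p         # but it must be scaled to add
```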

Similar algorithms may be used in other cases. Multiplication in
quartic-over-quadratic fields can use a 6th root of unity (if it is
not in the base field), and sextic-over-quartic can use a 5th root
of unity. It may also be practical to use a direct
sextic-over-quadratic extension, speeding up Toom-Cook simply by taking
advantage of the additional low-norm elements in the subfield.

I've also tried this trick (on paper) with cube roots of unity,
the golden ratio and √±2, and so far √-1 is best except in the
quartic-over-quadratic case.

I hope to soon finish implementing elliptic-curve arithmetic and
pairings using these techniques. Then we'll see if I can get a
significant speedup over the current state of the art.