1. Montgomery multiplication
1.1. Single-precision Montgomery multiplication
1.1.1. Description
Montgomery multiplication works out xyR–1 mod N without
division, where R > N
and gcd(R, N) =
1 on the condition of xy
< RN.
In practice, N
is chosen to be odd and R
in form of 2n
with R > N.
Montgomery multiplication: montMUL(x, y, N, R) = xyR–1
mod N
Input: x
< N
y < N
N, the
modulus, odd
R, R > N and R in the form of 2n for an n.
Output: xyR-1 mod N
Procedure montMUL(x,
y, N, R):
1. J = -N–1 mod R; // can be pre-calculated
see Algorithm J.
2. T = x*y ;
3. q = (T mod R)*J
mod R;
4. t = (T + q*N)/R;
5. if (t ³ N)
6.
return t
- N;
7.
else
8.
return t
;
End_of_procedure
Because R is chosen as R = 2n, the divisions by R, at Line
3 and Line 4, are simply value extractions.
1.1.2. Numerical example
Given x = 5, y = 10, N = 13, calculate xy mod N with Montgomery arithmetic.
Step 1: We represent all the numbers in hexadecimal as
follows:
x = 5;
y = A;
N = D;
Step 2: The R can be chosen R = 0x10. (note: the 10 is 16 in decimal)
Step 3: Calculate constants J and H as follows:
J = -N–1 mod
R = -D-1 mod
10 = B.
H = R2 mod N = 10 mod D = 9.
Step 4: Applying single-precision montMUL(x, y, N, R) as follows:
1.
J = B //
pre-calculated
2.
T = x*y
= 5*A = 3C.
3.
q = (T
mod R)*J mod R = (3C mod 10)*B mod 10 = C*B
mod 10 = 84 mod 10 = 4.
4.
t = (T +
q*N) /R = (3C + 4*D)/10 = 70/10=
7 (< N)
\ xyR–1 mod N = 7
Step 5: Calculating montMUL(xyR–1 mod N, H)
= xy mod N:
1. J = B; H = 9; //
pre-calculated
2. T = 7*9 = 3F;
3. q = F*B mod 10 = A5 mod
10 = 5;
4. t = (3F + 5*D)/10 = 80/10 = 8;
\ xy mod N = 8.
1.2. Multi-precision Montgomery multiplication
1.2.1. Description
In the multi-precision case of Montgomery multiplication,
it is convenient to choose R = (2w)n,
where w
is the number of bits of a CPU word and the n is the number of CPU words that the R
occupies.
The following procedure works out xyR-1 mod N in multi-precision case for an odd N.
Multi-precision
Montgomery multiplication:
montMUL(x, y, N, R) = xyR–1 mod N
Input: x[s-1, … , 0]
y[s-1, … , 0]
N[s-1, … , 0]
Output:
t[2s-1, … , s] = xyR-1 mod N
Local variables:
t[2s, 2s-1, … , s , s – 1, … , 0] // 2s + 1 words
c, d, and q, 1-word registers
Procedure: montMUL(x,
y, N, R)
1. J = -N[0]–1
mod 2w // can be pre-calculated by Algorithm J. Note: here J is only one CPU
word */
2. t[2s] = 0; t[2s –
1, … , 0] = x*y; // 2s + 1 words
3. for i = 0 to s –1
// from right to left
4. {
5. c = 0; // c is the carry
6. q = t[i]*J mod 2w; // take the
lower half
7. for j = 0 to s – 1
8. {
9. (c,d) = t[i +j ] + q*N[j] + c;
10. t[i + j] = d;
11. }
12. t[2s, … , i + s] = t[2s, … , i + s] + c;
13. }
14. if t[2s, 2s–1, … , s] > N // can be optimised
15. then
t[2s, 2s–1, … , s] = t[2s, 2s–1, … , s] – N;
16. return t[2s–1, … ,
s]; // note the difference
End_of_procedure
1.2.2. Numerical example
Given x = 93, y =
167, N = 237 in decimals,
calculate xy mod N with multi-precision Montgomery multiplication.
Pencil and paper: 93*126
mod 237 = 126
Step 1: We represent all the numbers in hexadecimal as
follows:
x
= 0x5D;
y
= 0xA7;
N
= 0xED; // N[1] = 0xE, N[0] = 0xD;
Step 2: The R is chosen R = 0x100. (note: 0x100 is 256 in decimal)
Step 3: Calculate constants J and H as
follows:
J = -N[0]–1
mod R = -D-1 mod
10 = B;
H = R2 mod N = 10000 mod ED = 7C;
Step 4: Applying multi-precision montMUL(x, y) as follows:
(s = 2)
1.
J = B;
// pre-calculated
2.
t[4,3,2,1,0]
= (0, x*y) = (0, 5D*A7) = 03CAB
3.
i = 0
4.
{ // the
i = 0 loop
5.
c = 0;
6.
q = t[0]*J mod 10 = B*B
mod 10 = 79 mod 10 = 9
7.
j = 0
8.
{ the j = 0 loop
9.
(c,d) = t[0 + 0] + q*N[0] + 0 = B + 9*D + 0 = 80
//c = 8 and d = 0 at this point;
10.
t[0 + 0] = t[0] = d = 0;
11.
} // j += 1; Line 8
8. {the
j = 1 loop
9. (c,d) = t[0 + 1] + q*N[1] + 8 = A + 9*E
+ 8 = 90
// c = 9, d = 0
10. t[0 + 1] = t[1] = d = 0;
11. }
// j+= 1; j = 2 now and exit of loop j
// t[4,3,2,1,0] = 03C00 now
12.
t[4,3,2] = t[4,3,2] + c = 03C + 9 = 045;
// t[4,3,2,1,0] = 04500 now
13.
} // i +=
1 and go to Line 4
4. { i =
1
5. c =
0;
6. q =
t[1]*J mod 10 = 0*J mod 10 = 0;
7. j =
0;
8. { the
j = 0 loop
9. (c,d) = t[1] + 0 + 0 = 00;
// c = 0, d = 0;
10. t[1 + 0] = t[1] = d = 0;
11. } //
j += 1 and go to Line 8
8. { j
= 1 loop
9. (c,d) = t[2] + 0 + 0 = 05;
// c = 0, d = 5;
10. t[1 + 1] = t[2] = d = 5;
11. }
// j += 1; j = 2, exits
// t[4,3,2,1,0] = 04500
12. t[4,3]
= t[4,3] + c = 04 + 0 = 04;
13. } // i += 1. exit
//
t[4,3,2,1,0] = 04500;
14. 045 < ED; // N = 0xED
15. // nothing
16. return t[3,2] = 45;
\5D*A7*R–1 mod N = 0x45;
Step 5: By applying montMUL() again to achieve
montMUL(45,H) = montMUL(45,7C)= 0x7E. (126 in decimal)
1.3. J, the inverse of –N modulo 2w
As we can see, an efficient algorithm is needed to work out
the J in the above two algorithms. The J
can be calculated either by a general Extended Euclidean Algorithm or by the
Algorithm J below which takes advantage of R
= 2w.
Algorithm J: J = – N–1 mod 2w
Input: N
Output:
J = – N–1 mod 2w
Procedure: J = montJ(N)
1.
A = 1;
// A is a local variable
2.
for i =
0 to w – 1 // lsb to msb
3.
{
4.
if (A is odd) then // the lsb == 0
5.
J(bit i) = 1; // the i-th bit of J
6.
else
7.
J(bit i) = 0;
8.
A = (A + J(bit i)*N)/2;
9.
}
End_of_procedure
Correctness proof: use exists k, such that JN
+ 1 = k2w .
In implementation, the bit i of J
is right-shifted into J
not by setting bits.
2. Montgomery arithmetic
2.1. Normal domain vs. Montgomery domain
We loosely say an integer ring, ZN or a GF(p), which we normally deal with, as
normal domains. For an odd N,
as well as p,
and an R in the
form of 2n,
greater than N, we can make a 1-to-1 mapping as follows:
Given N
and R, we have
1. montMUL(x, H) = xR
mod N
2. montMUL(xR, 1) = x
mod N
3. montMUL(xR, yR) = xyR
mod N (Montgomery domain is closed)
4. montEXP(xR, k) = xkR
mod N (Montgomery
exponentiation)
To perform the above, we need to calculate:
1. J = –N–1
mod R
2. H = R2 mod N
Sometimes, in Elliptic Curves Cryptography, we may need
U
= R3 mod
N
which can be achieved by U = montMUL(H, H) = R3 mod
N and used for
montMUL(x–1R–1,
U) = x–1R mod N
to adjust an inverse, achieved by the Extended Euclidean
Algorithm, back to the Montgomery domain.
3. Further readings and studies
The following papers are recommended:
1. Peter
L. Montgomery, "Modular multiplication without trial division".
Mathematics of Computation. 44: 519–521, 1985
2. C.
K. Koc, T. Acar and B. Kaliski, “Analyzing and Comparing Montgomery
Multiplication Algorithms”, IEEE Micro, 16(3):26-33, June 1996
3. B.
Kaliski, “The Montgomery inverse and its applications”, IEEE Transactions on
Computers (Volume: 44 , Issue: 8 , Aug 1995
The above #1 is classic, which provides the theory.
The #2 is a good refence for implementation considerations.
When a designer is given a CPU architecture, it would be very helpful to study
this paper before coding.
The #3 provides an alternative for obtaining inverse in
Montgomery domain to the method mentioned in section 2.2.
Commonly used arithmetic.
It is recommended to make a worked numerical example of Algorithm J with N = 0xD (1101) and w = 4 by completing the following
table:
i
|
Bit i
|
A + J(bit i)*N
|
A
|
-
|
-
|
-
|
0001
|
0
|
1
|
01110
|
0111
|
1
|
1
|
?
|
?
|
2
|
0
|
?
|
?
|
3
|
1
|
?
|
?
|
J
= 1011 = 0xB