Saturday, 4 July 2020

Montgomery Multiplication - worked numerical examples


1. Montgomery multiplication

1.1. Single-precision Montgomery multiplication

1.1.1. Description

Montgomery multiplication works out xyR–1 mod N without division, where R > N and gcd(R, N) = 1 on the condition of xy < RN.
In practice, N is chosen to be odd and R in form of 2n with R > N.
Montgomery multiplication: montMUL(x, y, N, R) = xyR–1 mod N
Input:  x < N
y < N
N, the modulus, odd
            R, R > N  and R in the form of 2n for an n.
Output: xyR-1 mod N
Procedure montMUL(x, y, N, R):
1.    J = -N–1 mod R; // can be pre-calculated see Algorithm J.
2.    T = x*y ;
3.    q = (T mod R)*J mod R;
4.    t = (T + q*N)/R;
5.    if (t ³ N)
6.        return t - N;
7.    else
8.        return t ;
End_of_procedure

Because R is chosen as R = 2n, the divisions by R, at Line 3 and Line 4, are simply value extractions.

1.1.2. Numerical example

Given x = 5, y = 10, N = 13, calculate xy mod N with Montgomery arithmetic.
Step 1: We represent all the numbers in hexadecimal as follows:
x = 5;
y = A;
N = D;
Step 2: The R can be chosen R = 0x10. (note: the 10 is 16 in decimal)
Step 3: Calculate constants J and H as follows:
J = -N–1 mod R = -D-1 mod 10 = B.
H = R2 mod N = 10 mod D = 9.
Step 4: Applying single-precision montMUL(x, y, N, R) as follows:
1.    J = B // pre-calculated
2.    T = x*y = 5*A = 3C.
3.    q = (T mod R)*J mod R = (3C mod 10)*B mod 10 = C*B mod 10 = 84 mod 10 = 4.
4.    t = (T + q*N) /R = (3C + 4*D)/10 = 70/10= 7 (< N)
\ xyR–1 mod N = 7
Step 5: Calculating montMUL(xyR–1 mod N, H) = xy mod N:
1.    J = B; H = 9; // pre-calculated
2.    T = 7*9 = 3F;
3.    q = F*B mod 10 = A5 mod 10 = 5;
4.    t = (3F + 5*D)/10 = 80/10 = 8;
\ xy mod N = 8.

1.2. Multi-precision Montgomery multiplication

1.2.1. Description

In the multi-precision case of Montgomery multiplication, it is convenient to choose R = (2w)n, where w is the number of bits of a CPU word and the n is the number of CPU words that the R occupies.
The following procedure works out xyR-1 mod N in multi-precision case for an odd N.
Multi-precision Montgomery multiplication: montMUL(x, y, N, R) = xyR–1 mod N
Input:   x[s-1, … , 0]
y[s-1, … , 0]
N[s-1, … , 0]
Output:
t[2s-1, … , s] = xyR-1 mod N
Local variables:
t[2s, 2s-1, … , s , s – 1, … , 0] // 2s + 1 words
c, d, and q, 1-word registers

Procedure: montMUL(x, y, N, R)
1.    J = -N[0]–1 mod 2w // can be pre-calculated by Algorithm J. Note: here J is only one CPU word */
2.    t[2s] = 0; t[2s – 1, … , 0] = x*y; // 2s + 1 words
3.    for i = 0 to s –1 // from right to left
4.    {
5.        c = 0;        // c is the carry
6.        q = t[i]*J mod 2w; // take the lower half
7.        for j = 0 to s – 1
8.        {
9.            (c,d) = t[i +j ] + q*N[j] + c;
10.           t[i + j] = d;
11.       }
12.       t[2s, … , i + s] = t[2s, … , i + s] + c;
13.   }
14.   if  t[2s, 2s–1, … , s] > N // can be optimised
15.       then  t[2s, 2s–1, … , s] = t[2s, 2s–1, … , s] – N;
16.   return t[2s–1, … , s]; // note the difference
End_of_procedure

1.2.2. Numerical example

Given x = 93, y = 167, N = 237 in decimals, calculate xy mod N with multi-precision Montgomery multiplication.
Pencil and paper: 93*126 mod 237 = 126
Step 1: We represent all the numbers in hexadecimal as follows:
x = 0x5D;
y = 0xA7;
N = 0xED; // N[1] = 0xE, N[0] = 0xD;
Step 2: The R is chosen R = 0x100. (note: 0x100 is 256 in decimal)
Step 3: Calculate constants J and H as follows:
J = -N[0]–1 mod R = -D-1 mod 10 = B;
H = R2 mod N = 10000 mod ED = 7C;
Step 4: Applying multi-precision montMUL(x, y) as follows: (s = 2)
1.    J = B; // pre-calculated
2.    t[4,3,2,1,0] = (0, x*y) = (0, 5D*A7) = 03CAB
3.    i = 0
4.    { // the i = 0 loop
5.        c = 0;
6.        q = t[0]*J mod 10 = B*B mod 10 = 79 mod 10 = 9
7.        j = 0
8.        { the j = 0 loop
9.            (c,d) = t[0 + 0] + q*N[0] + 0 = B + 9*D + 0 = 80
         //c = 8 and d = 0 at this point;
10.           t[0 + 0] = t[0] = d = 0;
11.       } // j += 1; Line 8
8.        {the j = 1 loop
9.            (c,d) = t[0 + 1] + q*N[1] + 8 = A + 9*E + 8 = 90
        // c = 9, d = 0
10.           t[0 + 1] = t[1] = d = 0;
11.       } // j+= 1; j = 2 now and exit of loop j
    // t[4,3,2,1,0] = 03C00 now
12.       t[4,3,2] = t[4,3,2] + c = 03C + 9 = 045;
    // t[4,3,2,1,0] = 04500 now
13.   } // i += 1 and go to Line 4
4.    {   i = 1
5.        c = 0;
6.        q = t[1]*J mod 10 = 0*J mod 10 = 0;
7.        j = 0;
8.        { the j = 0 loop
9.            (c,d) = t[1] + 0 + 0 = 00;
        // c = 0, d = 0;
10.           t[1 + 0] = t[1] = d = 0;
11.       } // j += 1 and go to Line 8
8.        { j = 1 loop
9.            (c,d) = t[2] + 0 + 0 = 05;
        // c = 0, d = 5;
10.           t[1 + 1] = t[2] = d = 5;
11.       } // j += 1; j = 2, exits
    // t[4,3,2,1,0] = 04500
12.       t[4,3] = t[4,3] + c = 04 + 0 = 04;
13.   } // i += 1. exit
// t[4,3,2,1,0] = 04500;
14.   045 < ED; // N = 0xED
15.   // nothing
16.   return t[3,2] = 45;
\5D*A7*R–1 mod N = 0x45;
Step 5: By applying montMUL() again to achieve
montMUL(45,H) = montMUL(45,7C)= 0x7E. (126 in decimal)

1.3. J, the inverse of N modulo 2w

As we can see, an efficient algorithm is needed to work out the J  in the above two algorithms. The J can be calculated either by a general Extended Euclidean Algorithm or by the Algorithm J below which takes advantage of R = 2w.
Algorithm J: J = – N–1 mod 2w
Input:  N
Output:          
J = – N–1 mod 2w
Procedure: J = montJ(N)
1.    A = 1; // A is a local variable
2.    for i = 0 to w – 1 // lsb to msb
3.    {
4.        if (A is odd) then // the lsb == 0
5.            J(bit i) = 1; // the i-th bit of J
6.        else
7.            J(bit i) = 0;
8.        A = (A + J(bit i)*N)/2;
9.    }
End_of_procedure

Correctness proof: use exists k, such that JN + 1 = k2w .
In implementation, the bit i of J is right-shifted into J not by setting bits.

2. Montgomery arithmetic

2.1. Normal domain vs. Montgomery domain

We loosely say an integer ring, ZN or a GF(p), which we normally deal with, as normal domains. For an odd N, as well as p, and an R in the form of 2n, greater than N, we can make a 1-to-1 mapping as follows:


Given N and R, we have

1.     montMUL(x, H) = xR mod N
2.     montMUL(xR, 1) = x mod N
3.     montMUL(xR, yR) = xyR mod N (Montgomery domain is closed)
4.     montEXP(xR, k) = xkR mod N (Montgomery exponentiation)

To perform the above, we need to calculate:

1.     J =  N–1 mod R
2.     H = R2 mod N

Sometimes, in Elliptic Curves Cryptography, we may need

            U = R3 mod N

which can be achieved by U = montMUL(H, H) = R3 mod N and used for

montMUL(x–1R–1, U) = x–1R mod N

to adjust an inverse, achieved by the Extended Euclidean Algorithm, back to the Montgomery domain.

3. Further readings and studies

The following papers are recommended:
1.     Peter L. Montgomery, "Modular multiplication without trial division". Mathematics of Computation. 44: 519–521, 1985
2.     C. K. Koc, T. Acar and B. Kaliski, “Analyzing and Comparing Montgomery Multiplication Algorithms”, IEEE Micro, 16(3):26-33, June 1996
3.     B. Kaliski, “The Montgomery inverse and its applications”, IEEE Transactions on Computers (Volume: 44 , Issue: 8 , Aug 1995

The above #1 is classic, which provides the theory.
The #2 is a good refence for implementation considerations. When a designer is given a CPU architecture, it would be very helpful to study this paper before coding.
The #3 provides an alternative for obtaining inverse in Montgomery domain to the method mentioned in section 2.2. Commonly used arithmetic.

It is recommended to make a worked numerical example of Algorithm J with N = 0xD (1101) and w = 4 by completing the following table:
i
Bit i
A + J(bit i)*N
A
-
-
-
0001
0
1
01110
0111
1
1
?
?
2
0
?
?
3
1
?
?
 J = 1011 = 0xB







Post Quantum Cryptography

The consensus appears that quantum computers will be built, and RSA and ECC will be broken by  Shor's algorithm  at some point. Although...