Modular multiplicative inverse compiler optimizations

Preface

After I got interested in cryptography, I started reading a bit about random number generator. I read this article http://www.pcg-random.org/posts/a-quick-look-at-xoshiro256.html written by the excellent random expert, and when I got to the section of modular inverse I was stunned. I started to think how to embed this idea into my first prototype of symmetry block encryption algorithm, but then I realized that this concept has much greater potential in code optimizations. I discovered that (as usual) somebody had thought about that idea before me (Hackers delight), but yet I was left with some space to innovate.

The Method

As a background to read this Article, I suggest reading this article for those who are unfamiliar with modular multiplicative inverse.
Everyone knows that division operation is a heavy task for the processor. And so is the modulus operation. Finding the modular inverse of a number is done by a simple algorithm, which is out of the scope of this article.

Currently, many compilers are using some methods to substitute division with some multiplications and bit shifting. I want to suggest an optimization method that optimize modulus operation in the following format:

given ‘x’ an n-bit length unsigned int variable and two odd integer constants ‘D’, ‘R’ (D is for divisor, R is for remainder):

bool check_mod(unsigned x)
{
    return x % D == R;
}

Now, the modular multiplicative inverse (let’s call it m.m.i for short), conforms to the following equations:

[all operations are performed on n bits integers, 
division rounded toward zero 
and overflows wraps around.]
[D is an odd unsigned integer]
[x is an unsigned integer value]
mmi(mmi(D))=D
mmi(D)*x=x/D+(x%D)*mmi(D)

Now, it implies that all the integers that are dividable by D (where D is odd), will reside in the first floor((2n-1)/D) after we multiply them by mmi(D). So multiplying by mmi(D) is a one to one transformation that converts :

0*D => multiplied by mmi(D) => 0

1*D => multiplied by mmi(D) => 1

m*D => multiplied by mmi(D) => m

while ‘m’ equals floor((2n-1)/D).

for example if D is 5, and n is 8 (integers in the range [0,255]) then mmi(D) is 205. (5*205 = 1025. which gives the modulus of 1)

(0 * 205) % 256 = 0

(5 * 205) % 256 = 1

(10 * 205) % 256 = 2

(15 * 205) % 256 = 3

.

.

.

(255 * 205) % 256 = 51

In this case m equals 51 which is floor((2n-1)/D) as we stated above: floor(255/5) = 51.

So now we know that if X is dividable by D, if we simply multiple X by mmi(D), we will get a number that is less or equal to [m], which is actually the quotient of the division X/D.

So originally we intended to convert this piece of code to something faster.

bool check_mod(unsigned x)
{
    return x % D == R;
}

We first look at the case where D is an odd number.

For simplicity, we will focus now on the case where R equals 0, and we will be using 64 bit integers for this example:

bool check_mod(uint64 x)
{
    return x % D == 0; //D is odd
}

So from what we learned above, this could be easily converted to the following:

x*mmi(D)<=m

or like this:

bool check_mod(uint64 x)
{
    return x * mmi64(D) <= (ULLONG_MAX/D) ; //D is odd
}

* UULONG_MAX is actually (264-1).

** mmi(D) and (ULLONG_MAX/D) are calculated by the compiler at compile-time.

Hold tight, things get a little bit more complex from now.

All we talked about until now is the case that R equals to zero.

Now let’s move to the more general case where R isn’t zero.

bool check_mod(unsigned x)
{
    return x % D == R; //D is odd
}

remember that equation that I introduced to you?

mmi(D)*x=x/D+(x%D)*mmi(D)

We want to check whether x%D equals to R.

let us subtract R*mmi(D) from both sides of the equation…

x*mmi(D)-R*mmi(D) = x/D+(x%D)*mmi(D)-R*mmi(D)

x*mmi(D)-R*mmi(D) = x/D+(x%D-R)*mmi(D)

if R is equal to x%D then the result that we will get will be in the range:

0 <= mmi(D)*x – R*mmi(D) <= floor((2n-1-R)/D)

What is this number    floor((2n-1-R)/D)    you probably ask? It is the greatest quotient resulted from dividing by ‘D’ an integer in the range [0,2n-1] that yields the remainder of R. So the range of the result should be as stated above.

Let’s examine this further:

0*D + R => multiplied by mmi(D) => 0 + R*mmi(D)

1*D + R => multiplied by mmi(D) => 1 + R*mmi(D)

2*D + R => multiplied by mmi(D) => 2 + R*mmi(D)

3*D + R => multiplied by mmi(D) => 3 + R*mmi(D)

.

.

.

m*D + R => multiplied by mmi(D) => m + R*mmi(D)

where ‘m’ equals to floor((2n-1-R)/D).

Note that:

(m+1)*D+R > 2n-1  .

Let’s see an example:

n = 6 => range [0,63]

D = 25

mmi(D) = 41

So I made 25 tables for you, one for each R (remainder). Note that the first 14 tables have 3 rows and the other 11 have only two row. The number of row for specific remainder is determined by the equation

rows = (63-R)/25 + 1

these are the pairs of an integer in the range (left) with its multiplication by mmi(D) = 41 (right):

0 0
25 1
50 2
1 41
26 42
51 43
2 18
27 19
52 20
3 59
28 60
53 61
4 36
29 37
54 38
5 13
30 14
55 15
6 54
31 55
56 56
7 31
32 32
57 33
8 8
33 9
58 10
9 49
34 50
59 51
10 26
35 27
60 28
11 3
36 4
61 5
12 44
37 45
62 46
13 21
38 22
63 23
14 62
39 63
15 39
40 40
16 16
41 17
17 57
42 58
18 34
43 35
19 11
44 12
20 52
45 53
21 29
46 30
22 6
47 7
23 47
48 48
24 24
49 25

 


 

The case when the divisor is an EVEN number:

We split the divisor D into two parts:

D = Dodd*Dpow2

we derive k from Dpow2:

Dpow2 = 2k

bool check_mod(uint64 x) // unsigned
{
 k = ctzll(D);/* the power of 2 */
 uint64_t Dodd = D >> k; 
 return rotateRight64((x-R)*mmi64(Dodd),k) <= (ULLONG_MAX-R)/D;
}

when x-R is dividable by D, it means that it is also dividable by Dodd and by Dpow2, and vice versa. and if x-R is dividable by Dpow2, it means that the k least significant bits are zero. So when we do bit rotation, if any of the k least significant bits is not zero, the number that we will get after the rotation will always be larger than (ULLONG_MAX-R)/D, because  ULLONG_MAX+1 / Dpow2 > (ULLONG_MAX-R)/(Dpow2*Dodd).


Let us deal with signed integers (only the case when the divisor is positive). The problem with signed integers modulo is the inconsistency of implementations. Consider the following abstract code:

int num = -1;
int divisor = 7;
print (num % divisor);

Some implementations will produce the number 6 while others will produce -1.

So, for those implementations which produce negative numbers when the input number is negative, we simply check if the desired remainder that was given to be optimized is negative or positive.

If the remainder is negative, we need to ignore all the positive numbers, because they will never produce negative remainders.

bool check_mod(int64 x)
{
    return x % D == R; // R is negative
}

The formula for optimizing this case is as follow (all the math is done with unsigned integers) :

bool check_mod(int64 x)
{
    uint64 u_x = x; // u_x is the unsigned version of x
    u_x += ((1<<63)-R)-((1<<63)-R)%D; //carefully move the negative bound to the positive
    k = ctzll(D);/* the power of 2 */
    uint64_t Dodd = D >> k; 
    return rotateRight64((u_x)*mmi64(Dodd), k) <= ((1<<63)-R)/D; //unsigned comparison
}

If the remainder is positive, we need to ignore all the negative numbers, because they will never produce positive remainders.

bool check_mod(int64 x)
{
    return x % D == R; // R is positive
}

The formula for optimizing this case is as follow (all the math is done with unsigned integers) :

bool check_mod(int64 x)
{
    uint64 u_x = x; // u_x is the unsigned version of x
    k = ctzll(D);/* the power of 2 */
    uint64_t Dodd = D >> k; 
    return rotateRight64((u_x-R)*mmi64(Dodd),k) <= (ULLONG_MAX/2-R)/D; //unsigned comparison
}

The other case for those implementations which produce positive numbers when the input number is negative: (not verified yet)

bool check_mod(int64 x)
{
    uint64 u_x = x; // u_x is the unsigned version of x
    u_x += ((1<<63)-R)-((1<<63)-R)%D; //carefully move the negative bound to the positive
    k = ctzll(D);/* the power of 2 */
    uint64_t Dodd = D >> k; 
    return rotateRight64((u_x)*mmi64(Dodd),k) <= (ULLONG_MAX-2*R)/D; //unsigned comparison
}

THE FOLLOWING SECTION IS STILL IN EDIT MODE

bonus 1:

We define W as

W*2n + 1 = D * mmi(D)

eg. n = 8, D = 25, mmi(D) = 41,    41*25= 1025   => W = 4

When W is 1 (if, for example, n would be 10 in the example above), all the adjacent remainders will be adjacent to each other after the multiplication, so optimizing  x%D <= a is straight forward. When W is D-1 , the remainder will be in opposite order and it will also be easy to optimize.

When W is not 1, the distance between two adjacent remainder will be the m.m.i of W on modulo D (this time, the modular multiplicative inverse will not be on modulo 2n).

and by using 8bit integers, we will be able to optimize

with m.m.i of 4 above 25. (the jumps above the divisor)

x%25==a || x%25 == (a+19)%25

bonus 2:

if the code we want to optimize looks like that:

unsigned check_mod(unsigned x)
{
    if (x % D == R)
        return x/D;
    return 0;
}

We can notice that we already hold in our hands the quotient of the division.

Advertisements