I have a code in which i am computing x % 25. x always takes a positive value but its dynamic range is large.
I found out that this particular code piece of computing a x % 25 is taking large cycles. I need to optimize it.
Pre-computed lookup table is ruled out due to the possible large memory size of the table.
As second approach i coded a fragment below(C code) -
mod(a, b)
{
int r = a;
while(r >= b)
{
r = r - b;
}
return r;
}
1.) How can i optimize this code further for cycles(squeeze it to max)?
2.) Is there any entirely different optimized way to achieve x % 25( i know its not a common operation, but still, looking for clever inputs people might have used in their experience which might nelp me.).
Thank you.
-AD
EDIT:
I think using a native modulo operator % in C , internally uses a division operation (/) which is costly on the processor i am using.(No div instruction). hence trying to see if custom implemetation can beat the inherent computation using % operator.
-AD
I suggest reading Hacker's Delight. It describes very fast remainder algorithms for constant divisors. They would almost certainly beat a general algorithm.
Update: Here is some example code... It can probably be reworked to avoid the temporary long long.
unsigned mod25(unsigned n)
{
unsigned reciprocal = 1374389535; // 2^35 / 25
unsigned div25 = ((unsigned long long)n * reciprocal) >> 35;
return n - div25 * 25;
}
Here's another solution I came up with:
int mod25(int x){
/* 25 * (all powers of 2 <= INT_MAX), descending */
if (x >= 1677721600) x -= 1677721600;
if (x >= 838860800) x -= 838860800;
if (x >= 419430400) x -= 419430400;
if (x >= 209715200) x -= 209715200;
if (x >= 104857600) x -= 104857600;
if (x >= 52428800) x -= 52428800;
if (x >= 26214400) x -= 26214400;
if (x >= 13107200) x -= 13107200;
if (x >= 6553600) x -= 6553600;
if (x >= 3276800) x -= 3276800;
if (x >= 1638400) x -= 1638400;
if (x >= 819200) x -= 819200;
if (x >= 409600) x -= 409600;
if (x >= 204800) x -= 204800;
if (x >= 102400) x -= 102400;
if (x >= 51200) x -= 51200;
if (x >= 25600) x -= 25600;
if (x >= 12800) x -= 12800;
if (x >= 6400) x -= 6400;
if (x >= 3200) x -= 3200;
if (x >= 1600) x -= 1600;
if (x >= 800) x -= 800;
if (x >= 400) x -= 400;
if (x >= 200) x -= 200;
if (x >= 100) x -= 100;
if (x >= 50) x -= 50;
if (x >= 25) x -= 25;
return x;
}
This doesn't use divides or multiplys, just 27 comparisons and a maximum of 27 subtractions.
It's a little hard to convince yourself that this works, but it does (at least for non-negative values of x).
The above code is really an unrolled version of this:
int mod25(int x){
int divisor;
for(int divisor = 1677721600; divisor >= 25; divisor >>= 1) {
if (x >= divisor) x -= divisor;
}
return x;
}
By unrolling it we avoid doing the loop comparison and also the shifts at the expense of larger code. You could even partially unroll it using Duff's device if you felt so inclined, but with only 27 iterations total, and such a tiny bit of code per-iteration, I'd be inclined to just unroll it all the way.
Here's how it works: Every non-negative integer x can be expressed as (n * 25) + k where n is a non-negative integer and k is an integer from 0 to 24. k also happens to be the result we want, so if we could compute x - (n * 25) we'd get our answer. We want to be able to do this without knowing n up-front, though.
Think about n in binary. If we could turn off each of the 1 bits we'd get 0. One way to do this is to start at large powers of 2 and work our way down, subtracting each power of 2 only if the current value of n is greater than or equal to that power of 2.
Since we're dealing with (n * 25) we actually need descending powers of 2 times 25. Since k is strictly less than 25, and the smallest divisor we ever consider is 25, this works even when we're dealing with (n * 25) + k.
So each comparison + subtraction is zeroing out one bit of n, and at the end we're left with k, the remainder.
Since you want the modulus by a constant, you can probably beat it by using reciprocal multiplication. This paper shows how you can divide by a constant in such a manner, and towards the end, how to get the remainder from it.
Here's the best I could come up with:
int mod25(int x)
{
while((x = (x & 31) + 7 * (x >> 5)) >= 25)
x -= 25;
return x;
}
It approximates x % 25 with x % 32 + 7 * (x/32). The value will overshoot by a multiple of 25, which allows for recursion.
Performance seems to be adequate: A value of x = 2147483647 (aka INT_MAX) needs 11 iterations.
I was inspired by Pax's answer and made a more general purpose algorithm.
int mod(int a, int b) {
int s = b;
while (s <= a) {
s <<= 1;
}
int r = a;
while (r >= b) {
s >>= 1;
if (s <= r) {
r -= s;
}
}
return r;
}
This subtracts power of two multiples of b from a until the result is found.
EDIT: added the if condition to make it work properly.
As an example, if this is doing 100 % 7, it first works out that 7 * 2 * 2 * 2 * 2 = 112. Then it divides 112 (s) by 2 and subtracts that from 100 (r) (when s <= r) and continually does this until the modulo is found. Therefore,
s = 112 / 2 = 56, r = 100 - 56 = 44
s = 56 / 2 = 28, r = 44 - 28 = 16
s = 28 / 2 = 14, r = 16 - 14 = 2
therefore, 100 % 7 = 2
Oh my <deity of choice>. I can't believe some of these answers.
First thing, repeated subtraction, even Pax's version, will never, ever be optimal. Consider, the following:
20 % 25
that's easy and fast using repeated subtraction, but:
65535 % 25
will be horribly slow, 600+ iterations. That's an average of 300 iterations for 16 bit numbers. As for 32 bit number, well, just don't even go there.
The fastest way to do this is to use long division. See Niki's answer.
But, this is what the compiler will be generating anyway, at least, one would hope it is what the compiler is generating. It's always best to check if you're using a compiler for a niche processor.
The best way to speed this up is to not do the modulus in the first place. Why do you need to get the modulus and can you re-factor the code / algorithm to avoid the modulus, or at least, make the modulus trivial.
The problem with your loop is that it's O(n) - it'll be very slow for large values of r. I'd suggest something like this:
for (int s = MAX_SHIFT; s>=0; s--)
if (r > (b<<s)) r -= (b<<s);
But I doubt that your compiler is doing anything much more expensive than that.
On many processors, integer multiplication is faster than integer division. This blog post shows how to replace a constant integer division with a constant integer multiplication. By rearranging the maths a bit you can get the remainder instead of the quotient. Note, however, that if you are using a moderately sophisticated compiler, then this is already done for you. You just write x % 25 and the compiler works out the rest. You should check the generated assembly code for your code, verifying that the compiler has not done this already, before doing this optimisation in C. Also, you should measure (profile) the performance before and after to ensure that you really are making things faster.
Looping will be far slower than doing the division using the native instruction for reasonably large operands.
Edit: see also this paper.
If your C compiler is targeting a CPU with no divide instruction, you can modify your code as follows:
mod(a, b) {
int s = b + b + b + b;
int r = a;
while(r >= s) {
r -= s;
}
while(r >= b) {
r -= b;
}
return r;
}
This works by subtracting the values in chunks of four rather than one, right up until the last one then it switches to subtracting chunks of one.
This should make your code run about four times as fast (assuming 4*b isn't outside the range of your integers). You could even insert more loops (say an 8*b one) before the 4*b one for even more speed.
Other than that, hand-coding assembler may help but I think you'll find quite a boost from the above code without it.
If you know more detail on the way you'll be using the mod call, you can optimize it for your particular cases. For example, if you only want to know modulo 25 of a 16-bit integer, the following code will be much faster than a simplistic loop with variable denominator.
int mod25 (int a) { // a has maximum value of 2^15-1 = 32767
while (a >= 15625) a-= 15625; // at most 2 times.
while (a >= 625) a-= 625; // at most 24 times.
while (a >= 25) a-= 25; // at most 24 times.
return a;
}
Running a test, I find that you have to do 10 million iterations before a noticeable difference appears between that modulo code and the use of the % operator (2 seconds vs. 0 seconds). Up until that point, they were both 0 seconds, although that was run on a fast machine (better for mod25) and with a div instruction (better for % operator) so you'd need to benchmark it on your own hardware.
This is about as fast as you're likely to get without making your code unreadable (although even that shouldn't stop you if you're willing to add lots of comments explaining how it works).
A more general solution for any denominator is to first double the denominator (with bit shifts for speed) as far as possible so that the ensuing subtractions are minimized. Then, as the numerator reduces below the increased denominator, halve the denominator and keep going (until the denominator is back at the start).
int mod (int n, int d) {
/* dx is the adjusted denom, don't let it overflow though. */
int dx = d;
while (((dx << 1) >>1) == dx)
dx <<= 1;
/* This loop processes the dx values until they get too small. */
while (dx >= d) {
/* This loop subtracts the large dx value. */
while (n >= dx)
n -= dx;
dx >>= 1;
}
return n;
}
This actually performs on par with the optimized version of mod25 above while providing a more general solution.
please engage some common sense.
If you could write C code that calculated x % 25 faster than the compiler can, then the compiler would use that faster method.
The original poster made this fantastic assumption that the compiler would use a division. No compiler that I've used in the last ten years would be doing that. It's a multiplication by a constant close to (2^32 / 25) plus some bit twiddling that you won't be able to improve by hand.
There is a remote possibility that you can produce faster code than the compiler to find out whether x % 25 == 0, because you don't actually need code that will calculate x % 25 correctly, only code that calculates x % 25 correctly if it is 0 and doesn't produce a 0 if x % 25 != 0. Savings will probably be sub-nanosecond.
"How do I calculate x % c optimally for various constants c" is a nice puzzle. Compiler writers like nice puzzles. And they are better at solving nice puzzles like this than you are. Especially since they only need a solution that works for one machine where you would have to produce a general solution.
If you don't like % operator:
int mod(int a, int b) {
int integral = a / b;
return a - (b*integral);
}
If you know that b will be a power of 2, you could use bitwise AND instead of the modulo operator. However, the wikipedia page for modulo seems to indicate that any C compiler would notice this and optimize out the modulo anyway.
Possibly not the fastest but reasonably efficient. I haven't got time to test, but use a look up table of (powers of 2) * 25 up to the maximum range/2. Then do a loop. E.g. range up to 3199 needs 7 iterations.
static int pow[] = {25, 50, 100, 200, 400, 800, 1600};
int mod25(int x)
{
int i = sizeof pow /sizeof pow[0];
while (i--)
{
if (x >= pow[i])
x -= pow[i];
}
return x;
}
If you have a very large range but low values are more common then it might be worthwhile usng a binary chop to find the starting point.
int mod25(int x) {
static int divisors[] = {2147483625, 244140625, 9765625, 390625, 15625, 625, 25};
int i;
for (i = 0; i < sizeof(divisors)/sizeof(int); i++) {
int divisor = divisors[i];
while (x >= divisor) {
x -= divisor;
}
}
return x;
}
How it works: We want to decrement x by large multiples of 25 to reduce the value as fast as possible. When the divisor is too big we switch to a smaller multiple of 25. If the divisor is already down to 25 then we're done.
You could try experimenting with different divisors. You just want to make sure that:
- they're descending
- they're all multiples of 25
- the last value is 25
In the code above I used the largest signed-32-bit multiple of 25 plus the powers of 25, which seems reasonable, though I have to admit that I'm not sure that it's optimal.
(BTW: if your compiler doesn't do constant folding -- which would be very surprising -- then you might want to replace the upper-limit of i with a hard-coded constant.)
Why can't you just use the operator %? If this is C code, and the numbers are ordinary "native" int:s, then that should be the fastest way, by far.
Is there a reason why you cant use C's built in modulus operator?
int a = x % 25;
Following your edit;
If your rpocessor does not have built in modulo support then I would still use the % operator for the simple reason that your compiler will know that the processor in question doesnt have a native % function, and will likely produce asm code to optimally emulate it.
Put it this way - I'd be fascinated if you can come up with a genarl algorithm that outperforms whatevr the compiler produces from using the built in operator, notwithsatanding specific cases (such as simply taking the 2 lowest digits for modulo 100 etc)
How about:
int y = 0, x = (x & 0x7f);
while (x > 25) { x -= 25; y++; }
Update: it's pretty wrong :) But the idea is there.
I find it pretty strange that the operation x % 25 takes such long time (if you are using the built-in % operator, that is). Most modern processors should do this in a single instruction. I'd look for other reasons that this code takes so long.
EDIT: Here's an algorithm that might at least give some ideas:
256 = 6 (mod 25)
This means that if we write a number x as bytes x3 x2 x1 x0 we have that x = 6^3*x3 + 6^2*x2 + 6*x1 + x0 (mod 25)
This gives an algorithm for reducing the size of x:
int x0 = x & 0xFF, x1 = (x>>8) & 0xFF, x2 = (x>>16) & 0xFF, x3 = (x>>24) & 0xFF;
int y = x4;
y = (y << 2) + (y << 1) + x3;
y = (y << 2) + (y << 1) + x2;
y = (y << 2) + (y << 1) + x1;
y = (y << 2) + (y << 1) + x0;
(here (y << 2) + (y << 1) = 4*y + 2*y = 6*y)
After this y will have the same remainder as x mod 25.
Iterating this 1, 2 or 3 times will make y a 17, 11, or 9 bit number, respectively. One of these sizes might be small enough to make a lookup table of.
I SERIOUSLY doubt that this would be faster than the builtin % operator, though.
If you kept your numbers in BCD or a byte array of digits, this would be pretty easy. Unfortunately, I have no idea what else you're doing in your program with these numbers. Sometimes it pays to look at how you represent your data rather than just bang away at algorithms.
Heres an Idea
static int table0[256];
static int table1[256];
static int table2[256];
static int table3[256];
// ran just once to initialize the tables
void initialMod25Tables() {
for (int i = 0; i < 256; ++i) {
table0[i] = i % 25;
}
for (int i = 0; i < 256; ++i) {
table1[i] = (i << 8) % 25;
}
for (int i = 0; i < 256; ++i) {
table2[i] = (i << 16) % 25;
}
for (int i = 0; i < 256; ++i) {
table3[i] = (i << 24) % 25;
}
}
int mod25(int x) {
int y = table0[x & 0xFF];
x >>= 8;
y += table1[x & 0xFF];
x >>= 8;
y += table2[x & 0xFF];
x >>= 8;
y += table3[x & 0xFF];
y = table0[y];
return y;
}
If you are only considering the number 25 you can use the fact that 25 divies an integer if and only if the last two digits of the integer are 00, 25, 50 or 75. So to get the modulo you consider the last two digits and then subtract the nearest of 00, 25, 50 or 75.
来源:https://stackoverflow.com/questions/980702/efficient-cycles-wise-algorithm-to-compute-modulo-25