On my computer this code takes 17 seconds for 1,000 million iterations:
static void Main(string[] args) {
var sw = new Stopwatch(); sw.Start();
int r;
for
This is really just a comment, but I don't get enough room.
Here is some C# using Math.DivRem():
[Fact]
public void MathTest()
{
    for (var i = 1; i <= 10; i++)
    {
        int remainder;
        var result = Math.DivRem(10, i, out remainder);
        // Use the values so they aren't optimized away
        Assert.True(result >= 0);
        Assert.True(remainder >= 0);
    }
}
Here is the corresponding IL:
.method public hidebysig instance void MathTest() cil managed
{
.custom instance void [xunit]Xunit.FactAttribute::.ctor()
.maxstack 3
.locals init (
[0] int32 i,
[1] int32 remainder,
[2] int32 result)
L_0000: ldc.i4.1
L_0001: stloc.0
L_0002: br.s L_002b
L_0004: ldc.i4.s 10
L_0006: ldloc.0
L_0007: ldloca.s remainder
L_0009: call int32 [mscorlib]System.Math::DivRem(int32, int32, int32&)
L_000e: stloc.2
L_000f: ldloc.2
L_0010: ldc.i4.0
L_0011: clt
L_0013: ldc.i4.0
L_0014: ceq
L_0016: call void [xunit]Xunit.Assert::True(bool)
L_001b: ldloc.1
L_001c: ldc.i4.0
L_001d: clt
L_001f: ldc.i4.0
L_0020: ceq
L_0022: call void [xunit]Xunit.Assert::True(bool)
L_0027: ldloc.0
L_0028: ldc.i4.1
L_0029: add
L_002a: stloc.0
L_002b: ldloc.0
L_002c: ldc.i4.s 10
L_002e: ble.s L_0004
L_0030: ret
}
Here is the (relevant) optimized x86 assembly generated:
for (var i = 1; i <= 10; i++)
00000000 push ebp
00000001 mov ebp,esp
00000003 push esi
00000004 push eax
00000005 xor eax,eax
00000007 mov dword ptr [ebp-8],eax
0000000a mov esi,1
{
int remainder;
var result = Math.DivRem(10, i, out remainder);
0000000f mov eax,0Ah
00000014 cdq
00000015 idiv eax,esi
00000017 mov dword ptr [ebp-8],edx
0000001a mov eax,0Ah
0000001f cdq
00000020 idiv eax,esi
Note the two idiv instructions. The first stores the remainder (EDX) into the remainder parameter on the stack. The second determines the quotient (EAX). This second idiv is not really needed, since EAX already holds the correct value after the first idiv.
Grrr. The only reason for this function to exist is to take advantage of the CPU instruction for this, and they didn't even do it!
The answer is probably that nobody has thought this a priority - it's good enough. The fact that this has not been fixed with any new version of the .NET Framework is an indicator of how rarely this is used - most likely, nobody has ever complained.
If I had to take a wild guess, I'd say that whoever implemented Math.DivRem had no idea that x86 processors are capable of doing it in a single instruction, so they wrote it as two operations. That's not necessarily a bad thing if the optimizer works correctly, though it is yet another indicator that low-level knowledge is sadly lacking in most programmers nowadays. I would expect the optimizer to collapse modulus and then divide operations into one instruction, and the people who write optimizers should know these sorts of low-level things...
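To make the "two operations" guess concrete, here is a minimal sketch of what such an implementation would look like, assuming it is nothing more than % followed by /. The class and method names (NaiveDivRem, DivRemTwoOps) are made up for illustration and are not the framework's own code. Written this way, each operator can end up as its own idiv unless the JIT merges them.

```csharp
using System;

public static class NaiveDivRem
{
    // Hypothetical helper, not the framework source: remainder and quotient
    // are computed with two separate operators, so a JIT that does not
    // merge them may emit one idiv for each.
    public static int DivRemTwoOps(int a, int b, out int remainder)
    {
        remainder = a % b; // first division: only the remainder is kept
        return a / b;      // second division: only the quotient is kept
    }

    public static void Main()
    {
        int q = DivRemTwoOps(10, 3, out int r);
        Console.WriteLine($"{q} {r}"); // prints "3 1"
    }
}
```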
It's partly in the nature of the beast. To the best of my knowledge, there is no generally quick way to calculate the remainder of a division, so it is going to take a correspondingly large number of clock cycles, even with x hundred million transistors.
While .NET Framework 4.6.2 still uses the suboptimal modulo and divide, .NET Core (CoreCLR) currently keeps a single divide and replaces the modulo with a multiply and subtract:
public static int DivRem(int a, int b, out int result)
{
    // TODO https://github.com/dotnet/runtime/issues/5213:
    // Restore to using % and / when the JIT is able to eliminate one of the idivs.
    // In the meantime, a * and - is measurably faster than an extra /.
    int div = a / b;
    result = a - (div * b);
    return div;
}
And there's an open issue to either improve DivRem specifically (via intrinsic), or detect and optimise the general case in RyuJIT.
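If you want to convince yourself that the multiply-and-subtract form returns the same values as % and /, a small check along these lines should do. This is only a sketch: DivRemCheck, DivRemMulSub and AgreesWithOperators are hypothetical names, and the input range is an arbitrary choice that includes negative operands and skips division by zero.

```csharp
using System;

public static class DivRemCheck
{
    // Same shape as the CoreCLR snippet above: one division, then a
    // multiply and subtract to recover the remainder.
    public static int DivRemMulSub(int a, int b, out int remainder)
    {
        int div = a / b;
        remainder = a - (div * b);
        return div;
    }

    // True when the multiply-subtract form matches the built-in operators.
    public static bool AgreesWithOperators(int a, int b)
    {
        int q = DivRemMulSub(a, b, out int r);
        return q == a / b && r == a % b;
    }

    public static void Main()
    {
        int mismatches = 0;
        for (int a = -50; a <= 50; a++)
        {
            for (int b = -7; b <= 7; b++)
            {
                if (b == 0) continue; // division by zero would throw
                if (!AgreesWithOperators(a, b)) mismatches++;
            }
        }
        Console.WriteLine(mismatches); // prints 0
    }
}
```

The two forms agree because C# integer division truncates toward zero and % takes the sign of the dividend, so a == (a / b) * b + (a % b) holds for every pair where the division itself does not overflow.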