I'm interested in how efficient low-level algorithms can be in .NET. I would like to enable us to choose to write more of our code in C# rather than C++ in the future, but
First of all, I would like to thank everyone who has spoken up in this post, from the original OP to the people who provided extremely detailed and insightful explanations. I really, really enjoyed reading the existing answers. Since there is already plenty of theory about how and why the loops work the way they do, I would like to offer some empirical (by some definition, authoritative) measurements:
Conclusions:

- Caching the array length in a local variable is measurably faster than reading the .Length property on every iteration.
- unsafe fixed is not faster than a normal for loop.
- foreach is the fastest of the four variants.

Benchmarking code:
using System;
using System.Diagnostics;
using System.Runtime;

namespace demo
{
    class MainClass
    {
        static bool ByForArrayLength (byte[] data)
        {
            for (int i = 0; i < data.Length; i++)
                if (data [i] != 0)
                    return false;
            return true;
        }

        static bool ByForLocalLength (byte[] data)
        {
            int len = data.Length;
            for (int i = 0; i < len; i++)
                if (data [i] != 0)
                    return false;
            return true;
        }

        static unsafe bool ByForUnsafe (byte[] data)
        {
            fixed (byte* datap = data)
            {
                int len = data.Length;
                for (int i = 0; i < len; i++)
                    if (datap [i] != 0)
                        return false;
                return true;
            }
        }

        static bool ByForeach (byte[] data)
        {
            foreach (byte b in data)
                if (b != 0)
                    return false;
            return true;
        }

        static void Measure (Action work, string description)
        {
            GCSettings.LatencyMode = GCLatencyMode.LowLatency;
            var watch = Stopwatch.StartNew ();
            work.Invoke ();
            Console.WriteLine ("{0,-40}: {1} ms", description, watch.Elapsed.TotalMilliseconds);
        }

        public static void Main (string[] args)
        {
            byte[] data = new byte[256 * 1024 * 1024];
            Measure (() => ByForArrayLength (data), "For with .Length property");
            Measure (() => ByForLocalLength (data), "For with local variable");
            Measure (() => ByForUnsafe (data), "For with local variable and GC-pinning");
            Measure (() => ByForeach (data), "Foreach loop");
        }
    }
}
Results: (using the Mono runtime)

$ mcs Program.cs -optimize -unsafe

For with .Length property               : 440,9208 ms
For with local variable                 : 333,2252 ms
For with local variable and GC-pinning  : 330,2205 ms
Foreach loop                            : 280,5205 ms
One way to be sure that bounds checking is not performed is to use pointers, which you can do in C# in unsafe mode (this requires you to enable the "Allow unsafe code" flag in the project's build properties):
private static unsafe double SumProductPointer(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    fixed (double* xp = X, yp = Y)
    {
        for (int i = 0; i < length; i++)
            sum += xp[i] * yp[i];
    }
    return sum;
}
I tried measuring your original method, your method with the X.Length change, and my code using pointers, compiled as both x86 and x64 under .NET 4.5. Specifically, I computed the method on vectors of length 10,000 and ran each method 10,000 times.
The results are pretty much in line with Michael Liu's answer: there is no measurable difference between the three methods, which means that bounds checking either isn't performed or that its effect on performance is insignificant. There was a measurable difference between x86 and x64, though: x64 was about 34 % slower.
Full code I used:
using System;
using System.Diagnostics;
using System.Linq;

static void Main()
{
    var random = new Random(42);
    double[] x = Enumerable.Range(0, 10000).Select(_ => random.NextDouble()).ToArray();
    double[] y = Enumerable.Range(0, 10000).Select(_ => random.NextDouble()).ToArray();

    // Warm up, so that JIT compilation doesn't affect the results
    SumProduct(x, y);
    SumProductLength(x, y);
    SumProductPointer(x, y);

    var stopwatch = new Stopwatch();
    stopwatch.Start();
    for (int i = 0; i < 10000; i++)
    {
        SumProduct(x, y);
    }
    Console.WriteLine(stopwatch.ElapsedMilliseconds);

    stopwatch.Restart();
    for (int i = 0; i < 10000; i++)
    {
        SumProductLength(x, y);
    }
    Console.WriteLine(stopwatch.ElapsedMilliseconds);

    stopwatch.Restart();
    for (int i = 0; i < 10000; i++)
    {
        SumProductPointer(x, y);
    }
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

private static double SumProduct(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    for (int i = 0; i < length; i++)
        sum += X[i] * Y[i];
    return sum;
}

private static double SumProductLength(double[] X, double[] Y)
{
    double sum = 0;
    if (X.Length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    for (int i = 0; i < X.Length; i++)
        sum += X[i] * Y[i];
    return sum;
}

private static unsafe double SumProductPointer(double[] X, double[] Y)
{
    double sum = 0;
    int length = X.Length;
    if (length != Y.Length)
        throw new ArgumentException("X and Y must be same size");
    fixed (double* xp = X, yp = Y)
    {
        for (int i = 0; i < length; i++)
            sum += xp[i] * yp[i];
    }
    return sum;
}
The 64-bit jitter does a good job of eliminating bounds checks (at least in straightforward scenarios). I added return sum; at the end of your method and then compiled the program using Visual Studio 2010 in Release mode. In the disassembly below (which I annotated with a C# translation), notice that:
- There is no bounds check on X, even though your code compares i against length instead of X.Length. This is an improvement over the behavior described in the article.
- The bounds check on Y is hoisted out of the loop: a single check before the loop verifies that Y.Length >= X.Length, so no per-iteration check is needed.

Disassembly:
; Register assignments:
; rcx := i
; rdx := X
; r8 := Y
; r9 := X.Length ("length" in your code, "XLength" below)
; r10 := Y.Length ("YLength" below)
; r11 := X.Length - 1 ("XLengthMinus1" below)
; xmm1 := sum
; (Prologue)
00000000 push rbx
00000001 push rdi
00000002 sub rsp,28h
; (Store arguments X and Y in rdx and r8)
00000006 mov r8,rdx ; Y
00000009 mov rdx,rcx ; X
; int XLength = X.Length;
0000000c mov r9,qword ptr [rdx+8]
; int XLengthMinus1 = XLength - 1;
00000010 movsxd rax,r9d
00000013 lea r11,[rax-1]
; int YLength = Y.Length;
00000017 mov r10,qword ptr [r8+8]
; if (XLength != YLength)
; throw new ArgumentException("X and Y must be same size");
0000001b cmp r9d,r10d
0000001e jne 0000000000000060
; double sum = 0;
00000020 xorpd xmm1,xmm1
; if (XLength > 0)
; {
00000024 test r9d,r9d
00000027 jle 0000000000000054
; int i = 0;
00000029 xor ecx,ecx
0000002b xor eax,eax
; if (XLengthMinus1 >= YLength)
; throw new IndexOutOfRangeException();
0000002d cmp r11,r10
00000030 jae 0000000000000096
; do
; {
; sum += X[i] * Y[i];
00000032 movsd xmm0,mmword ptr [rdx+rax+10h]
00000038 mulsd xmm0,mmword ptr [r8+rax+10h]
0000003f addsd xmm0,xmm1
00000043 movapd xmm1,xmm0
; i++;
00000047 inc ecx
00000049 add rax,8
; }
; while (i < XLength);
0000004f cmp ecx,r9d
00000052 jl 0000000000000032
; }
; return sum;
00000054 movapd xmm0,xmm1
; (Epilogue)
00000058 add rsp,28h
0000005c pop rdi
0000005d pop rbx
0000005e ret
00000060 ...
00000096 ...
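In C# terms, the 64-bit jitter's transformation amounts to roughly the following sketch. The method name is hypothetical, and the hoisted check is written out by hand where the jitter emits it implicitly:

```csharp
using System;

static class BoundsCheckSketch
{
    // Sketch of the loop shape after the 64-bit jitter hoists Y's bounds check.
    public static double SumProductAsJitted(double[] X, double[] Y)
    {
        double sum = 0;
        int length = X.Length;
        if (length != Y.Length)
            throw new ArgumentException("X and Y must be same size");
        if (length > 0)
        {
            // Single up-front check replaces the per-iteration check on Y;
            // it corresponds to the cmp r11,r10 / jae pair before the loop.
            if ((uint)(length - 1) >= (uint)Y.Length)
                throw new IndexOutOfRangeException();
            for (int i = 0; i < length; i++)
                sum += X[i] * Y[i]; // no bounds checks inside the loop body
        }
        return sum;
    }
}
```

Because the method has already established length == Y.Length, the hoisted check can never actually throw here; it exists so the loop body can run unchecked.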
The 32-bit jitter, unfortunately, is not quite as smart. In the disassembly below, notice that:
- There is no bounds check on X, even though your code compares i against length instead of X.Length. Again, this is an improvement over the behavior described in the article.
- The bounds check on Y, however, remains inside the loop and is performed on every iteration.

Disassembly:
; Register assignments:
; eax := i
; ecx := X
; edx := Y
; esi := X.Length ("length" in your code, "XLength" below)
; (Prologue)
00000000 push ebp
00000001 mov ebp,esp
00000003 push esi
; double sum = 0;
00000004 fldz
; int XLength = X.Length;
00000006 mov esi,dword ptr [ecx+4]
; if (XLength != Y.Length)
; throw new ArgumentException("X and Y must be same size");
00000009 cmp dword ptr [edx+4],esi
0000000c je 00000012
0000000e fstp st(0)
00000010 jmp 0000002F
; int i = 0;
00000012 xor eax,eax
; if (XLength > 0)
; {
00000014 test esi,esi
00000016 jle 0000002C
; do
; {
; double temp = X[i];
00000018 fld qword ptr [ecx+eax*8+8]
; if (i >= Y.Length)
; throw new IndexOutOfRangeException();
0000001c cmp eax,dword ptr [edx+4]
0000001f jae 0000005A
; sum += temp * Y[i];
00000021 fmul qword ptr [edx+eax*8+8]
00000025 faddp st(1),st
; i++;
00000027 inc eax
; while (i < XLength);
00000028 cmp eax,esi
0000002a jl 00000018
; }
; return sum;
0000002c pop esi
0000002d pop ebp
0000002e ret
0000002f ...
0000005a ...
The jitter has improved since 2009, and the 64-bit jitter can generate more efficient code than the 32-bit jitter.
If necessary, though, you can always bypass array bounds checks completely by using unsafe code and pointers (as svick points out). This technique is used by some performance-critical code in the Base Class Library.
The bounds check won't matter because:
- The bounds check consists of a cmp/jae instruction pair, which is fused into a single micro-op on modern CPU architectures (the term is "macro-op fusion"). Compare-and-branch is very highly optimized.
- The bounds check is a forward branch, which will be statically predicted as not-taken, further reducing its cost. The branch will never actually be taken. (If it ever is, an exception will be thrown anyway, so the misprediction cost becomes utterly irrelevant.)
- As soon as there is any memory delay, speculative execution will queue up many iterations of the loop, so the cost of decoding the extra instruction pair almost disappears.
- Memory access will likely be your bottleneck, so the effect of micro-optimizations like removing bounds checks will disappear.
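Part of why the check is so cheap is that the runtime folds the two logical comparisons (i < 0 and i >= Length) into one unsigned comparison. A hedged C# sketch of the moral equivalent (the method name is hypothetical):

```csharp
using System;

static class BoundsCheck
{
    // Sketch of what an array bounds check amounts to. A single unsigned
    // comparison rejects both negative indices (which wrap to huge uint
    // values) and indices past the end, which is why a checked access
    // costs just one cmp/jae pair.
    public static byte GetChecked(byte[] data, int i)
    {
        if ((uint)i >= (uint)data.Length)
            throw new IndexOutOfRangeException();
        return data[i];
    }
}
```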