microbenchmark

Why jnz requires 2 cycles to complete in an inner loop

佐手、 submitted on 2020-01-20 08:07:45
Question: I'm on an IvyBridge. I found the performance behavior of jnz inconsistent between the inner loop and the outer loop. The following simple program has an inner loop with a fixed size of 16:

global _start
_start:
    mov rcx, 100000000
.loop_outer:
    mov rax, 16
.loop_inner:
    dec rax
    jnz .loop_inner
    dec rcx
    jnz .loop_outer
    xor edi, edi
    mov eax, 60
    syscall

The perf tool shows the outer loop runs at 32 cycles per iteration, which suggests that jnz takes 2 cycles to complete. I then searched Agner's instruction tables; a conditional jump has 1-2 …
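
Not part of the original question: a minimal C++/inline-asm sketch that reproduces the same nested loop, assuming GCC or Clang on x86-64. It measures with __rdtsc(), i.e. reference cycles rather than core clock cycles, so the number is only approximate; use perf stat for exact core-cycle counts.

```cpp
// Rough reproduction of the question's nested loop; compile with g++ -O2.
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

int main() {
    const uint64_t outer = 100000000;      // same outer count as the question
    uint64_t t0 = __rdtsc();
    asm volatile(
        "mov  %[n], %%rcx   \n\t"
        "1:                 \n\t"          // .loop_outer
        "mov  $16, %%rax    \n\t"
        "2:                 \n\t"          // .loop_inner
        "dec  %%rax         \n\t"
        "jnz  2b            \n\t"
        "dec  %%rcx         \n\t"
        "jnz  1b            \n\t"
        :
        : [n] "r"(outer)
        : "rax", "rcx", "cc");
    uint64_t t1 = __rdtsc();
    printf("%.2f reference cycles per outer iteration\n",
           (double)(t1 - t0) / (double)outer);
}
```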

How to use Caliper benchmark beta snapshot without maven?

谁说我不能喝 submitted on 2020-01-15 06:17:30
Question: I have been asked to use Google's Caliper project to create a few microbenchmarks. I would very much like to use the annotation features of the newest beta snapshot, but aside from a few small examples I am having trouble finding good documentation on how to actually run the thing. There is a video tutorial that walks users through the new Maven integration feature, which I was also asked NOT to use. Right now I just have a small example stripped from one of theirs, modified with some …

Long latency instruction

我怕爱的太早我们不能终老 submitted on 2020-01-14 19:39:12
Question: I would like a long-latency, single-uop x86 instruction, in order to create long dependency chains as part of testing microarchitectural features. Currently I'm using fsqrt, but I'm wondering if there is something better. Ideally, the instruction will score well on the following criteria:

- Long latency
- Stable/fixed latency
- One or a few uops (especially: not microcoded)
- Consumes as few uarch resources as possible (load/store buffers, page walkers, etc.)
- Able to chain (latency-wise) with itself
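
Not from the question: a rough C++ sketch of the kind of latency-bound dependency chain such an instruction would feed, built here from sqrt because each iteration's input depends on the previous result. Compile with something like -O2 -fno-math-errno so a bare sqrtsd is emitted instead of a libm call; the iteration count and constants are arbitrary.

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    const long iters = 100000000;
    double x = 1.0e18;                    // stays far from zero/denormals
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        // Serial chain: the next sqrt cannot start until the previous sqrt
        // (plus the add) has finished, so the loop measures latency, not throughput.
        x = std::sqrt(x) + 1.0e18;
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> ns = stop - start;
    // Printing x keeps the chain from being optimized away.
    printf("%.2f ns per chained iteration (x = %g)\n", ns.count() / iters, x);
}
```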

Fastest Linux system call

偶尔善良 submitted on 2020-01-10 02:53:08
Question: On an x86-64 Intel system that supports syscall and sysret, what's the "fastest" system call from 64-bit user code on a vanilla kernel? In particular, it must be a system call that exercises the syscall / sysret user <-> kernel transition, but does the least amount of work beyond that. It doesn't even need to perform the system call itself: some type of early error which never dispatches to the specific call on the kernel side is fine, as long as it doesn't go down some slow path because of that.
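
Not from the question: a rough C++ timing sketch, assuming Linux x86-64 and glibc's syscall(2) wrapper. An out-of-range number such as -1 is rejected with -ENOSYS before any handler is dispatched, but whether that stays on the fastest path can vary by kernel, so it is worth comparing against a trivial real call like getpid. Iteration count is arbitrary.

```cpp
#include <chrono>
#include <cstdio>
#include <unistd.h>
#include <sys/syscall.h>

// Average nanoseconds per user->kernel->user round trip for a given syscall number.
static double time_ns(long number, long iters) {
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        syscall(number);                  // return value deliberately ignored
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> ns = stop - start;
    return ns.count() / iters;
}

int main() {
    const long iters = 1000000;
    printf("syscall(-1): %.1f ns per round trip\n", time_ns(-1, iters));
    printf("getpid():    %.1f ns per round trip\n", time_ns(SYS_getpid, iters));
}
```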

JMH: don't take into account inner method time

走远了吗. submitted on 2020-01-04 23:25:17
Question: I have methods like this:

@GenerateMicroBenchmark
public static void calculateArraySummary(String[] args) {
    // create a random data set
    /* PROBLEM HERE:
     * now I measure not only pool.invoke(finder) time,
     * but also generateRandomArray method time
     */
    final int[] array = generateRandomArray(1000000);
    // submit the task to the pool
    final ForkJoinPool pool = new ForkJoinPool(4);
    final ArraySummator finder = new ArraySummator(array);
    System.out.println(pool.invoke(finder));
}

private static int[] …

Capturing (externally) the memory consumption of a given Callback

萝らか妹 submitted on 2019-12-29 09:09:05
Question: The problem: let's say I have this function:

function hog($i = 1) // uses $i * 0.5 MiB, returns $i * 0.25 MiB
{
    $s = str_repeat('a', $i * 1024 * 512);
    return substr($s, $i * 1024 * 256);
}

I would like to call it and be able to inspect the maximum amount of memory it uses. In other words: memory_get_function_peak_usage($callback);. Is this possible? What I have tried: I'm using the following values as my non-monotonically increasing $i argument for hog(): $iterations = array_merge(range(0, 50, …

How to minimize the costs for allocating and initializing an NSDateFormatter?

纵然是瞬间 submitted on 2019-12-29 07:37:06
Question: I noticed that using an NSDateFormatter can be quite costly. I figured out that allocating and initializing the object already consumes a lot of time. Further, it seems that using an NSDateFormatter from multiple threads increases the cost. Can there be blocking where the threads have to wait for each other? I created a small test application to illustrate the problem; please check it out: http://github.com/johnjohndoe/TestNSDateFormatter git://github.com/johnjohndoe/TestNSDateFormatter.git

Calculate encryption time for AES/CCM in Visual Studio 2017

孤人 submitted on 2019-12-29 02:14:08
Question: I am using the Crypto++ 5.6.5 library and Visual Studio 2017. How can I calculate the encryption time for AES-CCM?

Answer 1: "I would like to know how to calculate the encryption time for AES-CCM." The Crypto++ wiki provides an article, Benchmarks, with a lot of detail on library performance, how throughput is calculated, and references to the source code where the actual throughput is measured. Believe it or not, a simple call to clock works just fine for measuring bulk encryption.
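
Not part of the answer as written, but a rough sketch of the clock-based approach it describes, assuming Crypto++ is installed with the usual <cryptopp/...> header layout and linked as a library. The buffer size, iteration count, all-zero key/nonce, and 8-byte tag are arbitrary demo choices, not values from the question.

```cpp
#include <ctime>
#include <iostream>
#include <string>
#include <cryptopp/aes.h>
#include <cryptopp/ccm.h>
#include <cryptopp/filters.h>

int main() {
    using namespace CryptoPP;
    byte key[AES::DEFAULT_KEYLENGTH] = {0};   // 16-byte demo key
    byte iv[12] = {0};                        // CCM nonce must be 7..13 bytes
    std::string plain(1024 * 1024, 'A'), cipher;

    const int iters = 50;
    std::clock_t start = std::clock();
    for (int i = 0; i < iters; ++i) {
        CCM<AES, 8>::Encryption enc;          // 8-byte authentication tag
        enc.SetKeyWithIV(key, sizeof(key), iv, sizeof(iv));
        enc.SpecifyDataLengths(0, plain.size(), 0);
        cipher.clear();
        StringSource(plain, true,
            new AuthenticatedEncryptionFilter(enc, new StringSink(cipher)));
    }
    std::clock_t stop = std::clock();

    double secs = double(stop - start) / CLOCKS_PER_SEC;
    double mib = double(iters) * plain.size() / (1024.0 * 1024.0);
    std::cout << mib / secs << " MiB/s (CPU time)" << std::endl;
}
```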

Why do two consecutive calls to the same method yield different times for execution?

假如想象 submitted on 2019-12-24 00:59:27
Question: Here is a sample code:

public class TestIO {
    public static void main(String[] str) {
        TestIO t = new TestIO();
        t.fOne();
        t.fTwo();
        t.fOne();
        t.fTwo();
    }

    public void fOne() {
        long t1, t2;
        t1 = System.nanoTime();
        int i = 10;
        int j = 10;
        int k = j * i;
        System.out.println(k);
        t2 = System.nanoTime();
        System.out.println("Time taken by 'fOne' ... " + (t2 - t1));
    }

    public void fTwo() {
        long t1, t2;
        t1 = System.nanoTime();
        int i = 10;
        int j = 10;
        int k = j * i;
        System.out.println(k);
        t2 = System.nanoTime();
        …