micro-optimization | 易学教程

Packing two DWORDs into a QWORD to save store bandwidth

Question: Imagine a load-store loop like the following, which loads DWORDs from non-contiguous locations and stores them contiguously:

    top:
    mov eax, DWORD [rsi]
    mov DWORD [rdi], eax
    mov eax, DWORD [rdx]
    mov DWORD [rdi + 4], eax
    ; unroll the above a few times
    ; increment rdi and rsi somehow
    cmp ...
    jne top

On modern Intel and AMD hardware, when running in-cache, such a loop will usually bottleneck on stores at one store per cycle. That's kind of wasteful, since that's only an IPC of 2 (one store, one
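The packing idea named in the title (combine two 32-bit values so that one 64-bit store replaces two 32-bit stores) can be sketched at the arithmetic level. This is only an illustration, not code from the question: the class `DwordPacking` and its methods are hypothetical names, and it assumes a little-endian layout in which the first DWORD occupies the low half of the QWORD.

```java
// Hypothetical helper (not from the original question): packs two 32-bit
// values into one 64-bit value, so that a single 8-byte store could
// replace two 4-byte stores.
final class DwordPacking {
    // 'low' fills bits 0..31 and 'high' fills bits 32..63. On a
    // little-endian machine 'low' would land at the lower address.
    static long pack(int low, int high) {
        // Mask 'low' so that sign extension cannot overwrite the high half.
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }

    // Recover the two halves from the packed value.
    static int low(long packed) {
        return (int) packed;
    }

    static int high(long packed) {
        return (int) (packed >>> 32);
    }

    public static void main(String[] args) {
        long q = pack(0x11223344, 0xAABBCCDD);
        System.out.println(Long.toHexString(q)); // prints aabbccdd11223344
    }
}
```

The mask on `low` matters: without it, a negative `low` would be sign-extended and set every bit of the upper half. Whether the extra shift and OR are cheaper than the store they save is exactly what the question asks, so this sketch makes no performance claim.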

Do java finals help the compiler create more efficient bytecode? [duplicate]

Possible Duplicate: Does use of final keyword in Java improve the performance? The final modifier has different consequences in Java depending on what you apply it to. What I'm wondering is whether it might additionally help the compiler create more efficient bytecode. I suppose the question goes deep into how the JVM works and might be JVM-specific. So, in your expertise, do any of the following help the compiler, or do you only use them for the normal Java reasons? Final classes, final methods, final fields, final method arguments. Thanks! EDIT: Thanks for all your answers! Please note that, as

Can modern x86 implementations store-forward from more than one prior store?

Question: In the case that a load overlaps two earlier stores (and the load is not fully contained in the oldest store), can modern Intel or AMD x86 implementations forward from both stores to satisfy the load? For example, consider the following sequence:

    mov [rdx + 0], eax
    mov [rdx + 2], eax
    mov ax, [rdx + 1]

The final 2-byte load takes its second byte from the immediately preceding store, but its first byte from the store before that. Can this load be store-forwarded, or does it need to wait until

' … != null' or 'null != …' best performance?

I wrote two methods to check their performance:

    public class Test1 {
        private String value;

        public void notNull() {
            if (value != null) {
                // do something
            }
        }

        public void nullNot() {
            if (null != value) {
                // do something
            }
        }
    }

and checked its bytecode after compiling:

    public void notNull();
      Code:
       Stack=1, Locals=1, Args_size=1
       0: aload_0
       1: getfield #2; //Field value:Ljava/lang/String;
       4: ifnull 7
       7: return
      LineNumberTable:
       line 6: 0
       line 9: 7
      StackMapTable: number_of_entries = 1
       frame_type = 7 /* same */

    public void nullNot();
      Code:
       Stack=2, Locals=1, Args_size=1
       0: aconst_null
       1: aload_0
       2: getfield

latency vs throughput in intel intrinsics

Question: I think I have a decent understanding of the difference between latency and throughput, in general. However, the implications of latency on instruction throughput are unclear to me for Intel Intrinsics, particularly when using multiple intrinsic calls sequentially (or nearly sequentially). For example, let's consider _mm_cmpestrc. This has a latency of 11, and a throughput of 7 on a Haswell processor. If I ran this instruction in a loop, would I get a continuous per-cycle output after 11
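As a rough worked example of how the two numbers interact, assume the quoted "throughput of 7" is a reciprocal throughput (one instruction can start every 7 cycles) and that nothing else limits the loop. The symbols $L$, $T$, and $N$ are introduced here for illustration only; this is a simplified model, not a measurement.

\[
\begin{aligned}
\text{independent instructions:}\quad & C_{\text{indep}} \approx L + (N-1)\,T \\
\text{dependent chain:}\quad & C_{\text{dep}} \approx N \cdot L \\
L = 11,\; T = 7,\; N = 100:\quad & C_{\text{indep}} \approx 11 + 99 \cdot 7 = 704, \qquad C_{\text{dep}} \approx 100 \cdot 11 = 1100
\end{aligned}
\]

Under these assumptions, independent calls are bounded by the throughput figure (about one result every 7 cycles once the first has completed), while calls that each depend on the previous result are bounded by the latency.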

Performance / Space implications when ordering SQL Server columns?

Are there any considerations that should be taken into account when designing a new table with regard to the order in which columns are declared? I tend to put the primary key first, followed by any foreign keys (usually surrogate key integers), followed by other columns, but a discussion with a colleague had us wondering whether SQL Server will pad our data, possibly to make it faster. Will SQL Server try to align our data on disk (with padding) to a specific byte alignment boundary for performance reasons (the way a C++ compiler would align a struct under default conditions) or will

Indexed branch overhead on X86 64 bit mode

Question: This question was migrated from Computer Science Stack Exchange because it can be answered on Stack Overflow. Migrated 2 years ago. This is a follow-up to some comments made in this prior thread: Recursive fibonacci Assembly. The following code snippets calculate Fibonacci, the first example with a loop, the second example with a computed jump (indexed branch) into an unfolded loop. This was tested using Visual Studio 2015 Desktop Express on Windows 7 Pro 64 bit mode with an Intel 3770K 3

Which of these pieces of code is faster in Java?

    a) for(int i = 100000; i > 0; i--) {}
    b) for(int i = 1; i < 100001; i++) {}

The answer is there on this website (question 3). I just can't figure out why.

From the website: 3. a

When you get down to the lowest level (machine code but I'll use assembly since it maps one-to-one mostly), the difference between an empty loop decrementing to 0 and one incrementing to 50 (for example) is often along the lines of:

          ld  a,50                ld  a,0
    loop: dec a             loop: inc a
          jnz loop                cmp a,50
                                  jnz loop

That's because the zero flag in most sane CPUs is set by the decrement instruction when you reach zero. The same can't usually

Fastest way to strip all non-printable characters from a Java String

What is the fastest way to strip all non-printable characters from a String in Java? So far I've tried and measured on a 138-byte, 131-character String:

String's replaceAll() (slowest method): 517009 results / sec
Precompile a Pattern, then use Matcher's replaceAll(): 637836 results / sec
Use StringBuffer, get code points using codePointAt() one-by-one and append to StringBuffer: 711946 results / sec
Use StringBuffer, get chars using charAt() one-by-one and append to StringBuffer: 1052964 results / sec
Preallocate a char[] buffer, get chars using charAt() one-by-one and fill this buffer, then convert
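The last approach mentioned above (a preallocated char[] filled via charAt()) might look roughly like the following. This is a minimal sketch, not the asker's code: the class and method names are hypothetical, and because the excerpt does not define "non-printable", the sketch assumes it means characters below the space character (U+0020).

```java
// Hypothetical sketch of the "preallocate a char[] buffer" approach.
// Assumption: "non-printable" means any char below ' ' (U+0020); the
// excerpt itself does not define the term.
final class PrintableFilter {
    static String stripControlChars(String s) {
        int length = s.length();
        char[] buf = new char[length]; // output can never exceed the input
        int out = 0;
        for (int i = 0; i < length; i++) {
            char ch = s.charAt(i);
            if (ch >= ' ') {
                buf[out++] = ch; // keep; note that DEL (U+007F) is also kept
            }
        }
        // Avoid a copy when nothing was removed.
        return out == length ? s : new String(buf, 0, out);
    }

    public static void main(String[] args) {
        System.out.println(stripControlChars("a\tb\nc")); // prints abc
    }
}
```

Sizing the buffer to the input length avoids any resizing during the loop, which is the property the listed approach relies on. A stricter definition of "printable" would only change the condition inside the loop.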

Why date() works twice as fast if we set time zone from code?

Question: Have you noticed that the date() function runs 2x faster than usual if you set the actual time zone inside your script before any date() call? I'm very curious about this. Look at this simple piece of code:

    <?php
    $start = microtime(true);
    for ($i = 0; $i < 100000; $i++)
        date('Y-m-d H:i:s');
    echo (microtime(true) - $start);
    ?>

It just calls the date() function in a for loop 100,000 times. The result I’ve got is always around 1.6 seconds (Windows, PHP 5.3.5) but… If I set the same time zone again, adding one