cpu-architecture

How can creating a three-past-the-end pointer cause a hardware trap even if the pointer is never dereferenced?

会有一股神秘感。 Submitted on 2019-11-30 04:02:28
Question: In his November 1, 2005 C++ column, Herb Sutter writes:

    int A[17];
    int* endA = A + 17;
    for( int* ptr = A; ptr < endA; ptr += 5 )
    {
        // ...
    }

    [O]n some CPU architectures, including current ones, the aforementioned
    code can cause a hardware trap to occur at the point where the
    three-past-the-end pointer is created, whether that pointer is ever
    dereferenced or not.

How does a CPU trap on a bit pattern? What about:

    int A[17];            // (i)  will hardware trap this?
    int *pUgly = A + 18;  // (ii)
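
For context, the loop above eventually forms A + 20, which is three past the end; the language only guarantees that pointers up to one past the end (A + 17) are representable. Below is a minimal sketch of one safe rewrite (the structure and variable names are my own, not from Sutter's column): bound the index before converting it to a pointer, so no out-of-range pointer value is ever created.

    #include <cstddef>

    int main() {
        int A[17] = {};
        // Iterate by index and bound the index first: the only pointers
        // ever materialized are &A[0], &A[5], &A[10], &A[15], all of
        // which point at real elements of A.
        for (std::size_t i = 0; i < 17; i += 5) {
            int* ptr = A + i;   // i <= 15 here, so ptr is always in range
            *ptr = 0;           // example use
        }
        return 0;
    }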

Can the LSD issue uOPs from the next iteration of the detected loop?

允我心安 Submitted on 2019-11-30 03:23:16
Question: I was investigating the capabilities of the branch unit on port 0 of my Haswell, starting with a very simple loop:

    BITS 64
    GLOBAL _start

    SECTION .text

    _start:
        mov ecx, 10000000
    .loop:
        dec ecx   ;|
        jz .end   ;| 1 uOP (call it D)
        jmp .loop ;| 1 uOP (call it J)
    .end:
        mov eax, 60
        xor edi, edi
        syscall

Using perf we see that the loop runs at 1c/iter:

    Performance counter stats for './main' (50 runs):

        10,001,055      uops_executed_port_port_6    ( +- 0.00% )
         9,999,973      uops_executed_port_port_0    ( +- 0.00% )

How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?

落爺英雄遲暮 Submitted on 2019-11-30 03:09:04
Question: I am on the hook to analyze some "timing channels" in some x86 binary code, and I am posting this question to understand the bsf/bsr opcodes. At a high level, these two opcodes can be modeled as a "loop" that counts the leading or trailing zeros of a given operand. The x86 manual has a good formalization of these opcodes, something like the following:

    IF SRC = 0
        THEN
            ZF ← 1;
            DEST is undefined;
        ELSE
            ZF ← 0;
            temp ← OperandSize - 1;
            WHILE Bit(SRC, temp) = 0
            DO
                temp ← temp - 1;
            OD;
            DEST ← temp;
    FI;
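
The pseudocode above (this variant is BSR's) describes the architectural result, not the circuit. Hardware implements it as a priority encoder, a fixed-depth parallel tree of gates, so the latency (3 cycles on recent Intel cores) is the same regardless of which bit is set. A sketch of the portable C++20 equivalent, assuming a C++20 compiler (the function name is mine):

    #include <bit>       // C++20
    #include <cstdint>
    #include <cstdio>

    // Equivalent of BSR for a nonzero input: index of the highest set
    // bit. Compilers typically lower std::countl_zero to a single
    // lzcnt/bsr instruction; nothing iterates bit by bit.
    unsigned bit_scan_reverse(std::uint32_t x) {
        return 31u - static_cast<unsigned>(std::countl_zero(x));  // precondition: x != 0
    }

    int main() {
        std::printf("%u\n", bit_scan_reverse(0x40u));  // prints 6
    }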

Why were bitwise operations slightly faster than addition/subtraction operations on older microprocessors?

痴心易碎 Submitted on 2019-11-30 03:04:18
I came across this excerpt today:

    On most older microprocessors, bitwise operations are slightly faster
    than addition and subtraction operations and usually significantly
    faster than multiplication and division operations. On modern
    architectures, this is not the case: bitwise operations are generally
    the same speed as addition (though still faster than multiplication).

I'm curious why bitwise operations were slightly faster than addition/subtraction operations on older microprocessors. All I can think of that would cause the latency is that the circuits implementing addition/subtraction must propagate a carry from bit to bit, while a bitwise operation computes every bit independently.
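
To make the carry-chain argument concrete, here is a toy gate-level model (my own sketch, not from the excerpt). A bitwise AND takes one gate delay for all bits in parallel; a ripple-carry adder must pass the carry through every bit position in turn, so its logic depth grows with operand width. Real ALUs shorten the chain with carry-lookahead, which is one reason the gap closed on modern designs.

    #include <cstdint>
    #include <cstdio>

    std::uint8_t ripple_add(std::uint8_t a, std::uint8_t b) {
        std::uint8_t sum = 0, carry = 0;
        for (int i = 0; i < 8; ++i) {                  // serial dependency chain
            std::uint8_t ai = (a >> i) & 1, bi = (b >> i) & 1;
            sum |= static_cast<std::uint8_t>((ai ^ bi ^ carry) << i);
            carry = (ai & bi) | (carry & (ai ^ bi));   // next step waits on this
        }
        return sum;
    }

    int main() {
        std::printf("%u\n", ripple_add(100, 55));  // prints 155
    }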

Porting 32 bit C++ code to 64 bit - is it worth it? Why?

夙愿已清 Submitted on 2019-11-29 22:49:58
I am aware of some of the obvious gains of the x64 architecture (a larger addressable memory space, etc.)... but: What if my program has no real need to run in native 64-bit mode? Should I port it anyway? Are there any foreseeable deadlines for ending 32-bit support? Would my application run faster / better / more secure as native x64 code? x86-64 is a bit of a special case - for many architectures (e.g. SPARC), compiling an application for 64-bit mode doesn't give it any benefit unless it can profitably use more than 4GB of memory. All it does is increase the size of the binary, which can actually hurt performance by putting more pressure on caches and memory.
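
One concrete porting hazard behind the "is it worth it" question is the data-model change. A minimal check, assuming an LP64 target such as Linux/x86-64 (on 64-bit Windows, which is LLP64, long stays 4 bytes):

    #include <cstdio>

    // On LP64, long and pointers widen to 8 bytes while int stays 4.
    // Code that stuffs a pointer into an int or long breaks in
    // different ways on each 64-bit data model.
    int main() {
        std::printf("sizeof(int)=%zu sizeof(long)=%zu sizeof(void*)=%zu\n",
                    sizeof(int), sizeof(long), sizeof(void*));
        return 0;
    }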

Critical sections with multicore processors

假如想象 Submitted on 2019-11-29 22:20:28
With a single-core processor, where all your threads are run on one single CPU, the idea of implementing a critical section using an atomic test-and-set operation on some mutex (or semaphore, etc.) in memory seems straightforward enough: because your processor is executing a test-and-set from one spot in your program, it necessarily can't be doing one from another spot in your program disguised as some other thread. But what happens when you actually have more than one physical processor? It seems that simple instruction-level atomicity wouldn't be sufficient, because with two processors executing simultaneously, two threads could attempt the test-and-set at essentially the same moment.
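
A minimal sketch of the answer in standard C++ (not a production lock): std::atomic_flag::test_and_set is the portable test-and-set, and on x86 it compiles to a LOCK-prefixed exchange. The LOCK semantics plus the cache-coherence protocol make the read-modify-write atomic across all cores, which is exactly the multi-processor guarantee in question.

    #include <atomic>
    #include <thread>
    #include <cstdio>

    class SpinLock {
        std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    public:
        void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ } }
        void unlock() { flag_.clear(std::memory_order_release); }
    };

    int main() {
        SpinLock lock;
        long counter = 0;
        auto work = [&] {
            for (int i = 0; i < 100000; ++i) {
                lock.lock();        // critical section: one core at a time
                ++counter;
                lock.unlock();
            }
        };
        std::thread t1(work), t2(work);
        t1.join(); t2.join();
        std::printf("%ld\n", counter);  // always 200000 with the lock held
    }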

Determine target ISA extensions of binary file in Linux (library or executable)

纵饮孤独 Submitted on 2019-11-29 20:33:32
We have an issue related to a Java application running under a (rather old) FC3 on an Advantech POS board with a VIA C3 processor. The Java application has several compiled shared libs that are accessed via JNI. The VIA C3 processor is supposed to be i686 compatible. Some time ago, after installing Ubuntu 6.10 on a Mini-ITX board with the same processor, I found out that the previous statement is not 100% true. The Ubuntu kernel hanged on startup due to the lack of some specific, optional instructions of the i686 set in the C3 processor. These instructions, missing from the C3's implementation of the i686 set, were evidently assumed present by the i686-built kernel.
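
As a runtime complement to statically disassembling the binary (e.g. with objdump -d), one can ask the CPU itself which extensions it implements. CMOV is the classic i686 instruction that early VIA C3 cores omit despite reporting an i686-class family, which is a likely culprit for hangs like the one described. A sketch assuming GCC or Clang (__builtin_cpu_supports is their builtin, not portable to MSVC):

    #include <cstdio>

    int main() {
        __builtin_cpu_init();   // initialize the CPU-feature data for GCC
        std::printf("cmov: %d\n", __builtin_cpu_supports("cmov"));
        std::printf("mmx:  %d\n", __builtin_cpu_supports("mmx"));
        std::printf("sse:  %d\n", __builtin_cpu_supports("sse"));
        return 0;
    }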

What is a cache hit and a cache miss? Why would context switching cause cache misses?

别来无恙 Submitted on 2019-11-29 20:07:53
From the 11th chapter (Performance and Scalability), in the section named Context Switching, of the JCIP book:

    When a new thread is switched in, the data it needs is unlikely to be
    in the local processor cache, so a context switch causes a flurry of
    cache misses, and thus threads run a little more slowly when they are
    first scheduled.

Can someone explain, in an easy-to-understand way, the concept of a cache miss and its probable opposite (a cache hit)? Why would context switching cause a lot of cache misses?
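
A crude illustration of hits versus misses (a sketch; the numbers vary by machine). The sequential pass reuses each 64-byte cache line for 16 consecutive ints, so most accesses hit. The strided pass touches one int per line, so nearly every access misses; despite doing 1/16th the loads, it often takes a comparable amount of time.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t N = 1u << 24;      // 64 MiB of ints, bigger than cache
        std::vector<int> v(N, 1);
        long long sum = 0;

        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < N; ++i) sum += v[i];        // sequential: hits
        auto t1 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < N; i += 16) sum += v[i];    // one per line: misses
        auto t2 = std::chrono::steady_clock::now();

        using us = std::chrono::microseconds;
        std::printf("sequential: %lld us, strided: %lld us (sum=%lld)\n",
                    (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<us>(t2 - t1).count(),
                    sum);
    }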

Do all 64-bit Intel architectures support SSSE3/SSE4.1/SSE4.2 instructions?

怎甘沉沦 Submitted on 2019-11-29 17:31:25
Question: I searched the web and the Intel software developer's manual, but I am unable to confirm whether all Intel 64 architectures support up to SSSE3, up to SSE4.1, up to SSE4.2, AVX, etc., so that I can rely on the minimum SIMD instruction set supported in my program. Please help.

Answer 1: An x64-native (AMD64 or Intel 64) processor is only mandated to support SSE and SSE2. SSE3 is supported by Intel Pentium 4 ("Prescott"), AMD Athlon 64 ("revision E"), AMD Phenom, and later processors. This means most, but not all, x64 CPUs support SSE3.
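
Since x86-64 only guarantees SSE and SSE2, anything newer has to be detected at runtime before use. A sketch using the GCC/Clang builtin (MSVC would use __cpuid instead; each feature argument must be a string literal):

    #include <cstdio>

    int main() {
        __builtin_cpu_init();
        std::printf("sse3:   %d\n", __builtin_cpu_supports("sse3"));
        std::printf("ssse3:  %d\n", __builtin_cpu_supports("ssse3"));
        std::printf("sse4.1: %d\n", __builtin_cpu_supports("sse4.1"));
        std::printf("sse4.2: %d\n", __builtin_cpu_supports("sse4.2"));
        std::printf("avx:    %d\n", __builtin_cpu_supports("avx"));
        return 0;
    }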

Address translation with multiple pagesize-specific TLBs

你。 Submitted on 2019-11-29 16:06:59
For Intel 64 and IA-32 processors, for both data and code independently, there may be both a 4KB TLB and a large-page (2MB, 1GB) TLB (LTLB). How does address translation work in this case? Would the hardware simply be able to access both in parallel, knowing that a double hit can't occur? In the LTLBs, how would the entries be organized? I suppose that when an entry is originally filled from a page-structure entry, the LTLB entry could include information about how a hit on this entry would proceed. Does anyone have a reference to a current microarchitecture? There are many possible designs for a TLB that supports multiple page sizes.
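
A sketch of why probing the 4KB TLB and the large-page TLBs in parallel is safe: each structure tags its entries with a different number of virtual-address bits, and a given translation is installed in exactly one of them, so at most one lookup can hit. The shifts below are the standard x86-64 page sizes; the address is hypothetical.

    #include <cstdint>
    #include <cstdio>

    std::uint64_t vpn(std::uint64_t vaddr, unsigned page_shift) {
        return vaddr >> page_shift;   // virtual page number used as the TLB tag
    }

    int main() {
        const std::uint64_t va = 0x00007f1234567000ULL;
        std::printf("4KB VPN: 0x%llx\n", (unsigned long long)vpn(va, 12));
        std::printf("2MB VPN: 0x%llx\n", (unsigned long long)vpn(va, 21));
        std::printf("1GB VPN: 0x%llx\n", (unsigned long long)vpn(va, 30));
        return 0;
    }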