cpu-architecture

Lightweight method to use Amd64 instructions under 32-bit Windows?

Submitted by 人走茶凉 on 2019-12-12 10:43:12
Question: For some CPU-bound code using 64-bit variables, it is beneficial to use the Amd64 instruction set rather than x86. How can that be done under 32-bit Windows (e.g. Windows XP SP3)? Of course I assume a modern, Amd64-enabled CPU. I'm excluding the working but heavyweight method of running a full-blown 64-bit OS in a virtual machine, e.g. Ubuntu for Amd64 under VirtualBox. I understand some assembly is needed, and there will be restrictions, in particular addressing more memory than 32-bit Windows
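The gap the question alludes to comes from the fact that a 32-bit x86 build must synthesize every 64-bit operation from two 32-bit halves (an ADD on the low words, then an ADC, add-with-carry, on the high words), while Amd64 does it in one instruction. A minimal Python model of the 32-bit limb-wise addition (function and constant names are illustrative, not from the question):

```python
MASK32 = (1 << 32) - 1

def add64_via_32bit(a: int, b: int) -> int:
    """Model of how 32-bit x86 adds two 64-bit values:
    ADD on the low 32 bits, then ADC (add-with-carry) on the
    high 32 bits. On x86-64 this is a single 64-bit ADD."""
    lo = (a & MASK32) + (b & MASK32)        # ADD: low halves
    carry = lo >> 32                        # carry flag out of the low add
    hi = (a >> 32) + (b >> 32) + carry      # ADC: high halves + carry
    return ((hi & MASK32) << 32) | (lo & MASK32)
```

Every 64-bit add, subtract, shift, or compare pays a similar two-instruction (or worse) tax in 32-bit mode, which is why 64-bit-heavy inner loops benefit from the Amd64 instruction set.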

Cache coherence literature generally refers only to store buffers, not read buffers. Yet one somehow needs both?

Submitted by 六眼飞鱼酱① on 2019-12-12 08:58:56
Question: When reading about consistency models (namely x86-TSO), authors in general resort to models with a bunch of CPUs, their associated store buffers, and their private caches. If my understanding is correct, store buffers can be described as queues into which CPUs may put any store instruction they want to commit to memory. So, as the name states, they are store buffers. But when I read those papers, they tend to talk about the interaction of loads and stores, with statements such as "a
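The reason the literature needs no symmetric "read buffer" is that in a TSO-style model loads are satisfied either by forwarding from the core's own store buffer or directly from the memory system. A toy single-core sketch of that behavior (class and method names are illustrative, and this ignores caches, ordering fences, and multiple pending loads):

```python
from collections import deque

class Core:
    """Toy TSO core: stores enter a FIFO store buffer and drain
    to memory later, in order; loads first check the core's own
    store buffer (store-to-load forwarding), then fall through
    to memory. There is no separate buffer for loads."""
    def __init__(self, memory: dict):
        self.memory = memory
        self.store_buffer = deque()  # FIFO of (addr, value)

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Forward from the youngest matching buffered store, if any.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Commit the oldest buffered store to memory (FIFO order).
        if self.store_buffer:
            a, v = self.store_buffer.popleft()
            self.memory[a] = v
```

Note the asymmetry this model makes visible: a core sees its own buffered store immediately, but another core sharing the same memory does not until the store drains, which is exactly the store-load interaction the papers discuss.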

Machine code alignment

Submitted by 三世轮回 on 2019-12-12 08:55:11
Question: I am trying to understand the principles of machine code alignment. I have an assembler implementation which can generate machine code at run time. I use 16-byte alignment on every branch destination, but it looks like that is not the optimal choice, since I've noticed that if I remove the alignment, the same code sometimes runs faster. I think it has something to do with the cache line width, so that some instructions are cut by a cache-line boundary and the CPU experiences stalls because of that. So if some bytes of
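The two calculations involved here are simple to state: how many padding (NOP) bytes are needed to reach the next alignment boundary, and whether a given instruction straddles a cache-line boundary. A sketch (function names are illustrative; 64 bytes is the cache-line size on mainstream x86):

```python
def pad_to_align(addr: int, align: int = 16) -> int:
    """Number of padding (NOP) bytes needed so the next emitted
    byte lands on an `align`-byte boundary (align = power of two)."""
    assert align & (align - 1) == 0
    return (-addr) & (align - 1)

def crosses_line(addr: int, length: int, line: int = 64) -> bool:
    """True if an instruction of `length` bytes starting at `addr`
    is split across a cache-line boundary."""
    return addr // line != (addr + length - 1) // line
```

This also hints at why unconditional 16-byte alignment can lose: the inserted NOPs themselves consume fetch bandwidth and cache space, so padding only pays off when it prevents a hot branch target or instruction from being split.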

What happened to the L4 cache?

Submitted by 二次信任 on 2019-12-12 08:54:34
Question: There isn't a lot of information about the L4 cache, but as far as I know it was used in the 4th and 5th generations of Intel processors (2013–2014), yet it's gone from the current generation. Was the L4 bad, ineffective, or something? Answer 1: For Haswell and Broadwell, the eDRAM L4 cache tags are resident in the on-chip L3 cache. Although this setup simplifies the LLC design and allows earlier tag checking for fetches from the processor, it makes accessing the eDRAM LLC from other devices (e.g.,

How to deal with the linker error: cannot find -lgcc

Submitted by ε祈祈猫儿з on 2019-12-12 07:46:01
Question: This is my makefile:

task0 : main.o numbers.o add.o
	gcc -m32 -g -Wall -o task0 main.o numbers.o add.o
main.o : main.c
	gcc -g -Wall -m32 -ansi -c -o main.c
numbers.o : numbers.c
	gcc -g -Wall -m32 -ansi -c -o numbers.c
add.o : add.s
	nasm -g -f elf -w+all -o add.o add.s
clean :
	rm -f *.o task0

and this is the terminal output:

gcc -m32 -g -Wall -o task0 main.o numbers.o add.o
/usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-linux-gnu/4.8/libgcc.a when searching for -lgcc
/usr/bin/ld: cannot
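The "skipping incompatible ... libgcc.a" line usually means no 32-bit libgcc is installed for a `-m32` link; on Debian/Ubuntu systems, installing the gcc-multilib package typically provides it. Separately, the compile rules as quoted pass `-o main.c` / `-o numbers.c` without naming an input file, which would overwrite the sources. A corrected sketch of the makefile (a plausible fix, not a verbatim answer from the thread):

```make
task0: main.o numbers.o add.o
	gcc -m32 -g -Wall -o task0 main.o numbers.o add.o

main.o: main.c
	gcc -g -Wall -m32 -ansi -c -o main.o main.c

numbers.o: numbers.c
	gcc -g -Wall -m32 -ansi -c -o numbers.o numbers.c

add.o: add.s
	nasm -g -f elf -w+all -o add.o add.s

clean:
	rm -f *.o task0
```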

Avoid stalling pipeline by calculating conditional early

Submitted by 瘦欲@ on 2019-12-12 07:26:51
Question: When talking about the performance of ifs, we usually talk about how mispredictions can stall the pipeline. The recommended solutions I see are:

- Trust the branch predictor for conditions that usually have one result; or
- Avoid branching with a little bit of bit-magic if reasonably possible; or
- Conditional moves where possible.

What I couldn't find was whether or not we can calculate the condition early to help where possible. So, instead of:

... work
if (a > b) {
    ... more work
}

Do something
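The restructuring being asked about can be sketched as code: evaluate the comparison first, do the independent work, and only then branch, so the condition's result is available well before the branch consumes it. A Python model of the two shapes (names are illustrative; Python only shows the transformation, not the pipeline effect, which on real hardware depends on the compiler and CPU preserving that distance):

```python
def late_condition(a, b, work, more_work):
    """Baseline shape: the compare a > b sits right before
    the branch that consumes it."""
    work()
    if a > b:
        more_work()

def early_condition(a, b, work, more_work):
    """Reordered shape: hoist the comparison above the
    independent work, then branch on the saved result."""
    take_branch = a > b   # condition computed early
    work()
    if take_branch:
        more_work()
```

Both shapes are observationally identical as long as `work` does not modify `a` or `b`, which is exactly the condition under which the hoist is legal.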

Exactly how “fast” are modern CPUs?

Submitted by 夙愿已清 on 2019-12-12 07:08:42
Question: When I used to program embedded systems and early 8/16-bit PCs (6502, 68K, 8086), I had a pretty good handle on exactly how long (in nanoseconds or microseconds) each instruction took to execute. Depending on the family, one (or four) cycles equated to one "memory fetch", and without caches to worry about you could guess timings based on the number of memory accesses involved. But with modern CPUs I'm confused. I know they're a lot faster, but I also know that the headline gigahertz speed isn't

How do I map a memory address to a block when there is an offset in a direct-mapped cache?

Submitted by 我的未来我决定 on 2019-12-12 04:28:32
Question: To start off, the first cache has 16 one-word blocks. As an example I will use the memory reference 0x03. The index has 4 bits (0011). It is clear that those bits equal 3 mod 16 (0011 = 0x03 = 3). However, I am getting confused using this mod equation to determine the block location in a cache with offset bits. The second cache has a total size of eight two-word blocks. This means there is 1 offset bit. Since there are now 8 blocks, there are only 3 index bits. As an example, I will take the same
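The general recipe is: strip the offset first (block number = address div words-per-block), then take the block number mod the number of blocks for the index; what remains is the tag. A Python sketch using word addressing to match the question's examples (the function name is illustrative):

```python
def map_direct(word_addr: int, num_blocks: int, words_per_block: int):
    """Split a word address for a direct-mapped cache into
    (tag, index, word_offset)."""
    word_offset = word_addr % words_per_block      # low offset bit(s)
    block_number = word_addr // words_per_block    # drop the offset bits
    index = block_number % num_blocks              # next index bits
    tag = block_number // num_blocks               # remaining high bits
    return tag, index, word_offset
```

So for reference 0x03: with 16 one-word blocks it lands in block index 3 (no offset), while with 8 two-word blocks it is word 1 of block index 1, because the low bit became the offset before the mod was taken.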

Necessity of J vs. JAL (and JR vs. JALR) in MIPS assembly

Submitted by 让人想犯罪 __ on 2019-12-12 03:56:08
Question: I signed up because I've been googling forever for an answer to this question and can't find one. I'd like to know whether the jump instructions WITHOUT linking are strictly necessary in MIPS. I can imagine, for example, that using the "AL" versions when not required would incur some power penalty, but is there any situation (that's not completely contrived or could be coded around relatively simply) where ONLY J/JR would work? Thank you! Answer 1: Formalizing the comments into an answer: J / JR can be
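The semantic difference is small enough to state as code: JAL does everything J does, plus it writes the return address (pc + 8 on MIPS, skipping the branch-delay slot) into $ra. A toy Python model (names are illustrative):

```python
def jump(pc: int, target: int, regs: dict) -> int:
    """J / JR: transfer control without saving a return address."""
    return target

def jump_and_link(pc: int, target: int, regs: dict) -> int:
    """JAL / JALR: save the return address into $ra, then jump.
    On MIPS the saved address is pc + 8: the instruction after
    the branch-delay slot."""
    regs['ra'] = pc + 8
    return target
```

This makes the trade-off concrete: J is never strictly required for correctness if $ra is dead at that point, since JAL's only extra effect is the write to $ra, but using JAL when the link is unneeded clobbers a register (and, as the question speculates, does extra work).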

Dependency chain analysis

Submitted by 拈花ヽ惹草 on 2019-12-12 02:12:25
Question: From Agner Fog's "Optimizing Assembly" guide, Section 12.7: a loop example. One of the paragraphs discussing the example code: [...] Analysis for Pentium M: ... 13 uops at 3 per clock = one iteration per 4.33c retirement time. There is a dependency chain in the loop. The latencies are: 2 for the memory read, 5 for multiplication, 3 for subtraction, and 3 for the memory write, which totals 13 clock cycles. This is three times as much as the retirement time, but it is not a loop-carried dependence
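The two numbers in the quoted paragraph come from different bounds, and the arithmetic can be replicated directly (latency values are the ones quoted in the paragraph):

```python
# Throughput bound: 13 uops retired at 3 per clock.
uops, retire_per_clock = 13, 3
retire_time = uops / retire_per_clock   # about 4.33 cycles per iteration

# Latency bound: the dependency chain inside one iteration,
# load -> multiply -> subtract -> store.
chain = {'memory read': 2, 'multiply': 5, 'subtract': 3, 'memory write': 3}
chain_latency = sum(chain.values())     # 13 cycles
```

Because the chain is not loop-carried, successive iterations' chains overlap in flight, so the loop runs at the throughput bound (one iteration per ~4.33 cycles) rather than the 13-cycle chain latency.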