cpu-architecture

Why does false sharing still affect non atomics, but much less than atomics?

别来无恙 submitted on 2020-06-16 18:58:29
Question: Consider the following example, which demonstrates the existence of false sharing: using type = std::atomic<std::int64_t>; struct alignas(128) shared_t { type a; type b; } sh; struct not_shared_t { alignas(128) type a; alignas(128) type b; } not_sh; One thread increments a in steps of 1, another thread increments b. The increments compile to lock xadd with MSVC, even though the result is unused. For the structure where a and b are separated, the values accumulated in a few seconds are about ten times greater for…
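
For reference, a minimal self-contained version of the experiment the question describes might look like the sketch below (the run() driver and the two-second window are illustrative choices, not from the original post; the exact ratio is machine-dependent):

```cpp
// Two threads each hammer their own atomic counter; compare the totals
// when the counters share a cache line vs. when they are padded apart.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

using type = std::atomic<std::int64_t>;

// a and b are adjacent inside one aligned struct: they share a line.
struct alignas(128) shared_t { type a{0}; type b{0}; };
// a and b each get their own 128-byte-aligned slot: no sharing.
struct not_shared_t { alignas(128) type a{0}; alignas(128) type b{0}; };

// Increment s.a and s.b from two threads for ~2 s, return the total.
template <class S>
std::int64_t run(S& s) {
    std::atomic<bool> stop{false};
    std::thread ta([&] { while (!stop) s.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread tb([&] { while (!stop) s.b.fetch_add(1, std::memory_order_relaxed); });
    std::this_thread::sleep_for(std::chrono::seconds(2));
    stop = true;
    ta.join();
    tb.join();
    return s.a + s.b;
}

int main() {
    shared_t sh;
    not_shared_t not_sh;
    std::cout << "same line:      " << run(sh) << " increments\n";
    std::cout << "separate lines: " << run(not_sh) << " increments\n";
}
```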

Why is the branch delay slot deprecated or obsolete?

梦想的初衷 submitted on 2020-06-12 06:40:30
Question: When reading the RISC-V User-Level ISA manual, I noticed that it says "OpenRISC has condition codes and branch delay slots, which complicate higher performance implementations," so RISC-V doesn't have branch delay slots (RISC-V User-Level ISA manual link). Moreover, Wikipedia says that most newer RISC designs omit the branch delay slot. Why have most newer RISC architectures gradually omitted the branch delay slot? Answer 1: Citing Hennessy and Patterson (Computer Architecture: A Quantitative Approach, 5th ed.): Fallacy: You…

Why does CLFLUSH exist in x86?

眉间皱痕 submitted on 2020-06-09 17:57:45
Question: I recently learned about the row hammer attack. In order to perform this attack, the programmer needs to flush the complete cache hierarchy of a CPU for a specific set of addresses. My question is: why is CLFLUSH necessary in x86? What are the reasons for ever using this instruction, if all L* caches act transparently (i.e., no explicit cache invalidation is needed)? Besides that: isn't the CPU free to speculate on memory access patterns, and thereby ignore the instruction altogether? Answer 1: I…
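
For background on how the attack uses it: user space can reach CLFLUSH through the SSE2 intrinsic _mm_clflush. A rough flush-then-time sketch (GCC/Clang on x86-64 assumed; the fencing here is the simple mfence variant, and absolute cycle counts are only illustrative):

```cpp
#include <immintrin.h>   // _mm_clflush, _mm_mfence (SSE2)
#include <x86intrin.h>   // __rdtsc (GCC/Clang)
#include <cstdint>
#include <cstdio>

std::int64_t victim = 42;

// Time one load of *p in TSC cycles, fenced so the load isn't reordered.
static std::uint64_t timed_read(const std::int64_t* p) {
    _mm_mfence();
    std::uint64_t t0 = __rdtsc();
    volatile std::int64_t v = *p;   // force a real load
    (void)v;
    _mm_mfence();
    return __rdtsc() - t0;
}

int main() {
    timed_read(&victim);            // warm the cache line
    std::printf("cached read:  %llu cycles\n",
                (unsigned long long)timed_read(&victim));
    _mm_clflush(&victim);           // write back + invalidate every cached copy
    _mm_mfence();                   // ensure the flush completes before reloading
    std::printf("flushed read: %llu cycles\n",
                (unsigned long long)timed_read(&victim));
}
```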

How do modern x86 processors actually compute multiplications?

女生的网名这么多〃 submitted on 2020-06-08 18:44:51
Question: I was watching a lecture on algorithms, and the professor used multiplication as an example of how naive algorithms can be improved... It made me realize that multiplication is not that obvious: although when I am coding I just consider it a simple atomic operation, multiplication requires an algorithm to run; it does not work like summing numbers. So I wonder: what algorithm do modern desktop processors actually use? I guess they don't rely on logarithm tables, and don't make loops with…
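
For contrast with what the hardware does, the naive serial baseline the lecture presumably starts from is shift-and-add: one partial product per set bit of the multiplier, accumulated sequentially. Real integer multipliers instead generate all the partial products at once (often Booth-recoded) and compress them with carry-save adder trees, which is why imul has a short fixed latency (3 cycles on many recent x86 cores). A minimal sketch:

```cpp
// The naive serial algorithm hardware improves on: shift-and-add.
#include <cstdint>
#include <cstdio>

std::uint64_t shift_add_mul(std::uint64_t a, std::uint64_t b) {
    std::uint64_t product = 0;
    while (b != 0) {
        if (b & 1)          // this bit contributes a shifted copy of a
            product += a;
        a <<= 1;            // next partial product is one position higher
        b >>= 1;
    }
    return product;         // wraps mod 2^64, like the low half of imul
}

int main() {
    std::printf("%llu\n", (unsigned long long)shift_add_mul(12345, 6789)); // 83810205
}
```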

Are two store buffer entries needed for split line/page stores on recent Intel?

心不动则不痛 submitted on 2020-06-08 16:57:10
Question: It is generally understood that one store buffer entry is allocated per store, and that this store buffer entry holds the store data and physical address¹. In the case that a store crosses a 4096-byte page boundary, two different translations may be needed, one for each page, and hence two different physical addresses may need to be stored. Does this mean that page-crossing stores take two store buffer entries? If so, does that also apply to line-crossing stores? ¹ ...and perhaps some/all of the…
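
To experiment with this, e.g. against a split-store performance counter such as mem_inst_retired.split_stores on some Intel cores (event name is an assumption; check your CPU's event list), one can construct both kinds of stores directly. The sketch below uses memcpy so the misaligned accesses stay well-defined C++; on x86-64 an 8-byte memcpy typically compiles to a single unaligned store:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    // Two pages, page-aligned, so buf + 4096 is a page boundary
    // (and every multiple of 64 inside is a cache-line boundary).
    char* buf = static_cast<char*>(std::aligned_alloc(4096, 2 * 4096));

    std::int64_t v = -1;
    std::memcpy(buf + 60,   &v, sizeof v);  // crosses the line boundary at offset 64
    std::memcpy(buf + 4092, &v, sizeof v);  // crosses the page boundary at offset 4096

    std::printf("split-line store at %p, split-page store at %p\n",
                static_cast<void*>(buf + 60), static_cast<void*>(buf + 4092));
    std::free(buf);
}
```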

How to tell the length of an x86-64 instruction opcode using the CPU itself?

亡梦爱人 submitted on 2020-06-08 12:19:13
Question: I know that there are libraries that can "parse" binary machine code / opcodes to tell the length of an x86-64 CPU instruction. But I'm wondering: since the CPU has internal circuitry to determine this, is there a way to use the processor itself to tell the instruction size from binary code? (Maybe even a hack?) Answer 1: The Trap Flag (TF) in EFLAGS/RFLAGS makes the CPU single-step, i.e. take an exception after running one instruction. So if you write a debugger, you can use the CPU's single-stepping…
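
A hedged sketch of that debugger approach on Linux (PTRACE_SINGLESTEP has the kernel set TF for us): single-step a forked child through a buffer of instruction bytes and report each RIP delta as the instruction length. Everything here, from the byte buffer to the ud2 terminator, is an illustrative choice, and mmap with PROT_EXEC may be refused under strict W^X policies:

```cpp
#include <sys/mman.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <csignal>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <unistd.h>

int main() {
    // Bytes to measure: nop (1), xor eax,eax (2), mov eax,42 (5), then
    // ud2 to end the run with SIGILL instead of SIGTRAP.
    static const std::uint8_t code[] = {
        0x90, 0x31, 0xC0, 0xB8, 0x2A, 0x00, 0x00, 0x00, 0x0F, 0x0B };
    void* buf = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    std::memcpy(buf, code, sizeof code);

    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
        raise(SIGSTOP);              // stop so the parent can take control
        _exit(0);                    // never reached; parent redirects RIP
    }
    int status;
    waitpid(child, &status, 0);      // child is now stopped at the SIGSTOP

    user_regs_struct regs;
    ptrace(PTRACE_GETREGS, child, nullptr, &regs);
    regs.rip = reinterpret_cast<std::uint64_t>(buf);  // jump into our buffer
    ptrace(PTRACE_SETREGS, child, nullptr, &regs);

    std::uint64_t prev = regs.rip;
    for (;;) {
        ptrace(PTRACE_SINGLESTEP, child, nullptr, nullptr);
        waitpid(child, &status, 0);
        if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
            break;                   // ud2 raised SIGILL: end of the run
        ptrace(PTRACE_GETREGS, child, nullptr, &regs);
        std::printf("instruction at %#llx: %llu bytes\n",
                    (unsigned long long)prev,
                    (unsigned long long)(regs.rip - prev));
        prev = regs.rip;
    }
    kill(child, SIGKILL);
}
```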

When to use a write-through cache policy for pages

£可爱£侵袭症+ submitted on 2020-05-30 03:37:05
Question: I was reading the MDS attack paper RIDL: Rogue In-Flight Data Load. They set pages as write-back, write-through, write-combining or uncacheable, and with different experiments determine that the Line Fill Buffer is the cause of the micro-architectural leaks. On a tangent: I was aware that memory can be uncacheable, but I assumed that cacheable data was always cached in a write-back cache, i.e. I assumed that L1, L2 and LLC were always write-back caches. I read up on the differences between…
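
To make the policy difference concrete, here is a toy counting model, purely illustrative and not how real hardware is configured (on x86 the per-page policy comes from the PAT/MTRR memory types the paper manipulates): write-through sends every store to memory immediately, while write-back just dirties the line and writes memory once, on eviction.

```cpp
// Toy model: count how many memory writes each policy performs
// for the same stream of stores that all hit one cache line.
#include <cstdio>

struct OneLineCache {
    bool write_through;
    bool dirty = false;
    int  mem_writes = 0;

    void store() {                       // a store that hits this line
        if (write_through) ++mem_writes; // WT: memory updated on every store
        else dirty = true;               // WB: just mark the line dirty
    }
    void evict() {                       // the line leaves the cache
        if (dirty) { ++mem_writes; dirty = false; } // WB writes back once
    }
};

int main() {
    OneLineCache wt{true}, wb{false};
    for (int i = 0; i < 1000; ++i) { wt.store(); wb.store(); }
    wt.evict(); wb.evict();
    std::printf("write-through: %d memory writes\n", wt.mem_writes); // 1000
    std::printf("write-back:    %d memory writes\n", wb.mem_writes); // 1
}
```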