x86

Fast interleave 2 double arrays into an array of structs with 2 float and 1 int (loop invariant) member, with SIMD double->float conversion?

試著忘記壹切 submitted on 2021-02-20 06:50:27
Question: I have a section of code which is a bottleneck in a C++ application running on x86 processors, where we take double values from two arrays, cast them to float, and store them in an array of structs. It is a bottleneck because it is called either with very large loops or thousands of times. Is there a faster way to do this copy-and-cast operation using SIMD intrinsics? I have seen this answer on faster memcpy, but it doesn't address the cast. The simple C++ loop case looks like this: int _iNum; const
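
The excerpt is cut off before the struct and loop are shown, so the following is only a minimal sketch under stated assumptions: the struct layout (`struct point` with members `x`, `y`, `tag`) and the function name `interleave` are hypothetical, not the asker's code. It converts two doubles at a time with SSE2 `_mm_cvtpd_ps`, interleaves the results with `_mm_unpacklo_ps`, and falls back to a scalar loop for the tail or when SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical layout: two floats followed by one loop-invariant int. */
struct point {
    float x;
    float y;
    int tag;
};

/* Convert xs[i], ys[i] to float and store them, with `tag`, in out[i]. */
static void interleave(const double *xs, const double *ys, int tag,
                       struct point *out, size_t n)
{
    assert(xs != NULL && ys != NULL && out != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        /* Each conversion yields two floats in the low half: {a0, a1, 0, 0}. */
        __m128 fx = _mm_cvtpd_ps(_mm_loadu_pd(xs + i));
        __m128 fy = _mm_cvtpd_ps(_mm_loadu_pd(ys + i));
        /* Interleave the low halves: {x0, y0, x1, y1}. */
        __m128 xy = _mm_unpacklo_ps(fx, fy);
        /* x and y are adjacent floats, so one 8-byte store fills both. */
        _mm_storel_pi((__m64 *)&out[i].x, xy);
        _mm_storeh_pi((__m64 *)&out[i + 1].x, xy);
        out[i].tag = tag;
        out[i + 1].tag = tag;
    }
#endif
    /* Scalar tail (and the whole loop when SSE2 is not available). */
    for (; i < n; ++i) {
        out[i].x = (float)xs[i];
        out[i].y = (float)ys[i];
        out[i].tag = tag;
    }
}
```

Whether this beats the compiler's own auto-vectorized loop depends on the real struct layout and target CPU, so it should be measured rather than assumed.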

x86 BSWAP instruction REX doesn't follow Intel specs?

风格不统一 submitted on 2021-02-20 06:27:43
Question: I've been assembling (and disassembling) the BSWAP x64 instruction with both NASM and GAS, and both assemble the instruction BSWAP r15 as 490FCF in hex. Disassemblers also disassemble this to the same instruction. The REX prefix for the instruction (49) thus has the REX.W bit (bit 3) and the REX.B bit (bit 0) set. This is in direct contrast to the Intel documentation, which states: In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R
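
To make the bit positions concrete, here is a minimal sketch that decodes a REX prefix byte (pattern 0100WRXB) and recovers the register operand of a BSWAP encoded as `[REX] 0F C8+rd`. The names `rex_fields`, `decode_rex`, and `bswap_register` are hypothetical helpers for illustration, not part of any assembler or disassembler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of an x86-64 REX prefix, whose byte pattern is 0100WRXB. */
struct rex_fields {
    bool w; /* bit 3: 64-bit operand size */
    bool r; /* bit 2: extends ModRM.reg */
    bool x; /* bit 1: extends SIB.index */
    bool b; /* bit 0: extends ModRM.rm, SIB.base, or the opcode register */
};

/* Returns true and fills *out if `byte` is a REX prefix (0x40..0x4F). */
static bool decode_rex(uint8_t byte, struct rex_fields *out)
{
    assert(out != NULL);
    if ((byte & 0xF0) != 0x40)
        return false;
    out->w = (byte & 0x08) != 0;
    out->r = (byte & 0x04) != 0;
    out->x = (byte & 0x02) != 0;
    out->b = (byte & 0x01) != 0;
    return true;
}

/* Returns the register number (0..15) of a BSWAP encoded as
 * [REX] 0F C8+rd, or -1 if the bytes are not such an encoding. */
static int bswap_register(const uint8_t *code, size_t len)
{
    struct rex_fields rex = { false, false, false, false };
    size_t pos = 0;
    if (len > 0 && decode_rex(code[0], &rex))
        pos = 1;
    if (len < pos + 2 || code[pos] != 0x0F || (code[pos + 1] & 0xF8) != 0xC8)
        return -1;
    /* The low three opcode bits select the register; REX.B adds 8. */
    return (code[pos + 1] & 0x07) + (rex.b ? 8 : 0);
}
```

For the bytes 49 0F CF, `decode_rex` reports W and B set with R and X clear, and `bswap_register` yields 15, matching the r15 operand described in the question.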

mmap substitute for malloc

浪子不回头ぞ submitted on 2021-02-20 05:00:07
Question: I need to find a way to use mmap instead of malloc. How is this possible? (I am not using libc, only syscalls.) And yes, brk() is possible. I used sbrk() but realized it's not a syscall... (x86 inline assembly) I've been looking around and saw this: How to use mmap to allocate a memory in heap? But it didn't help me, because I got a segfault. Basically, all I want to do is create 3 slabs of memory for storing characters. Say, char * x = malloc(1000); char * y = malloc(2000); char * z = malloc
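
As a minimal sketch of the idea, the following replaces each malloc call with an anonymous private mapping. It assumes Linux and, for brevity, uses the libc `mmap`/`munmap` wrappers, which the asker says they are avoiding; the helper names `slab_alloc` and `slab_free` are hypothetical. Without libc, the same six arguments would be passed to the raw mmap system call instead.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `size` bytes of zero-filled, readable and
 * writable memory that is private to this process.
 * Returns NULL on failure. */
static char *slab_alloc(size_t size)
{
    assert(size > 0);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (char *)p;
}

/* Hypothetical helper: unmap a slab. Unlike free(), munmap needs the
 * size, so the caller must remember it. Returns 0 on success. */
static int slab_free(char *p, size_t size)
{
    return munmap(p, size);
}
```

A call such as `char *x = slab_alloc(1000);` then stands in for `malloc(1000)`. The kernel rounds each mapping up to whole pages, so three small slabs cost at least three pages; a failed mapping is reported as `MAP_FAILED`, not as a null pointer, which is an easy check to get wrong when calling mmap directly.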

Why is this int $0x10 BIOS INT not working on Linux?

隐身守侯 submitted on 2021-02-20 00:43:42
Question: I am not sure if I am doing something drastically wrong. I am learning assembly language in AT&T syntax on a Linux machine with an Intel chip. I learned that INT 10H is used to invoke BIOS subroutines for various video purposes. I wrote this simple assembly code to clear the screen.

    .section .data
    data_items:
    .section .text
    .global _start
    _start:
    mov $6, %ah    # to select the scroll function
    mov $0, %al    # the entire page
    mov $7, %bh    # for normal attribute
    mov $0, %ch    # row value of the start point

gcc inline assembly behave strangely

主宰稳场 submitted on 2021-02-19 08:25:11
Question: I am currently learning GCC's extended inline assembly. I wrote an A + B function and want to detect the ZF flag, but things behave strangely. The compiler I use is gcc 7.3.1 on x86-64 Arch Linux. I started from the following code, which correctly prints a + b.

    int a, b, sum;
    scanf("%d%d", &a, &b);
    asm volatile (
        "movl %1, %0\n"
        "addl %2, %0\n"
        : "=r"(sum)
        : "r"(a), "r"(b)
        : "cc"
    );
    printf("%d\n", sum);

Then I simply added a variable to check the flags, and it gives me the wrong output.

    int

QueryWorkingSet includes invalid pages in its result

若如初见. submitted on 2021-02-19 06:35:22
Question: I'm currently using 64-bit Windows 7. I'm playing around with some PSAPI (Process Status API) functions to learn a bit more about how Windows manages memory. I noticed, however, that QueryWorkingSet included entries that I couldn't read from (e.g. page 0, and you can't read 0x00000000). When trying it on 64-bit, it became apparent why this was the case: QueryWorkingSet is bugged on 32-bit, as the addresses are truncated (hence the multiple page 0 entries). Still,

GCC placing register args on the stack with a gap below local variables?

烂漫一生 submitted on 2021-02-19 06:22:28
Question: I tried to look at the assembly code for a very simple program.

    int func(int x) {
        int z = 1337;
        return z;
    }

With GCC -O0, every C variable has a memory address that's not optimized away, so gcc spills its register arg: (Godbolt, gcc5.5 -O0 -fverbose-asm)

    func:
        pushq %rbp                 #
        movq %rsp, %rbp            #,
        movl %edi, -20(%rbp)       # x, x
        movl $1337, -4(%rbp)       #, z
        movl -4(%rbp), %eax        # z, D.2332
        popq %rbp                  #
        ret

What is the reason that the function parameter x gets placed on the stack below the local variables?

How does the communication between CPU happen?

六月ゝ 毕业季﹏ submitted on 2021-02-19 05:40:08
Question: Another question about L2/L3 caches explained that L3 can be used for inter-process communication (IPC). Are there other methods or pathways for this communication to happen? It seems that there are, because Intel nearly halved the amount of L3 cache per core in their newest processor lineup (1.375 MiB per core in SKL-X) compared with previous generations (2.5 MiB per core in Broadwell EP). Per-core private L2 increased from 256k to 1M, though. Answer 1: There are inter