x86-64

Can counting byte matches between two strings be optimized using SIMD?

[亡魂溺海] posted on 2019-12-19 00:38:29
Question: Profiling suggests that this function here is a real bottleneck for my application: static inline int countEqualChars(const char* string1, const char* string2, int size) { int r = 0; for (int j = 0; j < size; ++j) { if (string1[j] == string2[j]) { ++r; } } return r; } Even with -O3 and -march=native , G++ 4.7.2 does not vectorize this function (I checked the assembler output). Now, I'm not an expert with SSE and friends, but I think that comparing more than one character at once should be
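As a rough illustration of what "comparing more than one character at once" could look like, here is a minimal SSE2 sketch in C. The function name count_equal_chars_sse2 is hypothetical and not from the question; the sketch assumes an x86-64 target (where SSE2 is always available) and uses unaligned loads so the inputs need no particular alignment.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Counts positions j in [0, size) where string1[j] == string2[j].
   Hypothetical helper name; behaviour matches the scalar loop in the question. */
static int count_equal_chars_sse2(const char *string1, const char *string2, int size)
{
    assert(size >= 0);
    const __m128i zero = _mm_setzero_si128();
    int total = 0;
    int j = 0;

    while (size - j >= 16) {
        /* 16 per-byte counters. Each gains at most 1 per block, so the
           inner loop is capped at 255 blocks to avoid 8-bit overflow. */
        __m128i acc = _mm_setzero_si128();
        int blocks = (size - j) / 16;
        if (blocks > 255)
            blocks = 255;
        for (int b = 0; b < blocks; ++b, j += 16) {
            __m128i a = _mm_loadu_si128((const __m128i *)(string1 + j));
            __m128i c = _mm_loadu_si128((const __m128i *)(string2 + j));
            /* Equal bytes compare to 0xFF (-1); subtracting -1 adds 1. */
            acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(a, c));
        }
        /* psadbw against zero sums the bytes of each 64-bit half. */
        __m128i sums = _mm_sad_epu8(acc, zero);
        total += _mm_cvtsi128_si32(sums) + _mm_extract_epi16(sums, 4);
    }

    /* Scalar tail for the last (size % 16) bytes. */
    for (; j < size; ++j)
        total += (string1[j] == string2[j]);
    return total;
}
```

Whether this beats the scalar loop in a given application is something only profiling can tell; for very short strings the setup cost may dominate.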

Why does this movq instruction work on linux and not osx?

血红的双手。 posted on 2019-12-18 19:17:15
Question: The following assembly code gives an error when assembled with as on OS X 10.9.4, but works successfully on Linux (Debian 7.6). In particular, the movq instruction doesn't seem to like the label argument. $ cat test.S .globl _main _main: movq $_main, %rax ret Here is the error: $ as -o test.o test.S test.S:3:32-bit absolute addressing is not supported for x86-64 test.S:3:cannot do signed 4 byte relocation Changing $_main in line 3 to a literal like $10 works fine. The code had to be modified in a

Get file size with stat syscall

…衆ロ難τιáo~ posted on 2019-12-18 18:07:22
Question: I'm trying to get the file size with the stat syscall in assembly (nasm): section .data encodeFile db "/home/user/file" section .bss stat resb 64 struc STAT .st_dev: resd 1 .st_ino: resd 1 .st_mode: resw 1 .st_nlink: resw 1 .st_uid: resw 1 .st_gid: resw 1 .st_rdev: resd 1 .st_size: resd 1 .st_atime: resd 1 .st_mtime: resd 1 .st_ctime: resd 1 .st_blksize: resd 1 .st_blocks: resd 1 endstruc _start: mov rax, 4 mov rdi, encodeFile mov rsi, stat syscall mov eax, dword [stat + STAT.st_size] There is 0 in
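For comparison, here is a minimal C sketch that reads a file's size through the libc stat() wrapper. A second helper prints the size of struct stat and the offset of st_size as the C headers on the build machine define them, which may help when cross-checking a hand-written layout. Both function names are hypothetical and not from the question.

```c
#include <assert.h>
#include <stddef.h>     /* offsetof */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Returns the size of the file at `path` in bytes, or -1 if stat() fails.
   Hypothetical helper name. */
static long long file_size_bytes(const char *path)
{
    struct stat st;
    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}

/* Prints the struct layout facts an assembly version would need to match.
   The values depend on the platform and libc headers in use. */
static void print_stat_layout(void)
{
    printf("sizeof(struct stat) = %zu, offsetof(st_size) = %zu\n",
           sizeof(struct stat), offsetof(struct stat, st_size));
}
```

Calling print_stat_layout() on the target machine gives numbers that can be compared against the buffer size and field offsets used in the assembly source.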

Is it possible to change virtual memory page size?

纵然是瞬间 posted on 2019-12-18 17:04:14
Question: Is it possible to change the virtual memory page size? I'm asking this because in the X86_64 part of the MMU article on Wikipedia, it talks about different page sizes. If the page size can indeed be changed, how is it changed? Answer 1: On x86_64 you can explicitly request 2 MiB pages instead of the usual 4 KiB pages with the help of hugetlbfs. On modern kernels with transparent huge page support, small pages can automagically be concatenated into huge pages in the background, given that the memory
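A minimal Linux-only sketch in C of querying the base page size and asking for a huge-page-backed anonymous mapping. It uses the MAP_HUGETLB flag of mmap rather than a mounted hugetlbfs, which is an assumption on my part and not what the answer describes in detail; the function names are hypothetical. The huge-page request fails unless the system has huge pages reserved, so the sketch falls back to normal pages.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Base page size reported for this process. */
static long base_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Tries to map `len` bytes backed by huge pages, then falls back to
   normal pages. Sets *used_huge to 1 only if the huge-page mapping
   succeeded. Returns NULL if both attempts fail. Hypothetical helper. */
static void *map_prefer_huge(size_t len, int *used_huge)
{
    void *p = MAP_FAILED;
    assert(len > 0 && used_huge != NULL);
    *used_huge = 0;
#ifdef MAP_HUGETLB
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *used_huge = 1;
        return p;
    }
#endif
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

When requesting huge pages, len should be a multiple of the huge page size (2 MiB in the case the answer mentions) so that a later munmap with the same length is valid.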

C to assembly call convention 32bit vs 64bit

喜你入骨 posted on 2019-12-18 15:52:09
Question: I have been following the excellent book Programming Ground Up, wanting to learn assembly. Although not in the book at this point, I wanted to call my assembly function from C. On a 32-bit machine, this works as is when working from the book. What I do here is store the first argument in %ebx and the second in %ecx . .type power, @function .globl power power: pushl %ebp movl %esp, %ebp subl $4, %esp movl 8(%ebp), %ebx movl 12(%ebp), %ecx I compile this (and the rest of the function) into an

How to interpret segment register accesses on x86-64?

我与影子孤独终老i posted on 2019-12-18 15:51:43
Question: With this function: mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648> add %fs:0x0,%rax retq How do I interpret the second instruction and find out what was added to RAX? Answer 1: This code: mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648> add %fs:0x0,%rax retq is returning the address of a thread-local variable. %fs:0x0 is the address of the TCB (Thread Control Block), and 1069833(%rip) is the offset from there to the variable, which is known since the variable resides either in the
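For context, here is a minimal C sketch of the kind of source that returns the address of a thread-local variable. The names are hypothetical, and this is not the code behind the disassembly in the question; the exact instructions a compiler emits depend on the compiler, its flags, and the TLS model in use.

```c
#include <assert.h>

/* Hypothetical thread-local variable: every thread gets its own copy. */
_Thread_local int tls_counter = 0;

/* Returns the address of the calling thread's copy of tls_counter.
   On x86-64 Linux the compiler computes this from the thread pointer
   held in %fs plus an offset to the variable. */
int *tls_counter_address(void)
{
    return &tls_counter;
}

/* Increments the calling thread's copy through the returned address. */
int bump_tls_counter(void)
{
    int *p = tls_counter_address();
    assert(p == &tls_counter);
    return ++*p;
}
```

Compiling such a file with gcc -O2 -fPIC -S and reading the output for tls_counter_address is one way to see how the thread pointer and the offset are combined on a particular toolchain.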

What are .seh_* assembly commands that gcc outputs?

若如初见. posted on 2019-12-18 14:53:47
Question: I use gcc -S for a hello world program. What are the 5 .seh_ commands? I can't seem to find much info at all about them when I search. .file "hi.c" .def __main; .scl 2; .type 32; .endef .section .rdata,"dr" .LC0: .ascii "Hello World\0" .text .globl main .def main; .scl 2; .type 32; .endef .seh_proc main main: pushq %rbp .seh_pushreg %rbp movq %rsp, %rbp .seh_setframe %rbp, 0 subq $32, %rsp .seh_stackalloc 32 .seh_endprologue call __main leaq .LC0(%rip), %rcx call puts movl $0, %eax addq $32,

Why don't GCC and Clang use cvtss2sd [memory]?

一笑奈何 posted on 2019-12-18 13:12:52
Question: I'm trying to optimize some code that's supposed to read single precision floats from memory and perform arithmetic on them in double precision. This is becoming a significant performance bottleneck, as the code that stores data in memory as single precision is substantially slower than equivalent code that stores data in memory as double precision. Below is a toy C++ program that captures the essence of my issue: #include <cstdio> // noinline to force main() to actually read the value from

How to load a pixel struct into an SSE register?

和自甴很熟 posted on 2019-12-18 12:25:21
Question: I have a struct of 8-bit pixel data: struct __attribute__((aligned(4))) pixels { char r; char g; char b; char a; } I want to use SSE instructions to calculate certain things on these pixels (namely, a Paeth transformation). How can I load these pixels into an SSE register as 32-bit unsigned integers? Answer 1: Unpacking unsigned pixels with SSE2 Ok, using SSE2 integer intrinsics from <emmintrin.h> first load the thing into the lower 32 bits of the register: __m128i xmm0 = _mm_cvtsi32_si128(*
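The load-then-unpack idea can be sketched in C as follows. This is an assumption-laden illustration and not the answer's exact code: the struct here is a hypothetical pixel with unsigned char channels (the question uses char), and the function name is made up. It copies the four bytes into a 32-bit integer, moves that into the low lane of an SSE register, then zero-extends each byte to 32 bits with two unpack steps.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical pixel type with unsigned 8-bit channels. */
struct pixel { unsigned char r, g, b, a; };

/* Returns a vector whose four 32-bit lanes hold r, g, b, a (lane 0 = r),
   each zero-extended from 8 bits. */
static __m128i load_pixel_as_u32x4(const struct pixel *p)
{
    int32_t packed;
    const __m128i zero = _mm_setzero_si128();
    __m128i v;

    assert(p != NULL);
    /* memcpy sidesteps strict-aliasing concerns of casting the pointer. */
    memcpy(&packed, p, sizeof packed);

    v = _mm_cvtsi32_si128(packed);    /* low 32 bits = r g b a, rest 0 */
    v = _mm_unpacklo_epi8(v, zero);   /* 8-bit  -> 16-bit, zero-extended */
    v = _mm_unpacklo_epi16(v, zero);  /* 16-bit -> 32-bit, zero-extended */
    return v;
}
```

Interleaving with a zero register is what makes the extension unsigned; a sign-extending variant would need a different second operand.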

Prefetching data to cache for x86-64

寵の児 posted on 2019-12-18 12:16:58
Question: In my application, at one point I need to perform calculations on a large contiguous block of memory data (100s of MBs). What I was thinking was to keep prefetching the part of the block my program will touch in future, so that when I perform calculations on that portion, the data is already in the cache. Can someone give me a simple example of how to achieve this with gcc? I read about _mm_prefetch somewhere, but don't know how to properly use it. Also note that I have a multicore system, but each
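A minimal sketch in C of software prefetching with _mm_prefetch while streaming over a large array. The function name and the prefetch distance are hypothetical; a useful distance has to be found by measuring. For a plain sequential scan like this one, the hardware prefetcher may already do the job, so the sketch shows the mechanics and makes no promise of a speedup. GCC's __builtin_prefetch could be used in the same position.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical distance: how many elements ahead to prefetch. */
#define PREFETCH_AHEAD 64

/* Sums n doubles, prefetching one cache line ahead per 64-byte line read. */
static double sum_with_prefetch(const double *data, size_t n)
{
    double sum = 0.0;
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        /* 8 doubles fill one 64-byte line, so issue one prefetch per line
           and only for addresses that are still inside the array. */
        if ((i & 7) == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&data[i + PREFETCH_AHEAD], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```

_MM_HINT_T0 asks for the line in all cache levels; the other hints (_MM_HINT_T1, _MM_HINT_T2, _MM_HINT_NTA) trade off which levels are filled.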