inline-assembly | 易学教程

Questions about the performance of different implementations of strlen [closed]

Question (closed as off-topic 3 years ago): I have implemented strlen() in several ways, including SSE2 assembly, SSE4.2 assembly, and SSE2 intrinsics, and I ran some experiments comparing them with strlen() from <string.h> and strlen() from glibc. However, their performance, measured in milliseconds, is unexpected. My experiment
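For reference, an SSE2-intrinsic strlen of the kind the question mentions can be sketched as below. This is not the asker's code: the name `strlen_sse2` is hypothetical, the sketch assumes x86 with SSE2 and a GCC/Clang-style `__builtin_ctz`, and it uses the common trick of reading whole aligned 16-byte blocks (which may touch bytes just before the string, something sanitizers can flag).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; an illustrative sketch, not the asker's implementation. */
size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();

    /* Round down to a 16-byte boundary so aligned loads never cross a page. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned offset = (unsigned)((uintptr_t)s & 15);

    /* First block: shift out the match bits for bytes that precede s. */
    __m128i chunk = _mm_load_si128((const __m128i *)p);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
    mask >>= offset;
    if (mask != 0)
        return (size_t)__builtin_ctz(mask);

    /* Remaining blocks: 16 bytes per iteration until a zero byte is found. */
    for (;;) {
        p += 16;
        chunk = _mm_load_si128((const __m128i *)p);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
    }
}
```

When timing a function like this against the library strlen(), the string length and the compiler's ability to fold strlen() on constant strings both affect the result, so measurements are best taken on inputs the compiler cannot see through.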

Reading a register value into a C variable

Question: I remember seeing a way to use GCC extended inline assembly to read a register value and store it in a C variable. For the life of me, though, I cannot remember how to form the asm statement.

Answer 1: Editor's note: this way of using a local register-asm variable is now documented by GCC as "not supported". It still usually happens to work on GCC, but breaks with clang. (This wording in the documentation was added after this answer was posted, I think.) The global fixed-register variable version
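An alternative that avoids register-asm variables is an extended asm statement with an output operand. The sketch below assumes x86-64 and GNU C; the helper name `read_rsp` is hypothetical, and the stack pointer is only an example register.

```c
#include <stdint.h>

/* Hypothetical helper: copies the stack pointer into a C variable.
   Assumes x86-64 and GCC/Clang extended asm (AT&T syntax). */
static inline uint64_t read_rsp(void)
{
    uint64_t value;
    /* "=r" asks the compiler to pick any general-purpose register for the
       output; volatile keeps repeated reads from being merged into one. */
    __asm__ volatile ("mov %%rsp, %0" : "=r"(value));
    return value;
}
```

The value is a snapshot taken wherever the compiler places the statement, so it is mainly useful for debugging or diagnostics.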

How can I indicate that the memory pointed to by an inline ASM argument may be used?

Question: Consider the following small function:

    void foo(int* iptr) {
        iptr[10] = 1;
        __asm__ volatile ("nop"::"r"(iptr):);
        iptr[10] = 2;
    }

Using gcc, this compiles to:

    foo:
        nop
        mov DWORD PTR [rdi+40], 2
        ret

Note in particular that the first write to iptr, iptr[10] = 1, doesn't occur at all: the inline asm nop is the first thing in the function, and only the final write of 2 appears (after the asm statement). Apparently the compiler decides that it only needs to provide an up-to-date version of the value of
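Two ways of telling the compiler that the asm statement may read the pointed-to memory are sketched below, assuming GNU C. The function names are hypothetical. The first uses a "memory" clobber, which covers all memory; the second uses a dummy "m" input operand sized to cover iptr[0..10], a pattern the GCC manual describes for arrays.

```c
/* Variant 1: a "memory" clobber. The compiler must assume the asm reads
   and writes arbitrary memory, so the store of 1 cannot be dropped. */
void foo_clobber(int *iptr)
{
    iptr[10] = 1;
    __asm__ volatile ("nop" : : "r"(iptr) : "memory");
    iptr[10] = 2;
}

/* Variant 2: a dummy memory input operand. The cast describes an array of
   11 ints starting at iptr, so only that region is treated as read. */
void foo_operand(int *iptr)
{
    iptr[10] = 1;
    __asm__ volatile ("nop" : : "r"(iptr), "m"(*(const int (*)[11])iptr));
    iptr[10] = 2;
}
```

The clobber is simpler but also forces the compiler to reload unrelated values cached in registers; the operand form is narrower and usually cheaper.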

Address of labels (MSVC)

Question: We are writing a byte-code for a high-level compiled language, and after a bit of profiling and optimization, it became clear that the current largest performance overhead is the switch statement we use to jump to the byte-code cases. We investigated pulling out the address of each case label and storing it in the byte-code stream itself, in place of the instruction ID that we usually switch on. If we do that, we can skip the jump table and jump directly to the location of code of
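For comparison, the direct-jump dispatch the question describes can be sketched with the GNU C "labels as values" extension (`&&label` and `goto *`). This is a GCC/Clang feature and, as far as I know, is not available in MSVC, so it illustrates the idea and does not answer the MSVC question. The opcodes and the `run` function are hypothetical, and the sketch keeps the label addresses in a small table indexed by opcode instead of writing them into the byte-code stream.

```c
#include <stddef.h>

/* Hypothetical three-instruction byte-code. */
enum { OP_INC, OP_DEC, OP_HALT };

int run(const unsigned char *code)
{
    /* GNU C extension: &&label yields the address of a label. */
    static void *const dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0;
    size_t pc = 0;

    /* Each handler jumps straight to the next handler; no central switch.
       Opcodes are assumed valid; there is no bounds check. */
#define NEXT() goto *dispatch[code[pc++]]
    NEXT();
op_inc:  acc++; NEXT();
op_dec:  acc--; NEXT();
op_halt: return acc;
#undef NEXT
}
```

For example, the program { OP_INC, OP_INC, OP_INC, OP_DEC, OP_HALT } leaves 2 in the accumulator.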

How to set a variable in GCC with Intel syntax inline assembly?

Question: Why doesn't this code set temp to 1? How do I actually do that?

    int temp;
    __asm__(
        ".intel_syntax;"
        "mov %0, eax;"
        "mov eax, %1;"
        ".att_syntax;"
        : : "r"(1), "r"(temp) : "eax");
    printf("%d\n", temp);

Answer 1: You want temp to be an output, not an input, I think. Try:

    __asm__(
        ".intel_syntax;"
        "mov eax, %1;"
        "mov %0, eax;"
        ".att_syntax;"
        : "=r"(temp) : "r"(1) : "eax");

Answer 2: This code does what you are trying to achieve. I hope this helps you:

    #include <stdio.h>
    int main(void) {
        /* Compile with C99
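A compact variant of the output-operand fix is sketched below, assuming GCC with the GNU assembler on x86-64. It uses `.intel_syntax noprefix` so that bare register names are accepted, and it keeps both operands in registers because GCC still substitutes operands in AT&T form (for example `%eax`), which seems to be tolerated for plain registers but not for memory or immediate operands. The function name `set_to_one` is hypothetical.

```c
/* Hypothetical helper: returns 1 via an Intel-syntax mov.
   Assumes GCC + GNU assembler on x86-64. */
int set_to_one(void)
{
    int temp;
    __asm__ (
        ".intel_syntax noprefix\n\t"
        "mov %0, %1\n\t"            /* Intel order: destination, source */
        ".att_syntax prefix"        /* restore the default for later code */
        : "=r"(temp)                /* temp is an output */
        : "r"(1));                  /* the constant arrives in a register */
    return temp;
}
```

Compiling the whole file with GCC's -masm=intel option is another route; in that case the syntax-switching directives are dropped and operands are substituted in Intel form.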

How can I accurately benchmark unaligned access speed on x86_64

Question: In an answer, I stated that unaligned access has had almost the same speed as aligned access for a long time (on x86/x86_64). I didn't have any numbers to back up this statement, so I created a benchmark for it. Do you see any flaws in this benchmark? Can you improve on it (I mean, increase GB/sec so that it reflects the truth better)?

    #include <sys/time.h>
    #include <stdio.h>

    template <int N>
    __attribute__((noinline))
    void loop32(const char *v) {
        for (int i=0; i<N; i+=160) {
            __asm__ ("mov (%0),
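A minimal timing harness for the same kind of measurement might look like the sketch below. It is written in C (the question's code is C++), the names `sum_loads32` and `time_loads` are hypothetical, and it assumes a POSIX system with clock_gettime and a GNU C compiler. The loads go through memcpy, which compilers typically turn into a plain mov, and an empty asm statement keeps the compiler from hoisting the call out of the timing loop. The compiler may still vectorize the inner loop, so this is a starting point for experiments, not a calibrated benchmark.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Sum 32-bit loads starting at p; p may be misaligned. */
__attribute__((noinline))
uint32_t sum_loads32(const char *p, size_t bytes)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 4 <= bytes; i += 4) {
        uint32_t v;
        memcpy(&v, p + i, sizeof v);   /* well-defined unaligned load */
        sum += v;
    }
    return sum;
}

/* Seconds taken by `reps` passes over the buffer. */
double time_loads(const char *p, size_t bytes, int reps)
{
    static volatile uint32_t sink;     /* keeps the results observable */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++) {
        /* Empty asm: p becomes opaque and memory is treated as changed,
           so each pass has to be executed again. */
        __asm__ volatile ("" : "+r"(p) : : "memory");
        sink = sum_loads32(p, bytes);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Calling time_loads once with an aligned pointer and once with the pointer advanced by one byte gives the two numbers to compare; a buffer that fits in L1 cache isolates load cost from memory bandwidth.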

Why does calling the C abort() function from an x86_64 assembly function lead to segmentation fault (SIGSEGV) instead of an abort signal?

Question: Consider the program:

main.c

    #include <stdlib.h>

    void my_asm_func(void);
    __asm__(
        ".global my_asm_func;"
        "my_asm_func:;"
        "call abort;"
        "ret;"
    );

    int main(int argc, char **argv) {
        if (argv[1][0] == '0') {
            abort();
        } else if (argv[1][0] == '1') {
            __asm__("call abort");
        } else {
            my_asm_func();
        }
    }

Which I compile as:

    gcc -ggdb3 -O0 -o main.out main.c

Then I have:

    $ ./main.out 0; echo $?
    Aborted (core dumped)
    134
    $ ./main.out 1; echo $?
    Aborted (core dumped)
    134
    $ ./main.out 2; echo $?
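The excerpt stops before any answer. One explanation commonly given for this symptom is stack alignment: the System V x86-64 ABI requires %rsp to be 16-byte aligned at every call instruction, and on entry to a function it is off by 8 because of the pushed return address. A hand-written function that calls straight into libc therefore makes the call with a misaligned stack, and library code that uses aligned SSE stores can fault. The sketch below shows the usual adjustment, assuming Linux, GCC and AT&T syntax. The name `callee` is a hypothetical stand-in for abort() so that the function can return.

```c
/* Hypothetical stand-in for abort(), so the sketch can return normally. */
volatile int callee_ran = 0;
void callee(void) { callee_ran = 1; }

void my_asm_func(void);
__asm__(
    ".pushsection .text\n"
    ".global my_asm_func\n"
    "my_asm_func:\n"
    "    sub $8, %rsp\n"      /* entry: rsp % 16 == 8; now 16-byte aligned */
    "    call callee\n"
    "    add $8, %rsp\n"      /* undo the adjustment before returning */
    "    ret\n"
    ".popsection\n"
);
```

The `__asm__("call abort")` case inside main is a separate hazard: the compiler does not know that the statement makes a call, so whether it works depends on how the surrounding frame happens to be laid out.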

Avoid optimizing away variable with inline asm

Question: I was reading "Preventing compiler optimizations while benchmarking", which describes how clobber() and escape() from Chandler Carruth's talk (CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!") affect the compiler. From reading that, I assumed that if I have an input constraint like "g"(val), then the compiler wouldn't be able to optimize away val. But in g() below, no code is generated. Why? How can doNotOptimize() be rewritten to ensure code is generated
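For context, the two helpers from the talk are usually written as shown below; this C rendering assumes GNU C, and `store_and_publish` is a hypothetical usage example. Note that escape() takes the address of an object, not its value: a "g" constraint also accepts an immediate, so passing a value that the compiler can fold to a constant may not force any code to be emitted, which is one plausible reading of what the asker observed.

```c
/* escape(): the asm receives the pointer and clobbers memory, so the
   compiler must assume the pointed-to object can be read or written. */
static inline void escape(void *p)
{
    __asm__ volatile ("" : : "g"(p) : "memory");
}

/* clobber(): acts as a read/write of all memory the compiler cannot
   prove private, forcing pending stores to escaped objects to happen. */
static inline void clobber(void)
{
    __asm__ volatile ("" : : : "memory");
}

/* Hypothetical usage: the store to buf[0] cannot be optimized away,
   because buf has escaped and clobber() may observe it. */
int store_and_publish(void)
{
    int buf[4];
    escape(buf);
    buf[0] = 42;
    clobber();
    return buf[0];
}
```

Neither helper emits an instruction; they only constrain what the optimizer may assume.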

Edit Memory Address Via C#, How to set the statement? [duplicate]

Question (marked as a duplicate of "Edit Memory Address via c#", 2 answers; closed 6 years ago): I want to edit a running app (edit a memory address). At address 00498D45 I want to change the current value, MOV BYTE PTR SS:[EBP-423],7, to the updated value, MOV BYTE PTR SS:[EBP-423],8. What I have so far is this (I searched the net, and this is how far I got): Thanks in advance! Now, using this code, what should the call look like? WriteMemory(Process process,00498D45 , MOV BYTE PTR SS:

Is it possible to put assembly instructions into CUDA code?

Question: I want to use assembly code in CUDA C code in order to reduce expensive executions, as we do using asm in C programming. Is it possible?

Answer 1: No, you can't; there is nothing like the asm constructs from C/C++. What you can do is tweak the generated PTX assembly and then use it with CUDA. See this for an example. But for GPUs, assembly optimizations are NOT necessary; you should do other optimizations first, such as memory coalescing and occupancy. See the CUDA Best Practices guide for more