instructions

How do ASCII Adjust and Decimal Adjust instructions work?

梦想与她 submitted on 2019-11-30 05:17:19
Question: I've been struggling to understand the ASCII adjust instructions in x86 assembly language. All over the internet I find information telling me different things, but I guess it's the same thing explained in different ways that I still don't get. Can anyone explain why, in the pseudo-code of AAA and AAS, we have to add or subtract 6 from the low-order nibble in AL? And can someone also explain the pseudo-code for AAM, AAD and the decimal adjust instructions in the Intel instruction set manuals too,
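As a rough illustration of the "add 6" step asked about above, here is a minimal C sketch that models AL, AH and the AF/CF flags in a hypothetical `bcd_state` struct (the struct and the helper names are assumptions for illustration, not anything from the Intel manuals). It follows the classic description of AAA: a nibble counts 16 values but a decimal digit only 10, so when the low nibble exceeds 9, or a carry left bit 3, adding 6 skips the six unused codes and pushes the carry into AH. Newer manuals describe the same adjustment as adding 106H to AX, which can differ for AL values above F9H.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the registers and flags that AAA touches. */
typedef struct {
    uint8_t al, ah;
    int af, cf;
} bcd_state;

/* Models ADD AL, value, recording the carry out of bit 3 (AF) and bit 7 (CF). */
static void add_al(bcd_state *s, uint8_t value) {
    s->af = ((s->al & 0x0F) + (value & 0x0F)) > 0x0F;
    s->cf = ((unsigned)s->al + value) > 0xFF;
    s->al = (uint8_t)(s->al + value);
}

/* Models AAA using the classic "AL + 6, AH + 1" description. */
static void aaa(bcd_state *s) {
    if ((s->al & 0x0F) > 9 || s->af) {
        s->al = (uint8_t)(s->al + 6);   /* skip the six non-decimal nibble codes */
        s->ah = (uint8_t)(s->ah + 1);   /* carry into the next decimal digit */
        s->af = s->cf = 1;
    } else {
        s->af = s->cf = 0;
    }
    s->al &= 0x0F;                      /* keep only the decimal digit */
}

/* Adds two digits (raw 0..9 or ASCII '0'..'9') and returns AH * 10 + AL. */
static unsigned add_digits(uint8_t a, uint8_t b) {
    assert((a & 0x0F) <= 9 && (b & 0x0F) <= 9);
    bcd_state s = { a, 0, 0, 0 };
    add_al(&s, b);
    aaa(&s);
    return s.ah * 10u + s.al;
}
```

For example, `add_digits('9', '8')` leaves 0x71 in AL with AF set after the addition; the adjustment turns that into AH = 1, AL = 7, i.e. 17.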

Android instructions when open the application at first time? [closed]

和自甴很熟 submitted on 2019-11-29 21:07:58
Do you know this? I want to create something like this screen. When I open the application for the first time, I want to open this screen and display some context. How is this possible? I don't know what to search for to find this type of thing. @Override public void onCreate(Bundle savedInstanceState) { super.onCreate(savedInstanceState); ... if (isFirstTime()) { // What to do when the application is opened for the first time goes here } ... } /*** * Checks whether the application is running for the first time and writes a flag to SharedPreferences * @return true if 1st time */ private boolean isFirstTime() { SharedPreferences preferences =

C code loop performance [continued]

别来无恙 submitted on 2019-11-29 19:13:56
This question continues my earlier question here (on the advice of Mystical): C code loop performance. Continuing from that question, when I use packed instructions instead of scalar instructions, the code using intrinsics looks very similar: for(int i=0; i<size; i+=16) { y1 = _mm_load_ps(&output[i]); … y4 = _mm_load_ps(&output[i+12]); for(k=0; k<ksize; k++){ for(l=0; l<ksize; l++){ w = _mm_set_ps1(weight[i+k+l]); x1 = _mm_load_ps(&input[i+k+l]); y1 = _mm_add_ps(y1,_mm_mul_ps(w,x1)); … x4 = _mm_load_ps(&input[i+k+l+12]); y4 = _mm_add_ps(y4,_mm_mul_ps(w,x4)); } } _mm_store_ps(&output[i],y1); … _mm_store

How is it possible that BITWISE AND operation to take more CPU clocks than ARITHMETIC ADDITION operation in a C program?

半城伤御伤魂 submitted on 2019-11-29 16:40:53
I wanted to test whether bitwise operations really are faster to execute than arithmetic operations. I thought they were. I wrote a small C program to test this hypothesis and, to my surprise, the addition takes less time on average than the bitwise AND operation. I cannot understand why this is happening. From what I know, in an addition the carry from the less significant bits has to propagate to the next bits, because the result depends on the carry too. It does not make sense to me that a logic operator is slower than addition. My code is below: #include<stdio.h> #include<time.h>

How to process a 24-bit 3 channel color image with SSE2/SSE3/SSE4?

醉酒当歌 submitted on 2019-11-29 15:42:58
Question: I just started using SSE2 to optimize image processing, but I have no idea how to handle 3-channel 24-bit color images. My pixel data is arranged as BGR BGR BGR ..., unsigned char, 8 bits per channel. If I want to implement Color2Gray as a C/C++ function using SSE2/SSE3/SSE4 instructions, how would I do it? Does my pixel data need to be aligned (4/8/16)? I have read this article: http://supercomputingblog.com/windows/image-processing-with-sse/ but it covers ARGB 4-channel 32-bit color, processing exactly 4 pixels at a time. Thanks
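To make the packed BGR layout described above concrete, here is a plain scalar C sketch of a Color2Gray conversion. It is not an SSE implementation, only a baseline that a vectorized version could be checked against. The function name `bgr_to_gray` and the fixed-point weights 29/150/77 (roughly 0.114 B + 0.587 G + 0.299 R, scaled by 256) are assumptions, since the question does not name a formula.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: packed BGR (3 bytes per pixel) to one gray byte per pixel.
 * The weights are an assumed fixed-point approximation; they sum to 256, so
 * the result always fits in 8 bits after the shift. */
static void bgr_to_gray(const uint8_t *bgr, uint8_t *gray, size_t pixels) {
    assert(bgr != NULL && gray != NULL);
    for (size_t i = 0; i < pixels; ++i) {
        unsigned b = bgr[3 * i];        /* byte 0 of the pixel: blue  */
        unsigned g = bgr[3 * i + 1];    /* byte 1 of the pixel: green */
        unsigned r = bgr[3 * i + 2];    /* byte 2 of the pixel: red   */
        gray[i] = (uint8_t)((29u * b + 150u * g + 77u * r + 128u) >> 8);
    }
}
```

Because each pixel takes 3 bytes, a 16-byte vector load never ends on a pixel boundary. That is the part a SIMD version has to handle and this scalar loop sidesteps.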

What is the significance of operations on the register EAX having their own opcodes?

若如初见. submitted on 2019-11-29 14:09:17
If you look at documentation of operations like cmp, test, add, sub, and and, you will notice that operations that involve register EAX and its 16 and 8 bit variants as the first operand have a distinct opcode which is different from the "general case" version of these instructions. Is this separate opcode merely a way to save code space, is it at all more efficient than the general-case opcode, or is it just some relic of the past that isn't worth shaking off for compatibility reasons? This is primarily a relic of the past, but not exactly "obsolete" either. In the early days (i.e., on
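To show what the "distinct opcode" looks like at the byte level, here is a small C sketch that writes out the machine-code encodings of `add eax, 0x12345678` in both forms (32-bit operand size assumed). The array and helper names are made up for illustration; the bytes follow the documented encodings `05 id` (accumulator form) and `81 /0 id` (general form with a ModRM byte).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* add eax, 0x12345678 using the accumulator-only opcode: 05 id */
static const uint8_t add_eax_short[] = { 0x05, 0x78, 0x56, 0x34, 0x12 };

/* The same instruction using the general form: 81 /0 id.
 * ModRM 0xC0 = mod 11 (register), reg field 000 (/0 selects ADD), rm 000 (EAX). */
static const uint8_t add_eax_general[] = { 0x81, 0xC0, 0x78, 0x56, 0x34, 0x12 };

/* add ecx, 0x12345678 has only the general form: ModRM 0xC1 selects rm 001 (ECX). */
static const uint8_t add_ecx_general[] = { 0x81, 0xC1, 0x78, 0x56, 0x34, 0x12 };

static_assert(sizeof add_eax_short == 5, "accumulator form is opcode + imm32");
static_assert(sizeof add_eax_general == 6, "general form adds a ModRM byte");

/* Extracts the rm field (low three bits) of a ModRM byte. */
static unsigned modrm_rm_field(uint8_t modrm) {
    return modrm & 0x07u;
}

/* Number of bytes the accumulator form saves over the general form. */
static size_t bytes_saved(void) {
    return sizeof add_eax_general - sizeof add_eax_short;
}
```

The saving here is the single ModRM byte. When the immediate fits in a signed 8-bit value, the general `83 /0 ib` form is shorter still, so the accumulator form mainly pays off for full-width immediates.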

Tracing/profiling instructions

一笑奈何 submitted on 2019-11-29 01:19:30
Question: I'd like to statistically profile my C code at the instruction level. I need to know how many additions, multiplications, divisions, etc. I'm performing. This is not your usual run-of-the-mill code profiling requirement. I'm an algorithm developer and I want to estimate the cost of converting my code to hardware implementations. For this, I'm being asked for the instruction call breakdown at run time (parsing the compiled assembly isn't sufficient, as it doesn't account for loops in the code).

C code loop performance

二次信任 submitted on 2019-11-28 15:31:52
I have a multiply-add kernel inside my application and I want to increase its performance. I use an Intel Core i7-960 (3.2 GHz clock) and have already manually implemented the kernel using SSE intrinsics as follows: for(int i=0; i<iterations; i+=4) { y1 = _mm_set_ss(output[i]); y2 = _mm_set_ss(output[i+1]); y3 = _mm_set_ss(output[i+2]); y4 = _mm_set_ss(output[i+3]); for(k=0; k<ksize; k++){ for(l=0; l<ksize; l++){ w = _mm_set_ss(weight[i+k+l]); x1 = _mm_set_ss(input[i+k+l]); y1 = _mm_add_ss(y1,_mm_mul_ss(w,x1)); … x4 = _mm_set_ss(input[i+k+l+3]); y4 = _mm_add_ss(y4,_mm_mul_ss(w,x4)); } } _mm

How to compile Busybox?

匆匆过客 submitted on 2019-11-28 02:09:41
Question: (The i9100 and i9100p phones have the Exynos 4210 SoC, which includes a dual-core 1.2 GHz Cortex A9 processor that supports NEON.) I will compile the latest available busybox source snapshot and upload it for everyone for free on the internet, and maybe even make my own free BusyboxInstaller.apk (I already downloaded today's 14th March snapshot from the official website), because so many busybox installers ship very outdated versions and I want to take advantage of possible optimizations for the Cortex A9

How is x86 instruction cache synchronized?

心不动则不痛 submitted on 2019-11-27 19:06:01
I like examples, so I wrote a bit of self-modifying code in c... #include <stdio.h> #include <sys/mman.h> // linux int main(void) { unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE| MAP_ANONYMOUS, -1, 0); // get executable memory c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits) c[1] = 0b11000000; // to register rax (000) which holds the return value // according to linux x86_64 calling convention c[6] = 0b11000011; // return for (c[2] = 0; c[2] < 30; c[2]++) { // incr immediate data after every run // rest of immediate data (c[3:6]) are