intel | 易学教程

Intel instruction set: multiply with EAX, EBX, ECX or EDX?

Question: How are you supposed to know that when 'mul ecx' is executed, ECX is multiplied by EAX, and not by EBX or EDX? 'mul ecx, eax' would make more sense, though. Answer 1: The instruction set is simply defined that way. Intel could have defined it in other ways, including ways that would have allowed you to completely specify input and output registers, but they did not. The excuse is arguably that at the time the various multiply instructions were added to the instruction set of the CPU (8086
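As a rough illustration of the implicit-operand convention discussed above, the following C sketch models what the one-operand 32-bit 'mul' does: EAX is always the implicit multiplicand, and the 64-bit unsigned product is written to the EDX:EAX pair. The names mul_result and mul32 are hypothetical helpers invented for this sketch; they are not part of any Intel API.

```c
#include <stdint.h>

/* Register pair EDX:EAX as written by the one-operand MUL.
   (Hypothetical type, used only for this illustration.) */
typedef struct {
    uint32_t eax; /* low 32 bits of the product */
    uint32_t edx; /* high 32 bits of the product */
} mul_result;

/* Models `mul src` with a 32-bit operand: EAX is the implicit
   multiplicand and the 64-bit unsigned product goes to EDX:EAX. */
static mul_result mul32(uint32_t eax, uint32_t src)
{
    uint64_t product = (uint64_t)eax * (uint64_t)src;
    mul_result r;
    r.eax = (uint32_t)product;         /* low half  */
    r.edx = (uint32_t)(product >> 32); /* high half */
    return r;
}
```

For example, mul32(0xFFFFFFFF, 2) yields eax = 0xFFFFFFFE and edx = 1, which is why EDX is overwritten even though it is never named in 'mul ecx'. When only the low 32 bits are needed, the two-operand form 'imul eax, ecx' lets you name both registers explicitly.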

Intel x86_64 assembly, how to floor double from xmm register to int?

Question: How can I do that? It would be best if the result ends up in an e*x register. Answer 1: You've asked several trivial questions which you could answer by just looking at how a C compiler does it. From there, you can look up the instructions it used, and decide which ones you want to actually use. (There are about a zillion different rounding functions in libm, so picking the right one in the first place isn't always easy). Using -O3 -ffast-math gets most simple libm functions inlined (since it doesn't
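Following the answer's advice to start from C and inspect the compiler output, here is a minimal sketch of a floor-to-int conversion that needs no libm call. The function name floor_to_int is hypothetical, and the sketch assumes the input is finite and the result fits in an int. On x86-64 the cast typically compiles to the truncating conversion cvttsd2si, so the code only has to correct the negative non-integer case.

```c
/* Floors x toward negative infinity and returns it as int.
   Assumes x is finite and the floored value fits in int. */
static int floor_to_int(double x)
{
    int i = (int)x;    /* C casts truncate toward zero */
    if ((double)i > x) /* truncation rounded a negative non-integer up */
        i -= 1;
    return i;
}
```

Compiling this with optimization enabled and reading the generated assembly is one way to see which instructions the compiler picks; with SSE4.1 available, a compiler may instead use roundsd for floor() itself.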

Many OpenCL SDKs. Which of them should I choose?

Question: On my Windows 7 computer I have three OpenCL SDKs, from these vendors: Intel, NVIDIA, AMD. I build my application with each of them, so I end up with three different binaries, for example: my_app_intel_x86, my_app_amd_x86, my_app_nvidia_x86. The binaries differ in two ways: they use different SDKs in the linking process, and they try to find a different OpenCL platform name at runtime. Can I use only one SDK and check the platform at run time? Answer 1: SDKs give debugging tools, a

Perf tool stat output: multiplex and scaling of “cycles”

Question: I am trying to understand the multiplexing and scaling of the "cycles" event in the "perf" output. The following is the output of the perf tool:

    144094.487583  task-clock (msec)   #   1.017 CPUs utilized
    539912613776   instructions        #   1.09 insn per cycle            (83.42%)
    496622866196   cycles              #   3.447 GHz                      (83.48%)
    340952514      cache-misses        #  10.354 % of all cache refs      (83.32%)
    3292972064     cache-references    #  22.854 M/sec                    (83.26%)
    144081.898558  cpu-clock (msec)    #   1.017 CPUs utilized
    4189372        page-faults         #   0.029 M/sec
    0 major
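For context on the percentages in parentheses: when more events are requested than there are hardware counters, perf time-multiplexes them, and the printed count is an estimate extrapolated from the fraction of time the event was actually counting. A minimal sketch of that extrapolation, assuming the usual time_enabled and time_running values reported by the kernel, might look like this. The function name scale_count is hypothetical and is not taken from the perf source.

```c
#include <stdint.h>

/* Extrapolates a multiplexed counter to the full measurement window.
   raw          : count observed while the event was on a hardware counter
   time_enabled : total time the event was enabled
   time_running : time the event was actually counting
   Returns 0.0 if the event never ran (no estimate is possible). */
static double scale_count(uint64_t raw, uint64_t time_enabled,
                          uint64_t time_running)
{
    if (time_running == 0)
        return 0.0;
    return (double)raw * (double)time_enabled / (double)time_running;
}
```

Under this reading, a figure such as (83.48%) next to "cycles" would be time_running / time_enabled, meaning the event was counted for about 83% of the run and the displayed value was scaled up accordingly.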

Uses of the monitor/mwait instructions

Question: I happened to stumble upon these two instructions, mwait and monitor: https://www.felixcloutier.com/x86/mwait. The Intel manual says these are used to wait for writes in a concurrent multi-processor system, and it made me curious what types of use cases were in mind when these instructions were added to the ISA. What are the semantics of these instructions? Is this integrated through Linux into the threading libraries provided by POSIX (e.g. does the thread yield while monitoring a word)? Or

Is there a more direct method to convert float to int with rounding than adding 0.5f and converting with truncation?

Question: Conversion from float to int with rounding happens fairly often in C++ code that works with floating point data. One use, for example, is in generating conversion tables. Consider this snippet of code:

    // Convert a positive float value and round to the nearest integer
    int RoundedIntValue = (int) (FloatValue + 0.5f);

The C/C++ language defines the (int) cast as truncating, so the 0.5f must be added to ensure rounding up to the nearest positive integer (when the input is positive). For the
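The "when the input is positive" caveat in the question can be made concrete with a small sketch. The two helpers below are hypothetical names invented for this illustration: round_add_half is the idiom from the question, and round_half_away is one possible sign-aware variant. Both assume the result fits in an int. This is not the answer to the question, only a demonstration of why the plain idiom is limited to positive inputs.

```c
/* The idiom from the question: correct only for non-negative input. */
static int round_add_half(float v)
{
    return (int)(v + 0.5f); /* the cast truncates toward zero */
}

/* Sign-aware variant: rounds halves away from zero for both signs. */
static int round_half_away(float v)
{
    return (int)(v < 0.0f ? v - 0.5f : v + 0.5f);
}
```

With these definitions, round_add_half(-2.6f) gives -2 because -2.1 truncates toward zero, whereas round_half_away(-2.6f) gives -3.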

What are the fast LEA and slow LEA units in the microarchitecture of Intel's CPUs?

Question: In the description of Haswell's microarchitecture at the link below, I saw that some pipelines can execute fast LEA and some can execute slow LEA. What is the meaning of fast LEA and slow LEA here? Is it related to the LEA instruction? Search results are usually biased toward the LEA instruction and don't lead to a direct answer. http://www.realworldtech.com/haswell-cpu/4/ Answer 1: Most dedicated ALU units exist only on part of the execution ports (with constant changes being made from generation to generation), the CPU has to

Installed beignet to use OpenCL on Intel, but OpenCL programs only work when run as root

Question: I have an Intel HD Graphics 4000 (3rd-generation processor), and my OS is Linux Mint 17.1 64-bit. I installed beignet to be able to use OpenCL and thus run programs on the GPU. I had been having lots of problems using the pyOpenCL bindings, so I decided to uninstall my current beignet version and install the latest one (you can see the previous question I asked and answered myself about it here). Upgrading beignet worked, and I can now run OpenCL code on my GPU through Python and C/C++ bindings.

Difference between trap flag (TF) and monitor trap flag?

Question: Debuggers like GDB work by setting the TF flag of the EFLAGS register, which causes an exception after every instruction the processor executes, giving tools like gdb control over the debugging. When we are running a virtual machine, for example under KVM, to do the same thing you need to set a flag called the MONITOR TRAP FLAG (pg 15 of the current Intel software manual 3c), which will cause the virtual machine to exit (VMEXIT) after every instruction, giving debugging ability to the

Under which conditions does the DCU prefetcher start prefetching?

Question: I am reading about the different prefetchers available in an Intel Core i7 system. I have performed experiments to understand when these prefetchers are invoked. These are my findings: The L1 IP prefetcher starts prefetching after 3 cache misses, and it only prefetches on a cache hit. The L2 adjacent line prefetcher starts prefetching after the 1st cache miss and prefetches on a cache miss. The L2 H/W (stride) prefetcher starts prefetching after the 1st cache miss and prefetches on a cache hit. I am not able to understand the behavior