intel | 易学教程

Understanding intel SUB instruction


Question: I am currently trying to deepen my understanding of assembly code, and I have been stuck for weeks on a seemingly simple instruction: sub al, BYTE PTR [ebp+4]. Assuming eax = 0x11223300 and BYTE PTR [ebp+4] = 0xaa, what is the value of eax after the above instruction? From what I understand, al can only affect the last byte in eax (0x00 in this case), so the program tries to compute 0x00 - 0xaa. But with the result being negative, I don't get if the result would simply be 0x00 or if numbers are
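The arithmetic in this question can be modeled with a small sketch. It assumes the usual x86 behavior for an 8-bit SUB: the result wraps modulo 256, the carry flag records the borrow, and the upper 24 bits of eax are left untouched. The helper name `sub_al` is hypothetical and only models AL and CF, not the other flags.

```python
def sub_al(eax: int, operand: int) -> tuple:
    """Model `sub al, operand` on a 32-bit eax value.

    Returns (new_eax, carry_flag). Only AL and CF are modeled here.
    """
    al = eax & 0xFF                      # AL is the low byte of EAX
    diff = al - (operand & 0xFF)         # plain integer subtraction
    carry = diff < 0                     # CF is set when a borrow is needed
    new_al = diff & 0xFF                 # 8-bit result wraps modulo 256
    new_eax = (eax & 0xFFFFFF00) | new_al  # upper 24 bits are unchanged
    return new_eax, carry


eax, cf = sub_al(0x11223300, 0xAA)
print(hex(eax), cf)  # 0x11223356 True
```

Under these assumptions, 0x00 - 0xaa wraps to 0x56 (256 - 170 = 86), so the result is neither clamped to 0x00 nor allowed to borrow from the higher bytes of eax.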

Getting “cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 220 (OpenCL 2.2)” warning during runtime


Question: Following this post and this one, I'm compiling the main.c code from this GitHub Gist. Running the CMake command find_package(OpenCL REQUIRED), I get this: -- Looking for CL_VERSION_2_2 - found -- Found OpenCL: C:/Program Files (x86)/IntelSWTools/system_studio_2020/OpenCL/sdk/lib/x86/OpenCL.lib (found version "2.2") indicating that an OpenCL SDK version 2.2 was found. This contradicts what I get from the clinfo tool, which detects OpenCL 1.2 for Intel's SDK/platforms. Now when running the

Intel JCC Erratum - should JCC really be treated separately?


Question: Intel pushed a microcode update to fix an erratum called the "Jump Conditional Code (JCC) Erratum". The updated microcode made some operations less efficient, because it prevents code from being placed in the ICache under certain conditions. The published document, titled Mitigations for Jump Conditional Code Erratum, lists not only JCC; it lists unconditional jumps, conditional jumps, macro-fused conditional jumps, calls, and returns. The MSVC switch /QIntel-jcc-erratum documentation mentions: Under /QIntel-jcc-erratum, the
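As a rough illustration of the condition the mitigation targets, the sketch below checks whether a jump instruction (or a macro-fused compare-and-jump pair treated as one unit) crosses a 32-byte boundary or ends on one. This is my reading of Intel's description of the affected cases, not code from the question; the function name and the example addresses are hypothetical.

```python
BOUNDARY = 32  # the erratum is described in terms of 32-byte boundaries


def touches_32_byte_boundary(start: int, length: int) -> bool:
    """Return True if the bytes [start, start + length) cross a 32-byte
    boundary, or if the instruction ends exactly on one."""
    end = start + length                      # first byte after the instruction
    crosses = start // BOUNDARY != (end - 1) // BOUNDARY
    ends_on = end % BOUNDARY == 0
    return crosses or ends_on


print(touches_32_byte_boundary(0x1F, 2))  # True: bytes 0x1F and 0x20 straddle the boundary
print(touches_32_byte_boundary(0x1E, 2))  # True: the last byte is 0x1F, so it ends on the boundary
print(touches_32_byte_boundary(0x10, 2))  # False
```

Because the check depends only on placement, it applies equally to every instruction type in the list above, which is the crux of the question about whether JCC deserves separate treatment.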

Installing PyOpenCL on Windows using Intel's SDK and pip


Question: Following these instructions, I have downloaded and installed Intel's OpenCL™ SDK (Intel® System Studio) from here. The cl.h file is in the folder C:\Program Files (x86)\IntelSWTools\system_studio_2020\OpenCL\sdk\include\CL; however, when running pip install pyopencl I get this long error message: Building wheel for pyopencl (PEP 517) ... error ERROR: Command errored out with exit status 1: command: 'c:\python38\python.exe' 'c:\python38\lib\site-packages\pip\_vendor\pep517\_in_process.py'

Are two store buffer entries needed for split line/page stores on recent Intel?


Question: It is generally understood that one store buffer entry is allocated per store, and this store buffer entry holds the store data and physical address [1]. In the case that a store crosses a 4096-byte page boundary, two different translations may be needed, one for each page, and hence two different physical addresses may need to be stored. Does this mean that page-crossing stores take 2 store buffer entries? If so, does it apply also to line-crossing stores? [1] ... and perhaps some/all of the
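To make the two cases concrete, here is a minimal sketch that classifies a store by address and size. It assumes 64-byte cache lines and 4096-byte pages (the page size is from the question; the line size is the usual value on recent Intel CPUs). The function names are hypothetical, and the sketch says nothing about how many store buffer entries the hardware actually allocates.

```python
LINE_SIZE = 64     # assumed cache-line size in bytes
PAGE_SIZE = 4096   # page size stated in the question


def crosses(addr: int, size: int, granule: int) -> bool:
    """True if the bytes [addr, addr + size) span two granules."""
    return addr // granule != (addr + size - 1) // granule


def classify_store(addr: int, size: int) -> str:
    """Label a store as page-split, line-split, or contained."""
    if crosses(addr, size, PAGE_SIZE):
        return "page-split"   # every page split is also a line split
    if crosses(addr, size, LINE_SIZE):
        return "line-split"
    return "contained"


print(classify_store(0xFFC, 8))  # page-split
print(classify_store(0x03C, 8))  # line-split
print(classify_store(0x040, 8))  # contained
```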

How do I monitor the amount of SIMD instruction usage


Question: How can I monitor the amount of SIMD (SSE, AVX, AVX2, AVX-512) instruction usage of a process? For example, htop can be used to monitor general CPU usage, but not specifically SIMD instruction usage. Answer 1: I think the only reliable way to count all SIMD instructions (not just FP math) is dynamic instrumentation (e.g. via something like Intel PIN / SDE). See How to characterize a workload by obtaining the instruction type breakdown? and How do I determine the number of x86 machine instructions

Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?


Question: Writing a ZMM register can leave a Skylake-X (or similar) CPU in a state of reduced max-turbo indefinitely. (SIMD instructions lowering CPU frequency and Dynamically determining where a rogue AVX-512 instruction is executing) Presumably Ice Lake is similar. (Workaround: not a problem for zmm16..31, according to @BeeOnRope's comments which I quoted in Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions? So this strlen could just use vpxord xmm16,xmm16,xmm16

Optimizing an incrementing ASCII decimal counter in video RAM on 7th gen Intel Core


Question: I'm trying to optimize the following subroutine for a specific Kaby Lake CPU (i5-7300HQ), ideally to make the code at least 10 times faster compared to its original form. The code runs as a floppy-style bootloader in 16-bit real mode. It displays a ten-digit decimal counter on screen, counting 0 - 9999999999 and then halting. I have taken a look at Agner's Optimization Guides for Microarchitecture and Assembly, Instruction Performance Table and Intel's Optimization Reference Manual. Only
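The subroutine itself is not shown in this excerpt, so the following is only a high-level sketch of the operation being optimized: incrementing a ten-digit ASCII decimal counter in place, with carry propagation. It uses a plain `bytearray` as a stand-in for the on-screen digits and ignores the attribute bytes that real text-mode video RAM interleaves with the characters; the function name is hypothetical.

```python
def increment_ascii_counter(digits: bytearray) -> bool:
    """Increment an ASCII decimal counter in place.

    Returns False when the counter overflows (all digits were '9'),
    in which case every digit has wrapped back to '0'.
    """
    for i in range(len(digits) - 1, -1, -1):  # start at the least significant digit
        if digits[i] != ord("9"):
            digits[i] += 1                    # no carry needed, done
            return True
        digits[i] = ord("0")                  # 9 -> 0 and carry into the next digit
    return False


counter = bytearray(b"0000000199")
increment_ascii_counter(counter)
print(counter.decode())  # 0000000200
```

Most increments touch only the last digit, which is why this kind of loop is usually dominated by the cost of the store to video memory and not by the carry logic.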

32-byte aligned routine does not fit the uops cache


Question: KbL i7-8550U. I'm researching the behavior of the uops cache and ran into something I don't understand about it. As specified in the Intel Optimization Manual 2.5.2.2 (emp. mine): The Decoded ICache consists of 32 sets. Each set contains eight Ways. Each Way can hold up to six micro-ops. - All micro-ops in a Way represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region. - Up to three Ways may be dedicated to the same 32-byte aligned
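The limits quoted from the manual translate into simple capacity arithmetic, sketched below. The constants come straight from the quoted passage; the helper `fits_in_one_region` is hypothetical and checks only the micro-op count, not the other placement rules that the manual lists.

```python
SETS = 32                 # sets in the Decoded ICache
WAYS_PER_SET = 8          # ways per set
UOPS_PER_WAY = 6          # micro-ops a single way can hold
MAX_WAYS_PER_REGION = 3   # ways that may serve one aligned 32-byte region

total_uops = SETS * WAYS_PER_SET * UOPS_PER_WAY
uops_per_region = MAX_WAYS_PER_REGION * UOPS_PER_WAY


def fits_in_one_region(uop_count: int) -> bool:
    """Necessary (not sufficient) condition for the code in one aligned
    32-byte region to be held in the Decoded ICache."""
    return uop_count <= uops_per_region


print(total_uops)       # 1536
print(uops_per_region)  # 18
```

So a 32-byte region that decodes to more than 18 micro-ops cannot be cached under these limits, regardless of how much of the overall 1536-uop capacity is free.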
