CPU

Performance: 32-bit vs. 64-bit arithmetic

Submitted by 纵饮孤独 on 2019-11-28 11:54:29
Are native 64-bit integer arithmetic instructions slower than their 32-bit counterparts (on an x86_64 machine with a 64-bit OS)? Edit: on current CPUs such as the Intel Core 2 Duo, i5/i7, etc.

It depends on the exact CPU and operation. On 64-bit Pentium 4s, for example, multiplication of 64-bit registers was quite a bit slower. Core 2 and later CPUs have been designed for 64-bit operation from the ground up. Generally, even code written for a 64-bit platform uses 32-bit variables where values will fit in them. This isn't primarily because arithmetic is faster (on modern CPUs, it generally isn't) but because smaller values take up less space in memory and in the cache.
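If you want to check this on your own hardware, the rough micro-benchmark below (my own sketch, not taken from the original answer; the loop count and multiplier constant are arbitrary) times a dependent chain of multiply-adds on uint32_t versus uint64_t. Compile with optimizations, e.g. g++ -O2.

```cpp
// Rough micro-benchmark sketch: dependent multiply-add chain, 32-bit vs 64-bit.
#include <chrono>
#include <cstdint>
#include <cstdio>

template <typename T>
static double time_muls(T seed) {
    auto start = std::chrono::steady_clock::now();
    T x = seed;
    for (int i = 0; i < 100000000; ++i) {
        x = x * static_cast<T>(2654435761u) + 1;  // each step depends on the last
    }
    auto end = std::chrono::steady_clock::now();
    std::printf("result: %llu\n", (unsigned long long)x);  // keep x live so the loop isn't removed
    return std::chrono::duration<double>(end - start).count();
}

int main() {
    std::printf("32-bit: %.3f s\n", time_muls<std::uint32_t>(3));
    std::printf("64-bit: %.3f s\n", time_muls<std::uint64_t>(3));
    return 0;
}
```

On a modern x86-64 core both loops should run at essentially the same speed, which is the point of the answer: choose 32-bit types for memory footprint, not for ALU speed.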

Profiling CPU Cache/Memory from the OS/Application?

Submitted by ◇◆丶佛笑我妖孽 on 2019-11-28 11:18:33
I wish to write software which could essentially profile the CPU cache (L2, L3, possibly L1) and the memory, to analyze performance. Am I right in thinking this is not doable because the software has no access to the cache contents? Another way of wording my question: is there any way to know, from the OS/application level, what data has been loaded into the cache/memory? Edit: the operating system is Windows or Linux and the CPU is an Intel desktop/Xeon part.

You might want to look at Intel's PMU, i.e. the Performance Monitoring Unit. Some processors have one. It is a bunch of special-purpose registers (Intel calls them model-specific registers, or MSRs)…
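You cannot read the cache contents themselves, but on Linux the PMU is exposed through the perf_event_open system call (and the perf tool built on top of it). The sketch below is a minimal, hedged example that counts last-level-cache read misses around a workload; the workload function and buffer size are placeholders, not part of the original answer.

```c
// Minimal perf_event_open sketch: count LLC read misses for this process.
// Linux-only; may require relaxing /proc/sys/kernel/perf_event_paranoid.
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static void workload(void) {
    /* Placeholder: touch a buffer larger than a typical LLC to force misses. */
    static volatile char buf[64 * 1024 * 1024];
    for (size_t i = 0; i < sizeof buf; i += 64) buf[i]++;
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    workload();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0;
    if (read(fd, &misses, sizeof misses) != sizeof misses) { perror("read"); return 1; }
    printf("LLC read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```

On Windows there is no equivalent call in the base API; the same PMU events are typically reached through vendor tools such as Intel VTune.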

Why does a misaligned address access incur 2 or more accesses?

Submitted by 元气小坏坏 on 2019-11-28 10:29:41
The usual answers to why data should be aligned are that aligned accesses are more efficient and that they simplify the design of the CPU. A relevant question and its answers are here, and another source is here, but neither resolves my question. Suppose a CPU has an access granularity of 4 bytes, meaning the CPU reads 4 bytes at a time. The material listed above says that if I access misaligned data, say at address 0x1, then the CPU has to do 2 accesses (one covering addresses 0x0, 0x1, 0x2 and 0x3, and one covering addresses 0x4, 0x5, 0x6 and 0x7) and combine the results. I can't see why. Why can't the CPU just read 4 bytes starting at 0x1?
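The short answer is that the memory interface is wired in aligned 4-byte (or wider) chunks, so a value starting at 0x1 physically straddles two of them. The byte-layout sketch below (my illustration, not taken from the linked answers) makes the splice visible; the memcpy is simply the portable way to express an unaligned read in C.

```c
// A 4-byte value stored at offset 1 spans two 4-byte-aligned words, so a
// 4-byte-wide memory interface must fetch both words and splice the result.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint8_t mem[8] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77};

    uint32_t v;
    memcpy(&v, &mem[1], sizeof v);              // unaligned 4-byte read
    printf("value at offset 1: 0x%08x\n", v);   // built from bytes 0x11 0x22 0x33 0x44

    // Aligned word 0 (addresses 0x0-0x3): 0x00 0x11 0x22 0x33
    // Aligned word 1 (addresses 0x4-0x7): 0x44 0x55 0x66 0x77
    // The value at offset 1 needs the top 3 bytes of word 0 and the bottom
    // byte of word 1 -> two aligned fetches plus a shift/combine step.
    return 0;
}
```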

PInvoke for GetLogicalProcessorInformation Function

Submitted by ↘锁芯ラ on 2019-11-28 10:11:01
I want to call the GetLogicalProcessorInformation function from C# via P/Invoke, but I'm stuck on the SYSTEM_LOGICAL_PROCESSOR_INFORMATION struct and the CACHE_DESCRIPTOR struct. How should I define these structs for correct usage? Main problems: 1. SYSTEM_LOGICAL_PROCESSOR_INFORMATION has a union in its definition; 2. SYSTEM_LOGICAL_PROCESSOR_INFORMATION has ULONGLONG in its definition; 3. CACHE_DESCRIPTOR has WORD and DWORD in its definition. Can you help me with these structures? Updated: fixed the structure marshalling, which has to be done manually.

This is quite a messy P/Invoke. Even when you have the…
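For reference, a commonly used set of C# declarations looks roughly like the sketch below (my sketch following the usual interop conventions, not the answer's exact code): the native union is modeled with LayoutKind.Explicit, WORD maps to ushort, DWORD to uint, ULONGLONG to ulong, and ULONG_PTR to UIntPtr so the layout works on both x86 and x64.

```csharp
using System;
using System.Runtime.InteropServices;

public enum LOGICAL_PROCESSOR_RELATIONSHIP : uint
{
    RelationProcessorCore = 0,
    RelationNumaNode = 1,
    RelationCache = 2,
    RelationProcessorPackage = 3,
}

public enum PROCESSOR_CACHE_TYPE : uint
{
    CacheUnified = 0,
    CacheInstruction = 1,
    CacheData = 2,
    CacheTrace = 3,
}

[StructLayout(LayoutKind.Sequential)]
public struct CACHE_DESCRIPTOR
{
    public byte Level;                 // BYTE
    public byte Associativity;         // BYTE
    public ushort LineSize;            // WORD
    public uint Size;                  // DWORD
    public PROCESSOR_CACHE_TYPE Type;  // enum (DWORD-sized)
}

// The union: every member starts at offset 0; ULONGLONG Reserved[2] fixes the size.
[StructLayout(LayoutKind.Explicit)]
public struct SYSTEM_LOGICAL_PROCESSOR_INFORMATION_UNION
{
    [FieldOffset(0)] public byte ProcessorCoreFlags;   // ProcessorCore.Flags
    [FieldOffset(0)] public uint NumaNodeNumber;       // NumaNode.NodeNumber
    [FieldOffset(0)] public CACHE_DESCRIPTOR Cache;
    [FieldOffset(0)] private ulong Reserved1;
    [FieldOffset(8)] private ulong Reserved2;
}

[StructLayout(LayoutKind.Sequential)]
public struct SYSTEM_LOGICAL_PROCESSOR_INFORMATION
{
    public UIntPtr ProcessorMask;                       // ULONG_PTR
    public LOGICAL_PROCESSOR_RELATIONSHIP Relationship;
    public SYSTEM_LOGICAL_PROCESSOR_INFORMATION_UNION ProcessorInformation;
}

public static class NativeMethods
{
    // The function fills a variable-length array, so it is usually called with an
    // IntPtr buffer and the entries are read back via Marshal.PtrToStructure.
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool GetLogicalProcessorInformation(
        IntPtr buffer, ref uint returnedLength);
}
```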

Multi-core CPUs, multithreading, and context switching?

Submitted by ぐ巨炮叔叔 on 2019-11-28 10:08:59
Let's say we have a CPU with 20 cores and a process with 20 CPU-intensive threads that are independent of each other: one thread per CPU core. I'm trying to figure out whether context switching happens in this case. I believe it does, because there are system processes in the operating system that need CPU time too. I understand that there are different CPU architectures and some answers may vary, but can you please explain: How does context switching happen, e.g. on Linux or Windows and on some well-known CPU architectures? What happens under the hood on modern hardware? What if we have 10 cores and 20 threads?
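One concrete way to see this on Linux, sketched below (my example, not from the answers), is to read the voluntary and involuntary context-switch counters the kernel keeps per process in /proc/<pid>/status; even a pinned, CPU-bound thread usually accumulates some involuntary switches from timer interrupts and kernel housekeeping tasks.

```c
// Print the context-switch counters Linux keeps for the current process.
// (Per-thread counters live under /proc/<pid>/task/<tid>/status.)
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "voluntary_ctxt_switches", 23) == 0 ||
            strncmp(line, "nonvoluntary_ctxt_switches", 26) == 0) {
            fputs(line, stdout);   // e.g. "nonvoluntary_ctxt_switches:  42"
        }
    }
    fclose(f);
    return 0;
}
```

The pidstat -w command reports the same counters as per-second rates for every task on the system.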

Getting current CPU usage in C++/Windows for a particular process

Submitted by 可紊 on 2019-11-28 09:57:04
Question: I want to calculate the current CPU usage of a particular application in my code. I looked around on the internet and found the PDH library for Windows. When I tried it, I got the overall CPU usage, not the CPU usage of one process: PdhAddCounter(hquery, TEXT("\\Processor(_Total)\\% Processor Time"), 0, &counter); So what do I do with this line to get the CPU usage of a particular process? I tried replacing _Total with the process name (explorer), but then I got 0 CPU usage, even though Resource Monitor showed that the process was using the CPU…
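The per-process counter lives under the Process object, not Processor, and because "% Processor Time" is a rate counter a single PdhCollectQueryData call yields 0: you need two samples with a delay between them, which is very likely why the replacement appeared to report zero. A hedged sketch follows (process name "explorer" assumed; link with pdh.lib):

```cpp
// Per-process CPU usage via PDH: two samples are required for a rate counter.
#include <windows.h>
#include <pdh.h>
#include <iostream>

int main() {
    PDH_HQUERY query = nullptr;
    PDH_HCOUNTER counter = nullptr;

    PdhOpenQuery(nullptr, 0, &query);
    // Process instance name is the executable name without ".exe".
    PdhAddCounter(query, TEXT("\\Process(explorer)\\% Processor Time"), 0, &counter);

    PdhCollectQueryData(query);   // first sample (baseline) -- alone it reads as 0
    Sleep(1000);                  // wait so a rate can be computed
    PdhCollectQueryData(query);   // second sample

    PDH_FMT_COUNTERVALUE value;
    PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, nullptr, &value);

    // Note: the value is scaled to one logical processor; divide by the number
    // of logical processors to match Task Manager's percentage.
    std::cout << "explorer CPU: " << value.doubleValue << " %\n";

    PdhCloseQuery(query);
    return 0;
}
```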

Strange JIT pessimization of a loop idiom

Submitted by 你说的曾经没有我的故事 on 2019-11-28 09:03:58
While analyzing the results of a recent question here, I encountered a quite peculiar phenomenon: apparently an extra layer of HotSpot's JIT optimization actually slows down execution on my machine. Here is the code I have used for the measurement:

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.ARRAY_SIZE)
@Warmup(iterations = 2, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class Measure {
    public static final int ARRAY_SIZE = 1024;
    private final int[] array = new int[ARRAY_SIZE];
    …
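The benchmark methods themselves are cut off in the excerpt above. For readers unfamiliar with JMH, a self-contained benchmark in the same style (class and method names are mine, not the original's) exercising a typical loop idiom would look like this:

```java
// Hypothetical JMH benchmark in the same style as the excerpt above.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(LoopIdiom.ARRAY_SIZE)
@Warmup(iterations = 2, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class LoopIdiom {
    public static final int ARRAY_SIZE = 1024;
    private final int[] array = new int[ARRAY_SIZE];

    @Benchmark
    public long sumWithIndexLoop() {
        long sum = 0;
        for (int i = 0; i < array.length; i++) {
            sum += array[i];
        }
        return sum; // returning the result keeps HotSpot from eliminating the loop
    }

    @Benchmark
    public long sumWithForEach() {
        long sum = 0;
        for (int x : array) {
            sum += x;
        }
        return sum;
    }
}
```

Returning the sum is what stops HotSpot from dead-code-eliminating the loop; how the JIT tiers treat such a loop idiom is exactly what the question is probing.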

Python multithreading max_workers

Submitted by 假如想象 on 2019-11-28 09:00:34
Question: According to the documentation of ThreadPoolExecutor: "If max_workers is None or not given, it will default to the number of processors on the machine." If I don't set a value, i.e. ThreadPoolExecutor(max_workers=None), is it bad for performance in the case where my processor count is very low (2)? Will Python allocate workers for all the CPUs when the value is None, versus allocating only 2 when a number is given?

Answer 1: To begin with, you seem to be quoting the wrong part of the documentation in your link: that passage describes ProcessPoolExecutor, while ThreadPoolExecutor uses a different (larger) default…
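A quick way to see what you actually get is to inspect the pool. The sketch below uses the private _max_workers attribute purely for illustration; the exact default depends on the Python version (e.g. 3.8+ uses min(32, os.cpu_count() + 4) for ThreadPoolExecutor).

```python
# Compare the default worker count with an explicit max_workers value.
import os
from concurrent.futures import ThreadPoolExecutor


def show_workers(executor, label):
    # _max_workers is an internal attribute, used here only for illustration.
    print(f"{label}: {executor._max_workers} workers (cpu_count={os.cpu_count()})")


with ThreadPoolExecutor() as default_pool:
    show_workers(default_pool, "default (max_workers=None)")

with ThreadPoolExecutor(max_workers=2) as small_pool:
    show_workers(small_pool, "explicit max_workers=2")
```

Since CPython threads share the GIL, extra workers mainly help I/O-bound tasks; for CPU-bound work you would reach for ProcessPoolExecutor regardless of this default.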

Is there a way to disable CPU cache (L1/L2) on a Linux system?

Submitted by 让人想犯罪 __ on 2019-11-28 08:42:07
Question: I am profiling some code on a Linux system (running on an Intel Core i7 4500U) to obtain only the execution costs. The application is the demo mpeg2dec from libmpeg2. I am trying to obtain a probability distribution of the mpeg2 execution times. However, we want to see the raw execution cost when the cache is switched off. Is there a way I can disable the CPU cache of my system via a Linux command, or via a gcc flag? Or even set the CPU (L1/L2) cache size to 0 KB? Or even add some code to the application that disables the cache?
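There is no gcc flag or user-space command for this: globally disabling the cache means setting the CD bit in the CR0 control register, which can only be done from kernel mode (e.g. a custom kernel module) and slows the whole machine to a crawl. A common user-space approximation, sketched below under that caveat, is to flush the relevant buffers with clflush before each timed run so every measurement starts from a cold cache (x86-specific; a 64-byte cache-line size is assumed).

```c
// Flush every cache line covering [buf, buf + len) so the next run starts cold.
#include <emmintrin.h>   // _mm_clflush, _mm_mfence (SSE2)
#include <stddef.h>

static void flush_buffer(const void *buf, size_t len) {
    const size_t line = 64;                  // typical x86 cache-line size
    const char *p = (const char *)buf;
    for (size_t off = 0; off < len; off += line) {
        _mm_clflush(p + off);                // evict this line from all cache levels
    }
    _mm_mfence();                            // ensure flushes complete before timing starts
}
```

You would call flush_buffer on whatever buffers the timed code touches (for example the decoder's input and output buffers) before each measured run.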

Implementation of __builtin_clz

Submitted by 回眸只為那壹抹淺笑 on 2019-11-28 08:13:25
What is the implementation of GCC's (4.6+) __builtin_clz? Does it correspond to some CPU instruction on Intel x86_64 (AVX)?

It should translate to a Bit Scan Reverse instruction and a subtract. The BSR gives the index of the leading 1, and then you can subtract that from the word size to get the number of leading zeros. Edit: if your CPU supports LZCNT (Leading Zero Count), then that will probably do the trick too, but not all x86-64 chips have that instruction.

Yes, and no. CLZ (count leading zeros) and BSR (bit scan reverse) are related but different: CLZ equals (type bit width less one) minus BSR…
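The relationship the second answer describes, clz(x) = (width − 1) − bsr(x), can be checked directly against the builtin. The sketch below uses GCC inline assembly on x86-64 (my verification harness, not GCC's actual implementation, which simply emits bsr or lzcnt depending on the target flags). Both forms are undefined for x == 0.

```c
// Verify clz(x) == 31 - bsr(x) for 32-bit values (GCC, x86-64 only).
#include <assert.h>
#include <stdio.h>

static int clz_via_bsr(unsigned x) {
    unsigned index;
    __asm__("bsr %1, %0" : "=r"(index) : "r"(x));  // index of the highest set bit
    return 31 - (int)index;
}

int main(void) {
    unsigned tests[] = {1u, 2u, 0x80u, 0x12345u, 0x80000000u};
    for (int i = 0; i < 5; i++) {
        unsigned x = tests[i];
        printf("x=0x%08x  __builtin_clz=%d  bsr-based=%d\n",
               x, __builtin_clz(x), clz_via_bsr(x));
        assert(__builtin_clz(x) == clz_via_bsr(x));
    }
    return 0;
}
```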