vtune

Pthread Mutex: pthread_mutex_unlock() consumes lots of time

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-03 11:40:55
Question: I wrote a multi-threaded program with pthreads, using the producer-consumer model. When I profiled it with the Intel VTune profiler, I found that the producer and consumer spend a lot of time in pthread_mutex_unlock. I don't understand why this happens. I would expect threads to wait a long time before they can acquire a mutex, but releasing a mutex should be fast, right? The snapshot below is from Intel VTune. It shows the code where the consumer tries to fetch an item from the buffer, and time consumed

Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?

余生长醉 submitted on 2019-11-28 07:34:02
Continuing on from my first question, I am trying to optimize a memory hotspot found by profiling a 64-bit C program with VTune. In particular, I'd like to find the fastest way to test whether a 128-byte block of memory contains all zeros. You may assume any desired memory alignment for the memory block; I used 64-byte alignment. I am using a PC with an Intel Ivy Bridge Core i7 3770 processor with 32 GB of memory and the free version of the Microsoft VS2010 C compiler. My first attempt was:

    const char* bytevecM;    // 4 GB block of memory, 64-byte aligned
    size_t* psz;             // size_t is 64-bits
    // ...
    // "m7 &
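For reference, the 128-byte zero test described in this excerpt can be sketched in portable C as below. This is only a baseline under stated assumptions, not the asker's code and not necessarily the fastest approach: the function name is_zero_128 is hypothetical, and the block is read as sixteen 64-bit words that are OR-ed together, so a single comparison at the end decides the result.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (name is an assumption, not from the question).
 * Returns 1 if the 128 bytes starting at p are all zero, 0 otherwise. */
static int is_zero_128(const void *p)
{
    uint64_t words[16];   /* 16 * 8 bytes = 128 bytes */
    uint64_t acc = 0;
    size_t i;

    /* memcpy avoids alignment and strict-aliasing problems;
     * compilers typically turn it into plain loads. */
    memcpy(words, p, sizeof words);

    /* OR all words together: acc stays 0 only if every byte is 0. */
    for (i = 0; i < 16; ++i)
        acc |= words[i];

    return acc == 0;
}
```

Accumulating with OR keeps the loop free of branches, at the cost of always reading all 128 bytes; an early-exit variant may do better when nonzero blocks are common.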

How should I interpret these VTune results?

允我心安 submitted on 2019-11-28 02:24:44
I'm trying to parallelize this code using OpenMP. OpenCV (built using IPP for best efficiency) is used as an external library. I'm having problems with unbalanced CPU usage in parallel for loops, but it seems that there is no load imbalance. As you will see, this could be because of KMP_BLOCKTIME=0, but this setting could be necessary because of the external libraries (IPP, TBB, OpenMP, OpenCV). In the rest of the question you will find more details and data that you can download. These are the Google Drive links to my VTune results: c755823 basic KMP_BLOCKTIME=0 30 runs: basic hotspot with environment variable