vtune

Why is 'add' taking so long in my application?

南笙酒味 submitted on 2021-01-28 11:13:20
Question: I'm profiling an application using Intel VTune, and there is one particular hotspot where I'm copying a __m128i member variable in the copy constructor of a C++ class. VTune gives this breakdown (Instruction / CPU Time: Total / CPU Time: Self):

Block 1: vmovdqa64x (%rax), %xmm0     4.1%    0.760s
         add $0x10, %rax              46.6%   8.594s
Block 2: vmovapsx %xmm0, -0x10(%rdx)  6.5%    1.204s

(If it matters, the compiler is gcc 7.4.0.) I admit I'm an assembly noob, but it's very surprising that one particular add instruction is

CPU optimization practice for the Curve ChunkServer (open-source distributed storage)

醉酒当歌 submitted on 2020-12-15 09:49:27
The CPU bottleneck of the Curve ChunkServer. Curve is a next-generation distributed storage system open-sourced by NetEase Shufan. It features high performance, high availability, and high reliability, and can serve as the underlying storage for a variety of scenarios, including block storage, object storage, cloud-native databases, and EC. For a distributed block storage system, IOPS is the most important performance metric. Curve's current performance tests show that read IOPS is bottlenecked on the Client side: with a cluster of 6 storage nodes, a single Client node reaches nearly 300,000 read IOPS, and two Client nodes reach nearly 600,000. Write IOPS, however, still has room for improvement: on the same 6-node cluster it only reaches 260,000 to 280,000, with ChunkServer node CPU usage close to 100% while utilization of the underlying SSDs stays below 90%. The random-write IOPS scenario is therefore a key optimization target for Curve. We deployed Curve in test environment A (see Appendix 1 for the configuration), created 10 volumes on the Client node, and ran a 4KB random-write test. The result was about 135,000 write IOPS, with ChunkServer node CPU usage close to 100% and average utilization across all SSDs below 85%. This shows that ChunkServer-side CPU is the performance bottleneck. Given that the SSDs in the current test environment are relatively low-end, high-performance NVMe SSDs could deliver an order of magnitude more IOPS, which would make the CPU bottleneck even more severe. Optimizing CPU usage to unlock the SSDs' I/O capability is therefore an important direction for Curve performance work.

How to profile the number of additions, multiplications, etc. with VTune

懵懂的女人 submitted on 2020-01-06 04:26:12
Question: I am able to profile my C++ library's instruction counts with VTune using the 'INST_RETIRED.ANY' event. What analysis types or events can be used to profile the number of integer/floating-point additions, multiplications, divisions, etc.? Answer 1: (tl;dr) I don't think you can do everything you want with perf counters. See the end of this answer for a possible way using binary instrumentation. Also note that imul is not an expensive operation, and FP mul is barely more expensive than add. e.g.

Effects of loop unrolling on memory-bound data

白昼怎懂夜的黑 submitted on 2020-01-03 02:24:10
Question: I have been working with a piece of code that is intensively memory bound. I am trying to optimize it within a single core by manually implementing cache blocking, SW prefetching, loop unrolling, etc. Cache blocking gives a significant improvement in performance; however, when I introduce loop unrolling I get tremendous performance degradation. I am compiling with Intel icc with the compiler flags -O2 and -ipo in all my test cases. My code is similar to this (3D 25-point stencil): void

Hotspot in a for loop

扶醉桌前 submitted on 2019-12-21 04:56:09
Question: I am trying to optimize this code. static lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 ) { const size_t len1 = s1.size(), len2 = s2.size(); std::vector<unsigned int> col( len2+1 ), prevCol( len2+1 ); const size_t prevColSize = prevCol.size(); for( unsigned int i = 0; i < prevColSize; i++ ) prevCol[i] = i; for( unsigned int i = 0, j; i < len1; ++i ) { col[0] = i+1; const char s1i = s1[i]; for( j = 0; j < len2; ++j ) { const auto minPrev = 1 + std::min( col

How should I interpret these VTune results?

孤人 submitted on 2019-12-17 17:24:24
Question: I'm trying to parallelize this code using OpenMP. OpenCV (built using IPP for best efficiency) is used as an external library. I'm having problems with unbalanced CPU usage in the parallel for loops, but it seems that there is no load imbalance. As you will see, this could be because of KMP_BLOCKTIME=0, but this could be necessary because of the external libraries (IPP, TBB, OpenMP, OpenCV). In the rest of the question you will find more details and data that you can download. These are the Google Drive

Cannot locate debugging symbols and a lot of idle CPU usage

蹲街弑〆低调 submitted on 2019-12-13 07:04:33
Question: I'm new to VTune Amplifier and I'm trying to profile OpenCV with a very basic application. Following this guide on recommended compiler options, I compiled OpenCV via CMake with CMAKE_BUILD_TYPE=RelWithDebInfo and -DWITH_OPENMP=ON, so both the -O2 and -g options are included and OpenMP is enabled. My test OpenCV application is compiled with g++ -I/home/luca/Dropbox/SURFSPM/opencvInstall/include -O3 -g -Wall -c -fmessage-length=0 -MMD -MP -MF"main.d" -MT"main.o" -o "main.o" "../main.cpp" via

Understanding VTune report

我与影子孤独终老i submitted on 2019-12-11 01:47:10
Question: This is a follow-up to an existing thread (http://stackoverflow.com/questions/12724887/caching-in-a-high-performance-financial-application) - I found that it's not the cache that hinders my application. To cut a long story short, I have an application that spends 70 percent of its runtime in one function (15 seconds out of 22). Hence, I would like to cut the runtime of this function as much as possible, as the envisaged use of the function is for MUCH larger data (i.e. 22 seconds is not the

Optimizing SSE code

房东的猫 submitted on 2019-12-10 13:19:19
Question: I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster than the Java version (~20%). However, it's still not fast enough. Unfortunately, my experience with optimizing C code is somewhat limited. I would therefore love to get some ideas on how to improve the current implementation. The inner

Why does g++ (4.6 and 4.7) promote the result of this division to a double? Can I stop it?

让人想犯罪 __ submitted on 2019-12-10 05:52:46
Question: I was writing some templated code to benchmark a numeric algorithm using both floats and doubles, in order to compare against a GPU implementation. I discovered that my floating-point code was slower, and after investigating using VTune Amplifier from Intel I discovered that g++ was generating extra x86 instructions (cvtps2pd/cvtpd2ps and unpcklps/unpcklpd) to convert some intermediate results from float to double and then back again. The performance degradation is almost 10% for this