vtune

Why is 'add' taking so long in my application?

南笙酒味 submitted on 2021-01-28 11:13:20
Question: I'm profiling an application using Intel VTune, and there is one particular hotspot where I'm copying a __m128i member variable in the copy constructor of a C++ class. VTune gives this breakdown (Instruction / CPU Time: Total / CPU Time: Self):

Block 1: vmovdqa64x (%rax), %xmm0     4.1%    0.760s
         add $0x10, %rax              46.6%   8.594s
Block 2: vmovapsx %xmm0, -0x10(%rdx)  6.5%    1.204s

(If it matters, the compiler is gcc 7.4.0.) I admit I'm an assembly noob, but it's very surprising that one particular add instruction is

CPU optimization practice for the Curve ChunkServer (open-source distributed storage)

醉酒当歌 submitted on 2020-12-15 09:49:27
The CPU bottleneck of the Curve ChunkServer. Curve is a next-generation distributed storage system open-sourced by NetEase Shufan. It features high performance, high availability, and high reliability, and can serve as the underlying storage for a variety of scenarios, including block storage, object storage, cloud-native databases, and EC. For a distributed block storage system, IOPS is the most important performance metric. Curve's current performance tests show that read IOPS is bottlenecked on the Client side: with a cluster of 6 storage nodes, a single Client node reaches nearly 300,000 read IOPS, and two Client nodes reach nearly 600,000. Write IOPS, however, still has room for improvement: on the same 6-node cluster it only reaches 260,000 to 280,000, with ChunkServer node CPU usage close to 100% while utilization of the underlying SSDs stays below 90%. The random-write IOPS scenario is therefore a key optimization target for Curve. We deployed Curve in test environment A (see Appendix 1 for the configuration), created 10 volumes on the Client node, and ran a 4KB random-write test. The result was about 135,000 write IOPS, with ChunkServer node CPU usage close to 100% and average utilization across all SSDs below 85%. This shows that ChunkServer-side CPU is the performance bottleneck. Given that the SSDs in the current test environment are relatively low-end, high-performance NVMe SSDs could deliver an order of magnitude more IOPS, which would make the CPU bottleneck even more severe. Optimizing CPU usage to unlock the SSDs' I/O capability is therefore an important direction for Curve performance work.

How to profile the number of additions, multiplications, etc. with VTune

懵懂的女人 submitted on 2020-01-06 04:26:12
Question: I am able to profile my C++ library's instruction counts with VTune using the 'INST_RETIRED.ANY' event. What analysis types or events can be used to profile the number of integer/floating-point additions, multiplications, divisions, etc.? Answer 1: (tl;dr) I don't think you can do everything you want with perf counters. See the end of this answer for a possible way using binary instrumentation. Also note that imul is not an expensive operation, and FP mul is barely more expensive than add. e.g.

Effects of loop unrolling on memory-bound data

白昼怎懂夜的黑 submitted on 2020-01-03 02:24:10
Question: I have been working with a piece of code that is intensively memory bound. I am trying to optimize it within a single core by manually implementing cache blocking, SW prefetching, loop unrolling, etc. Cache blocking gives a significant improvement in performance; however, when I introduce loop unrolling I get tremendous performance degradation. I am compiling with Intel icc with the compiler flags -O2 and -ipo in all my test cases. My code is similar to this (3D 25-point stencil): void

Hotspot in a for loop

扶醉桌前 submitted on 2019-12-21 04:56:09
Question: I am trying to optimize this code. static lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 ) { const size_t len1 = s1.size(), len2 = s2.size(); std::vector<unsigned int> col( len2+1 ), prevCol( len2+1 ); const size_t prevColSize = prevCol.size(); for( unsigned int i = 0; i < prevColSize; i++ ) prevCol[i] = i; for( unsigned int i = 0, j; i < len1; ++i ) { col[0] = i+1; const char s1i = s1[i]; for( j = 0; j < len2; ++j ) { const auto minPrev = 1 + std::min( col

How should I interpret these VTune results?

孤人 submitted on 2019-12-17 17:24:24
Question: I'm trying to parallelize this code using OpenMP. OpenCV (built using IPP for best efficiency) is used as an external library. I'm having problems with unbalanced CPU usage in the parallel for loops, but it seems that there is no load imbalance. As you will see, this could be because of KMP_BLOCKTIME=0, but this could be necessary because of the external libraries (IPP, TBB, OpenMP, OpenCV). In the rest of the question you will find more details and data that you can download. These are the Google Drive

Cannot locate debugging symbols and a lot of idle CPU usage

蹲街弑〆低调 submitted on 2019-12-13 07:04:33
Question: I'm new to VTune Amplifier and I'm trying to profile OpenCV with a very basic application. Following this guide on recommended compiler options, I compiled OpenCV via CMake with CMAKE_BUILD_TYPE=RelWithDebInfo and -DWITH_OPENMP=ON, so both the -O2 and -g options are included and OpenMP is enabled. My test OpenCV application is compiled with g++ -I/home/luca/Dropbox/SURFSPM/opencvInstall/include -O3 -g -Wall -c -fmessage-length=0 -MMD -MP -MF"main.d" -MT"main.o" -o "main.o" "../main.cpp" via

Understanding VTune report

我与影子孤独终老i submitted on 2019-12-11 01:47:10
Question: This is a follow-up to an existing thread (http://stackoverflow.com/questions/12724887/caching-in-a-high-performance-financial-application) - I found that it's not the cache that hinders my application. To cut a long story short, I have an application that spends 70 percent of its runtime in one function (15 seconds out of 22). Hence, I would like to cut the runtime of this function as much as possible, as the envisaged use of the function is for MUCH larger data (i.e. 22 seconds is not the

Optimizing SSE code

房东的猫 submitted on 2019-12-10 13:19:19
Question: I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster than the Java version (~20%). However, it's still not fast enough. Unfortunately, my experience with optimizing C code is somewhat limited. I would therefore love to get some ideas on how to improve the current implementation. The inner

Why does g++ (4.6 and 4.7) promote the result of this division to a double? Can I stop it?

让人想犯罪 __ submitted on 2019-12-10 05:52:46
Question: I was writing some templated code to benchmark a numeric algorithm using both floats and doubles, in order to compare against a GPU implementation. I discovered that my floating-point code was slower, and after investigating using VTune Amplifier from Intel I discovered that g++ was generating extra x86 instructions (cvtps2pd/cvtpd2ps and unpcklps/unpcklpd) to convert some intermediate results from float to double and then back again. The performance degradation is almost 10% for this