benchmarking | 易学教程

Can't reach peak performance

Question: I'm trying to reach the peak performance of each SM with the code below. The peak lies somewhere around 25 GFlops (GTX 275, GT200 architecture). This code gives 8 GFlops at most.
    __global__ void new_ker(float *x)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        float a, b;
        a = 0;
        b = x[index];
        // LOOP = 10000000
        // No. of blocks = 1
        // Threads per block = 512 (I'm using GTX 275 - GT200 Arch.)
        #pragma unroll 2048
        for (int i = 0; i < LOOP; i++) {
            a = a * b + b;
        }
        x[index] = a;
    }
I don't want to increase ILP in the code. Any
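As a rough sanity check on the figures quoted above, the achieved throughput can be recomputed from the launch parameters. This is a minimal sketch, assuming each `a = a*b + b` iteration counts as 2 floating-point operations (one multiply, one add). The helper names `total_flops` and `achieved_gflops` are made up for illustration and do not come from the question.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: one iteration of a = a*b + b is counted as 2 floating-point
// operations (one multiply and one add) per thread.
constexpr double kFlopsPerIteration = 2.0;

// Total floating-point operations performed by the whole launch.
double total_flops(std::uint64_t threads, std::uint64_t iterations) {
    return kFlopsPerIteration * static_cast<double>(threads) *
           static_cast<double>(iterations);
}

// Achieved throughput in GFlops for a measured kernel time in seconds.
double achieved_gflops(std::uint64_t threads, std::uint64_t iterations,
                       double seconds) {
    assert(seconds > 0.0);  // a zero or negative time is not a valid measurement
    return total_flops(threads, iterations) / seconds / 1e9;
}
```

With the launch described in the question (one block of 512 threads, LOOP = 10000000), `total_flops(512, 10000000)` is 1.024e10 operations, so the reported 8 GFlops corresponds to a kernel time of roughly 1.28 s.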

performance for reads of nsdictionary vs nsarray

Question: Continuing from this post: "Performance hit incurred using NSMutableDictionary vs. NSMutableArray". I am trying to run a small test to see whether the performance gap for reads and writes is really that large between NSArray and NSDictionary, as well as their mutable counterparts. However, I am having difficulty finding a "balanced" test, because the dictionary has 2 (or 3, depending on how you see it) objects to loop through to get the value (not the key) sought, while the array has only one. Any

Optimization barrier for microbenchmarks in MSVC: tell the optimizer you clobber memory?

Question: Chandler Carruth introduced two functions in his CppCon2015 talk that can be used to do some fine-grained inhibition of the optimizer. They are useful for writing micro-benchmarks that the optimizer won't simply nuke into meaninglessness.
    void clobber() {
        asm volatile("" : : : "memory");
    }
    void escape(void* p) {
        asm volatile("" : : "g"(p) : "memory");
    }
These use inline assembly statements to change the assumptions of the optimizer. The assembly statement in clobber states that the assembly code in it can read and write anywhere in memory. The actual assembly code is empty, but the optimizer won
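To make the role of the two helpers quoted above concrete, here is a minimal sketch of how they might be used in a micro-benchmark. It assumes GCC or Clang, since the `asm volatile` syntax is not accepted by MSVC (which is what the question is about). The workload and the function name `time_push_back` are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// The two helpers quoted above (GCC/Clang extended-asm syntax, not MSVC).
static void clobber() { asm volatile("" : : : "memory"); }
static void escape(void* p) { asm volatile("" : : "g"(p) : "memory"); }

// Hypothetical workload: time n push_back calls into a pre-reserved vector.
// Returns elapsed nanoseconds and writes the final element count to out_size.
long long time_push_back(std::size_t n, std::size_t* out_size) {
    assert(out_size != nullptr);
    std::vector<int> v;
    v.reserve(n);
    escape(v.data());  // the buffer pointer has "escaped": the optimizer must
                       // assume code it cannot see may inspect this memory
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        clobber();     // pretend all memory was read and written here, so the
                       // store just made cannot be dropped as dead
    }
    auto stop = std::chrono::steady_clock::now();
    *out_size = v.size();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

Neither helper emits any instructions. They only restrict what the optimizer may assume, which is why the timed loop keeps its stores even though nothing reads the vector afterwards.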

data.table time subset vs xts time subset

Question: Hi, I am looking to subset some minutely data by time. I normally use xts, doing something like:
    subset.string <- 'T10:00/T13:00'
    xts.min.obj[subset.string]
to get all the rows between 10am and 1pm (inclusive) EACH DAY, with the output in xts format. But it is a bit slow for my purposes, e.g.
    j <- xts(rnorm(10e6), Sys.time() - (10e6:1))
    system.time(j['T10:00/T16:00'])
       user  system elapsed
      5.704   0.577  17.115
I know that data.table is very fast at subsetting large datasets, so am

Vectorized string operations in Numpy: why are they rather slow?

This is one of those "mostly asked out of pure curiosity (in the possibly futile hope that I will learn something)" questions. I was investigating ways of saving memory in operations on massive numbers of strings, and for some scenarios it seems that string operations in numpy could be useful. However, I got somewhat surprising results:
    import random
    import string
    import numpy as np
    milstr = [''.join(random.choices(string.ascii_letters, k=10)) for _ in range(1000000)]
    npmstr = np.array(milstr, dtype=np.dtype(np.unicode_, 1000000))
Memory consumption using memory_profiler:
    %memit [x.upper() for x in milstr]
    peak memory: 420

Tools to benchmark web-services

What tools are best for measuring web-service performance? It would be nice to get a report of total transferred data, total POSTs, requests per second, time per request, transfer rate and response time per request.
Not quite for web services, but a very simple command-line tool for benchmarking HTTP performance is distributed with Apache. It is called ApacheBench and can be found in the bin directory as ab.exe (see ApacheBench's documentation).
I have used JMeter in the past. Check it out.
There is also http://www.soapui.org/, which our QA department used.
Source: https://stackoverflow.com/questions/1449169/tools

MySQL Benchmark

Question: I am trying to use MySQL benchmark to test some queries, but I am running into an error. The query
    SELECT benchmark (10000, (select title from user));
returns this error:
    ERROR 1242 (21000): Subquery returns more than 1 row
Does anyone know how to benchmark a query? Thanks.
Answer 1: The subquery
    select title from user
returns multiple rows, which won't work. Refer to this link: http://dev.mysql.com/doc/refman/5.0/en/information-functions.html#function_benchmark The expression you pass must return a

how would you benchmark the performance of a function

Here's perhaps a more advanced question. Suppose you have two functions that return a value:
    int F(int input1, int input2)
    {
        int output;
        // some algorithm that assigns value to output
        return output;
    }
    int D(int input1, int input2)
    {
        int output;
        // another algorithm that assigns value to output
        return output;
    }
with the condition that F(a,b) == D(a,b) (both return the same value for the same inputs). If you'd like to benchmark their performance, how would you do it? More precisely, how would you isolate the time it takes to perform F(a,b) or D(a,b) such that it does not reflect the time it takes
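One common way to approach the question above is to call each function many times inside a timed loop and divide by the call count. The following is a minimal sketch, not a definitive method. The bodies of `F` and `D` are hypothetical stand-ins (the question elides the real algorithms), and `ns_per_call` is a made-up helper name.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-ins for the question's F and D: both compute a * b
// for b >= 0, one by repeated addition and one directly.
int F(int a, int b) {
    int out = 0;
    for (int i = 0; i < b; ++i) out += a;
    return out;
}
int D(int a, int b) { return a * b; }

// Calls fn(a, b) `calls` times and returns the average nanoseconds per call.
template <typename Fn>
double ns_per_call(Fn fn, int calls, int a, int b) {
    assert(calls > 0);
    volatile int va = a, vb = b;  // volatile reads: inputs are reloaded on every
                                  // iteration, so the call cannot be hoisted
    volatile int sink = 0;        // volatile store: the result stays observable
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < calls; ++i) {
        sink = fn(va, vb);
    }
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() /
           calls;
}
```

A call such as `ns_per_call(F, 1000000, 3, 4)` can then be compared against `ns_per_call(D, 1000000, 3, 4)`. Both measurements include the same loop and volatile-access overhead, so the difference between them is more meaningful than either absolute number.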

Timing different sections in CUDA kernel

Question: I have a CUDA kernel that calls out to a series of device functions. What is the best way to get the execution time for each of the device functions? What is the best way to get the execution time for a section of code in one of the device functions?
Answer 1: In my own code, I use the clock() function to get precise timings. For convenience, I have the macros
    enum { tid_this = 0, tid_that, tid_count };
    __device__ float cuda_timers[ tid_count ];
    #ifdef USETIMERS
    #define TIMER_TIC clock_t tic; if (