memory-bandwidth

How much does parallelization help the performance if the program is memory-bound?

偶尔善良 submitted on 2020-01-05 14:09:32
Question: I parallelized a Java program. On a Mac with 4 cores, these are the times for different numbers of threads:

threads   1           2           4           8           16
time      2597192200  1915988600  2086557400  2043377000  1931178200

On a Linux server with two sockets, each with 4 cores, these are the measured times:

threads   1           2           4           8           16
time      4204436859  2760602109  1850708620  2370905549  2422668438

As you can see, the speedup is far from linear. There is almost no parallelization overhead in this case, such as synchronization or I/O
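
To make "far from linear" concrete, the timings in the question can be turned into speedup and parallel-efficiency figures. The sketch below is only illustrative: the function names `speedup` and `efficiency` are hypothetical, and the time unit is whatever the question's measurements use (it is not stated).

```c
#include <stdint.h>

/* Speedup relative to the single-thread baseline: T(1) / T(n). */
double speedup(uint64_t t_single, uint64_t t_parallel)
{
    return (double)t_single / (double)t_parallel;
}

/* Parallel efficiency: speedup divided by the number of threads used.
   A value of 1.0 would correspond to perfectly linear scaling. */
double efficiency(uint64_t t_single, uint64_t t_parallel, unsigned threads)
{
    return speedup(t_single, t_parallel) / (double)threads;
}
```

With the numbers above, `speedup(2597192200, 1915988600)` is about 1.36 for 2 threads on the Mac, and `speedup(4204436859, 1850708620)` is about 2.27 for 4 threads on the Linux server, an efficiency of roughly 0.57.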

What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?

青春壹個敷衍的年華 submitted on 2019-12-28 03:05:27
Question: This question is specifically aimed at modern x86-64 cache-coherent architectures; I appreciate that the answer can be different on other CPUs. If I write to memory, the MESI protocol requires that the cache line is first read into cache, then modified in the cache (the value is written to the cache line, which is then marked dirty). In older write-through micro-architectures, this would then trigger the cache line being flushed; under write-back, the flush can be delayed for some

Fastest way to convert bytes to unsigned int

跟風遠走 submitted on 2019-12-14 03:50:05
Question: I have an array of bytes ( unsigned char * ) that must be converted to integers. Each integer is represented over three bytes. This is what I have done:

    // bytes array is allocated and filled
    // allocating space for intBuffer (uint32_t)
    unsigned long i = 0;
    uint32_t number;
    for (; i < size_tot; i += 3) {
        uint32_t number = (bytes[i] << 16) | (bytes[i+1] << 8) | bytes[i+2];
        intBuffer[number]++;
    }

This piece of code does its job well, but it is incredibly slow due to the three memory accesses (especially for
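
For reference, the conversion in the question can be isolated into a small helper. This is a minimal sketch of the same big-endian, three-bytes-per-value arithmetic with explicit casts and a bounds check; the names `load_be24` and `count_be24` are hypothetical, and the sketch does not by itself address the speed concern raised in the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble one big-endian 24-bit value from three consecutive bytes. */
uint32_t load_be24(const unsigned char *p)
{
    return ((uint32_t)p[0] << 16) | ((uint32_t)p[1] << 8) | (uint32_t)p[2];
}

/* Count occurrences of each 24-bit value found in bytes[0..size).
   The caller must make hist large enough for every value that occurs
   (1u << 24 entries in the general case). Trailing bytes that do not
   form a complete triple are ignored. */
void count_be24(const unsigned char *bytes, size_t size, uint32_t *hist)
{
    for (size_t i = 0; i + 3 <= size; i += 3) {
        hist[load_be24(bytes + i)]++;
    }
}
```

The loop condition `i + 3 <= size` keeps the last iteration from reading past the end when the buffer length is not a multiple of three.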

what does STREAM memory bandwidth benchmark really measure?

喜欢而已 submitted on 2019-12-12 12:19:34
Question: I have a few questions about the STREAM benchmark (http://www.cs.virginia.edu/stream/ref.html#runrules). Below is a comment from stream.c. What is the rationale for the requirement that arrays should be 4 times the size of the cache?

 * (a) Each array must be at least 4 times the size of the
 *     available cache memory. I don't worry about the difference
 *     between 10^6 and 2^20, so in practice the minimum array size
 *     is about 3.8 times the cache size.

I originally assumed STREAM measures the peak memory
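
The sizing rule in the quoted comment is simple arithmetic, sketched below. The helper name `stream_min_elements` is hypothetical (it is not part of stream.c), and the 8 MiB cache and 8-byte element size used in the note that follows are assumptions chosen only for illustration.

```c
#include <stddef.h>

/* Smallest element count for which one array occupies at least
   4 times the cache size, rounding up to a whole element. */
size_t stream_min_elements(size_t cache_bytes, size_t elem_bytes)
{
    size_t min_bytes = 4 * cache_bytes;
    return (min_bytes + elem_bytes - 1) / elem_bytes;
}
```

For an assumed 8 MiB cache and 8-byte elements, this gives 4,194,304 elements. Rounding that down to 4,000,000 elements yields 32,000,000 bytes per array, which is 4 × 10^6 / 2^20 ≈ 3.81 times the cache size; that is where the "about 3.8" in the comment comes from.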

SIMD intrinsic and memory bus size - How CPU fetches all 128/256 bits in a single memory read?

核能气质少年 submitted on 2019-12-11 17:01:04
Question: Hello Forum, I have a few related questions about SIMD intrinsics. I searched online, including Stack Overflow, but did not find good answers, so I am requesting your help. Basically, I am trying to understand how a 64-bit CPU fetches all 128 bits in a single read and what the requirements for such an operation are. Would the CPU fetch all 128 bits from memory in a single memory operation, or would it do two 64-bit reads? Do CPU manufacturers demand a certain size of memory bus, for example,

How to get memory bandwidth from memory clock/memory speed

我是研究僧i submitted on 2019-11-28 21:41:06
Question: FYI, here are the specs I got from Nvidia:

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680/specifications
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan/specifications

Note that the memory speed/memory clock are the same thing on their website and are both measured in Gbps. Thanks!

Answer 1: The Titan has a 384-bit bus while the GTX 680 has only 256, hence 50% more memory bandwidth (assuming clock and latencies are identical). Edit: I'll try to explain the whole concept a bit more: the following is a simplified model of the factors that determine the performance of RAM (not only
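
The bus-width argument in the answer can be written out as a formula: peak theoretical bandwidth is the bus width in bytes multiplied by the per-pin data rate. The sketch below assumes that model; the function name `peak_bandwidth_gbs` is hypothetical, and the 6.0 Gbps data rate used in the note afterwards is an assumed illustrative value that should be checked against the linked spec pages.

```c
/* Peak theoretical bandwidth in GB/s:
   (bus width in bits / 8) bytes per transfer, times the per-pin
   data rate in Gbps. Latency and real-world efficiency are ignored. */
double peak_bandwidth_gbs(unsigned bus_width_bits, double data_rate_gbps)
{
    return (double)bus_width_bits / 8.0 * data_rate_gbps;
}
```

Under the assumed 6.0 Gbps rate, a 256-bit bus gives 192 GB/s and a 384-bit bus gives 288 GB/s. The ratio is 384 / 256 = 1.5 regardless of the data rate, which matches the "50% more memory bandwidth" in the answer.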

Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

房东的猫 submitted on 2019-11-28 13:47:10
Question: Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better)

    uint8_t MyArray[10000000];

when the value at any position in the array is 0 or 1 in 95% of all cases, 2 in 4% of cases, and between 3 and 255 in the other 1% of cases? So, is there anything better than a uint8_t array to use for this? It should be as quick as possible to loop over the whole array in a random order, and this is very heavy on RAM bandwidth, so
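
One direction that fits this value distribution is to pack each entry into 2 bits, so that four entries share a byte and less memory traffic is needed per lookup. The sketch below is not taken from the question or its answers; the names `packed2_get` and `packed2_set` are hypothetical, and using code 3 as a marker for "value is 3 or more, look it up in a separate side table" is an assumption left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Read the 2-bit code stored at position idx (four codes per byte). */
unsigned packed2_get(const uint8_t *packed, size_t idx)
{
    return (packed[idx / 4] >> ((idx % 4) * 2)) & 0x3u;
}

/* Store a 2-bit code (0..3) at position idx, leaving the other
   three codes in the same byte unchanged. */
void packed2_set(uint8_t *packed, size_t idx, unsigned code)
{
    unsigned shift = (unsigned)(idx % 4) * 2;
    unsigned cleared = packed[idx / 4] & ~(0x3u << shift);
    packed[idx / 4] = (uint8_t)(cleared | ((code & 0x3u) << shift));
}
```

For the 10,000,000 entries in the question, the packed array needs 2,500,000 bytes instead of 10,000,000. Whether that is actually faster depends on the cost of the extra shifts and of the side-table lookups, so it would need to be measured.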

How to increase performance of memcpy

北城余情 submitted on 2019-11-27 17:03:26
Summary: memcpy seems unable to transfer over 2 GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?

Full details: As part of a data capture application (using some specialized hardware), I need to copy about 3 GB/sec from temporary buffers into main memory. To acquire data, I provide the hardware driver with a series of buffers (2 MB each). The hardware DMAs data to each buffer, and then notifies my program when each buffer is full. My program empties the buffer (memcpy to another, larger block of RAM), and reposts the processed buffer to the

How to get memory bandwidth from memory clock/memory speed

允我心安 submitted on 2019-11-27 14:03:31
Question: FYI, here are the specs I got from Nvidia:

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680/specifications
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan/specifications

Note that the memory speed/memory clock are the same thing on their website and are both measured in Gbps. Thanks!

Answer 1: The Titan has a 384-bit bus while the GTX 680 has only 256, hence 50% more memory bandwidth (assuming clock and latencies are identical). Edit: I'll try to explain the whole concept a