prefetch

When should we use prefetch?

Submitted by 拈花ヽ惹草 on 2019-11-27 14:30:56
Some CPUs and compilers supply prefetch instructions, e.g. __builtin_prefetch in the GCC documentation. There is a comment in GCC's documentation, but it's too brief for me. I want to know: in practice, when should we use prefetch? Are there some examples? Thanks! This question isn't really about compilers, as they're just providing a hook to insert prefetch instructions into your assembly code / binary. Different compilers may provide different intrinsic formats, but you can just ignore all of these and (carefully) add the instructions directly in assembly code. Now the real question seems to be "when are prefetches
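As a concrete illustration of the usual answer ("use it when the access pattern is predictable to you but not to the hardware"), here is a minimal sketch, not taken from the question, of a pointer-chasing loop that prefetches the next list node while the current one is being processed. The struct and function names are invented for illustration.

    /* Sketch: prefetch the next node of a linked list one step ahead so its
       memory latency overlaps with the work done on the current node. */
    #include <stddef.h>

    struct node {
        struct node *next;
        int payload[16];                        /* enough work per node to overlap */
    };

    long sum_list(const struct node *head)
    {
        long sum = 0;
        for (const struct node *n = head; n != NULL; n = n->next) {
            if (n->next != NULL)
                __builtin_prefetch(n->next, 0, 3);  /* read access, keep in cache */
            for (int i = 0; i < 16; i++)
                sum += n->payload[i];
        }
        return sum;
    }

Whether this helps depends on how much independent work each iteration has; prefetching only one node ahead is often too short a distance, so the effect should always be measured.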

Do current x86 architectures support non-temporal loads (from “normal” memory)?

Submitted by 大兔子大兔子 on 2019-11-27 11:01:55
I am aware of multiple questions on this topic; however, I haven't seen any clear answers or any benchmark measurements. I thus created a simple program that works with two arrays of integers. The first array a is very large (64 MB) and the second array b is small enough to fit into the L1 cache. The program iterates over a and adds its elements to the corresponding elements of b in a modular sense (when the end of b is reached, the program starts from its beginning again). The measured numbers of L1 cache misses for different sizes of b are as follows: The measurements were made on a Xeon E5 2680v3 Haswell
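For context, here is a rough reconstruction of the kernel the question describes; the array sizes and names are assumptions rather than the asker's exact program, and it uses plain loads (the question is precisely whether the loads from a can be made non-temporal).

    // Sketch of the described benchmark: stream over a large array a once and
    // accumulate into a small array b that should stay resident in L1.
    #include <cstddef>
    #include <vector>

    void add_modular(const int *a, std::size_t a_len, int *b, std::size_t b_len)
    {
        std::size_t j = 0;
        for (std::size_t i = 0; i < a_len; ++i) {
            b[j] += a[i];            // a is touched once; b wraps around ("modular")
            if (++j == b_len)
                j = 0;
        }
    }

    int main()
    {
        std::vector<int> a(64 * 1024 * 1024 / sizeof(int));  // ~64 MB, streamed
        std::vector<int> b(4096 / sizeof(int));              // small, fits in L1
        add_modular(a.data(), a.size(), b.data(), b.size());
    }

The interesting part is whether the loads from a can be marked non-temporal so they do not evict b; on ordinary write-back memory, x86 offers PREFETCHNTA and MOVNTDQA, but how much cache pollution they actually avoid is exactly what the benchmark is trying to measure.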

How to prefetch data using a custom python function in tensorflow

Submitted by 廉价感情. on 2019-11-27 06:04:05
I am trying to prefetch training data to hide I/O latency. I would like to write custom Python code that loads data from disk and preprocesses the data (e.g. by adding a context window). In other words, one thread does data preprocessing and the other does training. Is this possible in TensorFlow? Update: I have a working example based on @mrry's example.

    import numpy as np
    import tensorflow as tf
    import threading

    BATCH_SIZE = 5
    TRAINING_ITERS = 4100

    feature_input = tf.placeholder(tf.float32, shape=[128])
    label_input = tf.placeholder(tf.float32, shape=[128])
    q = tf.FIFOQueue(200, [tf.float32,
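Since the excerpt's code is cut off, here is a compact sketch of the TF 1.x queue-based prefetching pattern the update refers to: a background Python thread runs the custom loading/preprocessing code and feeds a tf.FIFOQueue, while the training thread dequeues already-prepared examples. The shapes, queue capacity, and the random-data loader below are placeholders, not the asker's actual pipeline.

    # TF 1.x sketch: one thread enqueues preprocessed data, the other trains.
    import threading
    import numpy as np
    import tensorflow as tf

    feature_input = tf.placeholder(tf.float32, shape=[128])
    label_input = tf.placeholder(tf.float32, shape=[128])
    q = tf.FIFOQueue(200, [tf.float32, tf.float32], shapes=[[128], [128]])
    enqueue_op = q.enqueue([feature_input, label_input])
    feature, label = q.dequeue()
    loss = tf.reduce_sum(feature * label)          # stand-in for a real model

    sess = tf.Session()

    def load_and_enqueue():
        # Custom Python I/O and preprocessing runs here, off the training thread.
        while True:
            f = np.random.rand(128).astype(np.float32)   # placeholder "disk read"
            y = np.random.rand(128).astype(np.float32)
            sess.run(enqueue_op, feed_dict={feature_input: f, label_input: y})

    threading.Thread(target=load_and_enqueue, daemon=True).start()

    for step in range(100):
        sess.run(loss)      # dequeues a preprocessed example; I/O is overlapped

In current TensorFlow the same effect is usually obtained with tf.data, e.g. Dataset.from_generator(...) followed by .prefetch(1), rather than hand-managed queues.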

why does GCC __builtin_prefetch not improve performance?

Submitted by 陌路散爱 on 2019-11-27 02:24:14
I'm writing a program to analyze a social-network graph, which means the program needs a lot of random memory accesses. It seems to me that prefetching should help. Here is a small piece of the code that reads values from the neighbors of a vertex.

    for (size_t i = 0; i < v.get_num_edges(); i++) {
        unsigned int id = v.neighbors[i];
        res += neigh_vals[id];
    }

I transformed the code above into the version below, which prefetches the values of the neighbors of a vertex.

    int *neigh_vals = new int[num_vertices];
    for (size_t i = 0; i < v.get_num_edges(); i += 128) {
        size_t this_end = std::min(v.get_num_edges(), i + 128);
        for
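The excerpt's transformed loop is cut off, so here is a hedged, self-contained sketch of the same batched-prefetch idea: issue prefetches for a block of neighbor ids in one pass, then read the values in a second pass. The flat vectors stand in for the asker's vertex/edge classes.

    // Sketch: prefetch neigh_vals entries for a block of neighbors, then sum them.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    long sum_neighbor_values(const std::vector<unsigned int> &neighbors,
                             const std::vector<int> &neigh_vals)
    {
        long res = 0;
        const std::size_t block = 128;
        for (std::size_t i = 0; i < neighbors.size(); i += block) {
            std::size_t this_end = std::min(neighbors.size(), i + block);
            for (std::size_t j = i; j < this_end; ++j)       // pass 1: prefetch
                __builtin_prefetch(&neigh_vals[neighbors[j]], 0, 3);
            for (std::size_t j = i; j < this_end; ++j)       // pass 2: consume
                res += neigh_vals[neighbors[j]];
        }
        return res;
    }

Whether this beats the plain loop depends on whether the out-of-order core was already overlapping the misses on its own and whether the prefetches arrive early enough; the usual answer to "why doesn't __builtin_prefetch help" is that one of those two conditions isn't met.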

Why does django's prefetch_related() only work with all() and not filter()?

Submitted by 瘦欲@ on 2019-11-26 23:49:40
Suppose I have this model:

    class PhotoAlbum(models.Model):
        title = models.CharField(max_length=128)
        author = models.CharField(max_length=128)

    class Photo(models.Model):
        album = models.ForeignKey('PhotoAlbum')
        format = models.IntegerField()

Now suppose I want to look at a subset of photos in a subset of albums efficiently. I do something like this:

    someAlbums = PhotoAlbum.objects.filter(author="Davey Jones").prefetch_related("photo_set")
    for a in someAlbums:
        somePhotos = a.photo_set.all()

This does only two queries, which is what I expect (one to get the albums, and then one like `SELECT * IN
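The usual resolution (and likely where the cut-off excerpt is heading) is that prefetch_related() caches the result of the plain photo_set.all() queryset; calling photo_set.filter(...) inside the loop builds a different queryset, ignores that cache, and hits the database once per album. Since Django 1.7 a filtered subset can itself be prefetched with a Prefetch object, sketched below; the format=1 filter and the jpeg_photos attribute name are made up for illustration.

    # Sketch (Django 1.7+): prefetch a *filtered* related queryset in one query.
    from django.db.models import Prefetch

    albums = (PhotoAlbum.objects
              .filter(author="Davey Jones")
              .prefetch_related(
                  Prefetch("photo_set",
                           queryset=Photo.objects.filter(format=1),
                           to_attr="jpeg_photos")))

    for a in albums:
        some_photos = a.jpeg_photos   # already fetched; no per-album query

This still issues only two queries, but the second one carries the extra filter.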

Prefetching Examples?

Submitted by 浪子不回头ぞ on 2019-11-26 19:33:38
Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria: It is a simple, small, self-contained example. Removing the __builtin_prefetch instruction results in performance degradation. Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation. That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn
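A frequently cited candidate for this kind of request is binary search over an array much larger than the last-level cache, prefetching both possible midpoints of the next step so that one useful cache line is always in flight. The sketch below is illustrative, not a guaranteed win; whether criteria 2 and 3 hold depends on the array size and the hardware.

    /* Sketch: binary search with both possible next midpoints prefetched.
       Only pays off when the array is far larger than the caches. */
    #include <stddef.h>

    long binary_search(const long *arr, size_t n, long key)
    {
        size_t lo = 0, hi = n;                    /* search in [lo, hi) */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            /* One of these two is the midpoint of the next iteration. */
            __builtin_prefetch(&arr[lo + (mid - lo) / 2], 0, 1);
            __builtin_prefetch(&arr[mid + 1 + (hi - mid - 1) / 2], 0, 1);
            if (arr[mid] < key)
                lo = mid + 1;
            else if (arr[mid] > key)
                hi = mid;
            else
                return (long)mid;
        }
        return -1;                                /* not found */
    }

On small arrays the extra prefetches are pure overhead, which is exactly why a convincing self-contained demonstration is hard to make portable.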

Non-temporal loads and the hardware prefetcher, do they work together?

Submitted by 你离开我真会死。 on 2019-11-26 16:41:22
When executing a series of _mm_stream_load_si128() calls (MOVNTDQA) from consecutive memory locations, will the hardware prefetcher still kick in, or should I use explicit software prefetching (with the NTA hint) in order to obtain the benefits of prefetching while still avoiding cache pollution? The reason I ask is that their objectives seem contradictory to me. A streaming load will fetch data bypassing the cache, while the prefetcher attempts to proactively fetch data into the cache. When sequentially iterating over a large data structure (processed data won't be retouched in a long
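For reference, here is a small sketch of how the two mechanisms are typically combined: MOVNTDQA streaming loads for the data being consumed now, plus an explicit PREFETCHNTA a fixed distance ahead instead of relying on the hardware prefetcher. It assumes a 16-byte-aligned buffer and SSE4.1; the prefetch distance is a tuning guess, and whether any of this beats plain loads on ordinary write-back memory is precisely what the question is asking.

    // Sketch: streaming loads + an NTA software prefetch a fixed distance ahead.
    // Requires SSE4.1 (-msse4.1); buf must be 16-byte aligned, len a multiple of 16.
    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    int64_t sum_stream(const uint8_t *buf, size_t len)
    {
        const size_t PF_DIST = 512;              // prefetch ~8 cache lines ahead
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < len; i += 16) {
            if (i + PF_DIST < len)
                _mm_prefetch((const char *)(buf + i + PF_DIST), _MM_HINT_NTA);
            __m128i v = _mm_stream_load_si128((__m128i *)(buf + i));
            acc = _mm_add_epi64(acc, v);         // accumulate as two 64-bit lanes
        }
        int64_t lanes[2];
        _mm_storeu_si128((__m128i *)lanes, acc);
        return lanes[0] + lanes[1];
    }

Note that on write-back (WB) memory MOVNTDQA is not architecturally guaranteed to bypass the caches, which is a large part of why this pair of questions keeps coming back.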
