multicore

How to make numba @jit use all CPU cores (parallelize numba @jit)

萝らか妹 submitted on 2019-11-30 06:52:32
I am using numba's @jit decorator to add two NumPy arrays in Python. Performance is much better with @jit than with plain Python, but it is not utilizing all CPU cores even when I pass @numba.jit(nopython=True, parallel=True, nogil=True). Is there any way to make numba @jit use all CPU cores? Here is my code:

    import time
    import numpy as np
    import numba

    SIZE = 2147483648 * 6
    a = np.full(SIZE, 1, dtype=np.int32)
    b = np.full(SIZE, 1, dtype=np.int32)
    c = np.ndarray(SIZE, dtype=np.int32)

    @numba.jit(nopython=True, parallel=True, nogil=True)
    def add(a, b, c):
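For reference, parallel=True only parallelizes loops numba can identify, so an explicit numba.prange loop is the usual fix. A minimal sketch (the smaller SIZE and the loop body are illustrative assumptions; the excerpt above cuts off before the asker's function body):

    import numba
    import numpy as np

    @numba.njit(parallel=True, nogil=True)
    def add(a, b, c):
        # prange asks numba to split these iterations across all cores
        for i in numba.prange(a.size):
            c[i] = a[i] + b[i]

    a = np.ones(10_000_000, dtype=np.int32)
    b = np.ones(10_000_000, dtype=np.int32)
    c = np.empty_like(a)
    add(a, b, c)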

Is it possible to set affinity with sched_setaffinity in Android?

大城市里の小女人 submitted on 2019-11-30 06:50:25
Is it possible to set CPU affinity in native C code compiled with the Android NDK? Since the system uses a Linux kernel, it should be possible to use the sched_setaffinity/sched_getaffinity functions, but when I compile with the NDK I get errors complaining that the cpu_set_t type (used as an argument to both functions) is unknown. Is there any other way to accomplish this? When I compile with CodeSourcery's ARM compiler (arm-none-linux-gnueabi-gcc) this is not a problem, so the system obviously supports the required syscalls. The following code works well with NDK r5
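A common workaround when the NDK headers lack cpu_set_t is to invoke the syscall directly with a raw bitmask. A hedged sketch (the helper name is mine, not from the question; bit n of the mask enables CPU n):

    #include <unistd.h>
    #include <sys/syscall.h>

    /* Fallback for NDKs whose headers lack cpu_set_t: pass a raw bitmask
       straight to the kernel's sched_setaffinity syscall. */
    static int set_thread_affinity(int cpu)
    {
        unsigned long mask = 1UL << cpu;            /* bit n = allow CPU n */
        pid_t tid = (pid_t) syscall(__NR_gettid);   /* affinity is per thread */
        return syscall(__NR_sched_setaffinity, tid, sizeof(mask), &mask);
    }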

What is a Warm-Up Cache?

别说谁变了你拦得住时间么 submitted on 2019-11-30 06:41:26
I am working with multicore simulators such as GEMS or M5. All of them have an option to "warm up the cache". What does that term mean? The warm-up is simply a period in which a data set is loaded so that the cache becomes populated with valid entries. If you are benchmarking a system that normally runs with a high cache-hit rate, skipping the warm-up gives misleading numbers: accesses that would be cache hits in your real usage scenario show up as misses and drag the results down. Source: https://stackoverflow.com/questions/434259/what-is-a-warm-up-cache
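In benchmark code the warm-up is often just one untimed pass over the working set. A minimal sketch of that idea (the benchmark shape is assumed, not from the answer above):

    // One untimed pass over the data so the timed runs start with a warm cache.
    #include <numeric>
    #include <vector>

    volatile long sink;  // prevents the compiler from optimizing the pass away

    void warm_up(const std::vector<int>& data) {
        sink = std::accumulate(data.begin(), data.end(), 0L);  // touches every line
    }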

Do multi-core CPUs share the MMU and page tables?

偶尔善良 submitted on 2019-11-30 06:30:05
Question: On a single-core computer, one thread executes at a time. On each context switch the scheduler checks whether the thread being scheduled belongs to the same process as the previous one. If so, nothing needs to be done regarding the MMU (page table). Otherwise, the MMU must be updated to point at the new process's page table. I am wondering how this works on a multi-core computer. I guess there is a dedicated MMU on each core, and if two threads of the same process are running

Does ProfileOptimization actually work?

断了今生、忘了曾经 submitted on 2019-11-30 06:26:08
One of the new performance enhancements in .NET 4.5 is the introduction of 'Multicore JIT'. See here for more details. I have tried this, but it seems to have no effect on my application. The reason I am interested is that my app (IronScheme) takes a good while to start up when not NGEN'd, which implies a fair amount of JITing at startup (1.4 s vs 0.1 s when NGEN'd). I have followed the instructions for enabling it, and I can see that a small (4-12 KB) profile file is created, but on subsequent startups it appears to have absolutely no effect on startup time. It is
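For context, the documented way to enable Multicore JIT in a non-ASP.NET app is two calls at startup (shown in C#, since ProfileOptimization is .NET-specific; the directory and profile name here are placeholders):

    using System.Runtime;

    class Program {
        static void Main() {
            // Directory where the JIT profile is stored between runs.
            ProfileOptimization.SetProfileRoot(@"C:\MyApp\ProfileCache");
            // Records JIT activity on the first run, replays it on background threads later.
            ProfileOptimization.StartProfile("Startup.Profile");
            // ... normal startup continues here ...
        }
    }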

How do memory fences work?

泄露秘密 submitted on 2019-11-30 06:01:07
I need to understand memory fences on multicore machines. Say I have this code:

Core 1:

    mov [_x], 1
    mov r1, [_y]

Core 2:

    mov [_y], 1
    mov r2, [_x]

Without memory fences, the unexpected result is that both r1 and r2 can be 0 after execution. In my opinion, to counter that problem we should put a memory fence in both code sequences, as putting one in only one sequence would still not solve the problem. Something like the following:

Core 1:

    mov [_x], 1
    memory_fence
    mov r1, [_y]

Core 2:

    mov [_y], 1
    memory_fence
    mov r2, [_x]

Is my understanding correct or am I still missing something? Assume the architecture
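The same experiment expressed with C++11 atomics, as a sketch (sequentially consistent stores and loads imply the store-to-load barrier the asker places by hand, so r1 == r2 == 0 becomes impossible):

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void core1() { x.store(1); r1 = y.load(); }  // seq_cst store, implicit fence, then load
    void core2() { y.store(1); r2 = x.load(); }  // seq_cst store, implicit fence, then load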

Scalable memory allocator experiences

不想你离开。 submitted on 2019-11-30 05:30:24
I am currently evaluating a few scalable memory allocators, namely nedmalloc and ptmalloc (both built on top of dlmalloc), as replacements for the default malloc/new because of significant contention seen in a multithreaded environment. Their published performance figures look good, but I would like to hear from people who have actually used them. Were your performance goals met? Did you run into any unexpected or hard-to-solve issues (such as heap corruption)? If you have tried both ptmalloc and nedmalloc, which of the two would you recommend? Why (ease of use
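For anyone attempting such a replacement, one common integration path is to route the global allocation operators through the candidate allocator. A sketch under the assumption that nedmalloc()/nedfree() are the entry points declared in nedmalloc.h (check the header; it may wrap them in a namespace when compiled as C++):

    #include <new>
    #include "nedmalloc.h"  // assumed header name from the nedmalloc distribution

    // Route all global new/delete traffic through the scalable allocator.
    void* operator new(std::size_t n) {
        void* p = nedmalloc(n);
        if (!p) throw std::bad_alloc();
        return p;
    }

    void operator delete(void* p) noexcept {
        nedfree(p);
    }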

Using Hardware Performance Counters in Linux

倖福魔咒の submitted on 2019-11-30 03:34:54
I want to use the hardware performance counters on Intel and AMD x86_64 multicore processors to count the number of retired stores in a program, with each thread counting its retired stores separately. Can it be done, and if so, how, in C/C++? You can use Perfctr or PAPI if you want to count hardware events from inside the program (without launching any third-party tool).

Perfctr quickstart: http://www.ale.csce.kyushu-u.ac.jp/~satoshi/how_to_use_perfctr.htm
PAPI homepage: http://icl.cs.utk.edu/papi/
PerfSuite documentation: http://perfsuite.ncsa.illinois.edu
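A hedged PAPI sketch of per-thread counting (PAPI_SR_INS is PAPI's preset for completed store instructions; whether it maps to a native event depends on the CPU, and error checking is omitted for brevity):

    #include <stdio.h>
    #include <pthread.h>
    #include <papi.h>

    void count_my_stores(void (*work)(void))
    {
        int evset = PAPI_NULL;
        long long stores = 0;

        PAPI_library_init(PAPI_VER_CURRENT);                        /* once per process */
        PAPI_thread_init((unsigned long (*)(void)) pthread_self);   /* per-thread counters */
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_SR_INS);                         /* store instructions preset */

        PAPI_start(evset);
        work();                                                     /* the code being measured */
        PAPI_stop(evset, &stores);
        printf("retired stores in this thread: %lld\n", stores);
    }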

Memory barriers force cache coherency?

雨燕双飞 submitted on 2019-11-30 03:26:11
Question: I was reading this question about using a bool for thread control and was intrigued by this answer by @eran: Using volatile is enough only on single cores, where all threads use the same cache. On multi-cores, if stop() is called on one core and run() is executing on another, it might take some time for the CPU caches to synchronize, which means two cores might see two different views of isRunning_. If you use synchronization mechanisms, they will ensure all caches get the same values, in the
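What the quoted answer points toward, sketched with a C++11 atomic in place of volatile (isRunning_ is the flag named in the quote; the surrounding functions are illustrative):

    #include <atomic>

    std::atomic<bool> isRunning_{true};

    void run()  { while (isRunning_.load()) { /* do work */ } }
    void stop() { isRunning_.store(false); }  // guaranteed to become visible to run()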

High-level Compare And Swap (CAS) functions?

巧了我就是萌 submitted on 2019-11-30 02:06:37
I'd like to document what high-level (i.e. C++, not inline assembler) functions or macros are available for the Compare And Swap (CAS) atomic primitive... E.g., Win32 on x86 has the _InterlockedCompareExchange family of functions in the <intrin.h> header. I'll let others list the various platform-specific APIs, but for future reference, in C++0x (now C++11) you get the atomic_compare_exchange() operation in the new "Atomic operations library". glib, a common system library on Linux and Unix systems (but also supported on Windows and Mac OS X), defines several atomic operations, including g_atomic_int
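For the C++11 form mentioned above, a small sketch of the usual CAS retry loop (the doubling operation is illustrative, not from the question):

    #include <atomic>

    // Atomically double v and return the value it held beforehand.
    int fetch_double(std::atomic<int>& v) {
        int old = v.load();
        // On failure, compare_exchange_weak reloads 'old' with the current
        // value, so the loop simply retries with fresh data.
        while (!v.compare_exchange_weak(old, old * 2)) { }
        return old;
    }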