smp

How are atomic operations implemented at a hardware level?

Question: I get that, at the assembly-language level, instruction set architectures provide compare-and-swap and similar operations. However, I don't understand how the chip is able to provide these guarantees. As I imagine it, the execution of the instruction must:
- fetch a value from memory,
- compare the value,
- depending on the comparison, possibly store another value back to memory.
What prevents another core from accessing the memory address after the first core has fetched it but before it stores the new value? Does …
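A minimal sketch of the software side of this, assuming GCC or Clang with C11's <stdatomic.h> (my illustration, not part of the original question): on x86 the compare-exchange below typically compiles to a single lock cmpxchg, and it is that LOCK'd read-modify-write that appears atomic to other cores.

    /* cas_sketch.c -- compile with: gcc -std=c11 cas_sketch.c */
    #include <stdatomic.h>
    #include <stdio.h>

    int main(void)
    {
        atomic_int value = 5;
        int expected = 5;

        /* Atomically: if value == expected, set value = 7; otherwise copy the
         * current contents of value into expected. No other core can slip a
         * store in between the load and the store of this one operation. */
        _Bool swapped = atomic_compare_exchange_strong(&value, &expected, 7);

        printf("swapped=%d value=%d\n", (int)swapped, atomic_load(&value));
        return 0;
    }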

OpenMP and NUMA relation?

I have a dual-socket Xeon E5522 2.26 GHz machine (with hyperthreading disabled) running Ubuntu Server on Linux kernel 3.0 with NUMA support. The layout is 4 physical cores per socket. An OpenMP application runs on this machine, and I have the following questions: Does an OpenMP program automatically take advantage of NUMA (i.e. is a thread and its private data kept on one NUMA node throughout execution) when running on a NUMA machine with a NUMA-aware kernel? If not, what can be done? What about NUMA and per-thread private C++ STL data structures? The current OpenMP standard defines a boolean …
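One common approach, shown as a sketch under assumptions rather than a definitive answer (the array size and file name are made up for illustration): rely on Linux's default first-touch page placement so each thread's portion of the data lands on the node it runs on, and pin threads with OMP_PROC_BIND / OMP_PLACES so they stay there.

    /* numa_touch.c -- compile with: gcc -std=c99 -fopenmp numa_touch.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const size_t n = 1 << 24;
        double *a = malloc(n * sizeof *a);   /* pages not yet placed on any node */

        /* First touch: each thread writes its own chunk, so those pages are
         * allocated on the node of the CPU running that thread. */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;

        /* Later passes with the same static schedule reuse the same chunks, so
         * accesses stay node-local -- provided the threads are pinned, e.g. by
         * running with: OMP_PROC_BIND=true OMP_PLACES=cores ./a.out */
        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (size_t i = 0; i < n; i++)
            sum += a[i];

        printf("%f (threads: %d)\n", sum, omp_get_max_threads());
        free(a);
        return 0;
    }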

Concurrent stores seen in a consistent order

The Intel Architectures Software Developer's Manual, Aug. 2012, vol. 3A, sect. 8.2.2: "Any two stores are seen in a consistent order by processors other than those performing the stores." But can this be so? The reason I ask is this: consider a dual-core Intel i7 processor with HyperThreading. According to the Manual's vol. 1, Fig. 2-8, the i7's logical processors 0 and 1 share an L1/L2 cache, but its logical processors 2 and 3 share a different L1/L2 cache -- whereas all the logical processors share a single L3 cache. Suppose that logical processors 0 and 2 -- which do not share an L1/L2 cache …
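For reference, the scenario the question describes is the classic IRIW (independent reads of independent writes) litmus test. Below is a C11 sketch of it (my illustration, not from the question); the quoted rule from sect. 8.2.2 is what forbids the two readers from disagreeing about the order of the two stores, i.e. the outcome r1=1, r2=0, r3=1, r4=0. A real litmus test would run these threads many millions of times and tally the outcomes.

    /* iriw.c -- compile with: gcc -std=c11 -pthread iriw.c */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int x, y;
    static int r1, r2, r3, r4;

    static void *writer_x(void *arg) { atomic_store(&x, 1); return arg; }
    static void *writer_y(void *arg) { atomic_store(&y, 1); return arg; }

    static void *reader_xy(void *arg)   /* e.g. running on logical CPU 0 */
    {
        r1 = atomic_load(&x);
        r2 = atomic_load(&y);
        return arg;
    }

    static void *reader_yx(void *arg)   /* e.g. running on logical CPU 2 */
    {
        r3 = atomic_load(&y);
        r4 = atomic_load(&x);
        return arg;
    }

    int main(void)
    {
        pthread_t t[4];
        pthread_create(&t[0], NULL, writer_x, NULL);
        pthread_create(&t[1], NULL, writer_y, NULL);
        pthread_create(&t[2], NULL, reader_xy, NULL);
        pthread_create(&t[3], NULL, reader_yx, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);

        /* A single consistent store order forbids r1=1,r2=0,r3=1,r4=0. */
        printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
        return 0;
    }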

What limits scaling in this simple OpenMP program?

I'm trying to understand limits to parallelization on a 48-core system (4x AMD Opteron 6348, 2.8 GHz, 12 cores per CPU). I wrote this tiny OpenMP code to test the speedup in what I thought would be the best possible situation (the task is embarrassingly parallel):

    // Compile with: gcc scaling.c -std=c99 -fopenmp -O3
    #include <stdio.h>
    #include <stdint.h>

    int main(){
        const uint64_t umin = 1;
        const uint64_t umax = 10000000000LL;
        double sum = 0.;
        #pragma omp parallel for reduction(+:sum)
        for (uint64_t u = umin; u < umax; u++)
            sum += 1./u/u;
        printf("%e\n", sum);
    }

I was surprised to find that the scaling is …
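To measure the speedup directly, one way (a sketch I've added, not from the question) is to time the same reduction for a doubling thread count with omp_get_wtime(); the smaller iteration range here is just so each run finishes quickly.

    /* timing.c -- compile with: gcc timing.c -std=c99 -fopenmp -O3 */
    #include <stdio.h>
    #include <stdint.h>
    #include <omp.h>

    int main(void)
    {
        const uint64_t umax = 100000000ULL;   /* smaller range than the original */
        for (int nt = 1; nt <= omp_get_max_threads(); nt *= 2) {
            omp_set_num_threads(nt);
            double sum = 0.0;
            double t0 = omp_get_wtime();
            #pragma omp parallel for reduction(+:sum)
            for (uint64_t u = 1; u < umax; u++)
                sum += 1./u/u;
            double t1 = omp_get_wtime();
            printf("threads=%2d time=%.3fs sum=%e\n", nt, t1 - t0, sum);
        }
        return 0;
    }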

____cacheline_aligned_in_smp for structure in the Linux kernel

In the Linux kernel, why do many structures use the ____cacheline_aligned_in_smp macro? Does it help increase performance when accessing the structure? If so, how?

Murthy Munna: Each cache line in any cache (dcache or icache) is 64 bytes on the x86 architecture. Cache-line alignment is needed to avoid false sharing of cache lines. When a cache line is shared between global variables (which happens often in the kernel) and one of those variables is changed by a processor in its cache, that cache line is marked dirty. In every other CPU's cache the line becomes a stale entry, which needs to be …
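A userspace analogue of the effect (my sketch, not kernel code): pad and align each hot per-thread counter to its own 64-byte line so that updates from different CPUs do not false-share a cache line.

    /* falseshare.c -- compiles as-is with: gcc -c falseshare.c */
    #include <stdint.h>

    #define CACHELINE 64

    struct counter {
        uint64_t value;
        /* padding so the next array element starts on a new cache line */
        char pad[CACHELINE - sizeof(uint64_t)];
    } __attribute__((aligned(CACHELINE)));

    /* One slot per CPU/thread. Without the alignment, several slots would share
     * a line and every increment would bounce that line between caches. */
    struct counter per_thread_hits[64];

    void hit(int thread_id)
    {
        per_thread_hits[thread_id].value++;
    }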

How are percpu pointers implemented in the Linux kernel?

On a multiprocessor, each core can have its own copy of a variable. I had thought these are different variables at different addresses, even though they are in the same process and have the same name. But I am wondering how the kernel implements this. Does it set aside a piece of memory to hold all the per-CPU copies, and each time redirect the pointer to the right address with an offset or something?

Normal global variables are not per-CPU. Automatic variables are on the stack, and different CPUs use different stacks, so naturally they get separate variables. I guess you're referring to Linux's per-CPU …
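As a rough model of the idea (not the kernel's actual mechanism; the array and the use of sched_getcpu() are my stand-ins): keep one cache-line-sized slot per CPU and turn "the variable" into "my CPU's slot" by indexing with the current CPU id. The kernel instead adds a per-CPU offset to the variable's address, but the effect is similar.

    /* percpu_model.c -- compile with: gcc percpu_model.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    #define MAX_CPUS  64
    #define CACHELINE 64

    struct percpu_slot {
        long counter;
        char pad[CACHELINE - sizeof(long)];  /* keep each CPU's copy on its own line */
    } __attribute__((aligned(CACHELINE)));

    static struct percpu_slot slots[MAX_CPUS];

    /* Same name everywhere, different address depending on which CPU asks. */
    static long *this_cpu_counter(void)
    {
        return &slots[sched_getcpu()].counter;
    }

    int main(void)
    {
        (*this_cpu_counter())++;
        printf("cpu %d counter = %ld\n", sched_getcpu(), *this_cpu_counter());
        return 0;
    }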

What is TLB shootdown?

What is a TLB shootdown in SMPs? I am unable to find much information on this concept. Any good example would be very much appreciated.

Carl Norum: A quick example: you have some memory shared by all of the processors in your system. One of your processors restricts access to a page of that shared memory. Now all of the processors have to flush their TLBs, so that the ones that were allowed to access that page can't do so any more. The action of one processor causing the TLBs to be flushed on other processors is what is called a TLB shootdown.

Gabe: A TLB (Translation Lookaside Buffer) …
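A conceptual model of that sequence (hypothetical function names, not kernel code): the CPU that changes a page mapping invalidates its own TLB, interrupts every other CPU so each drops its stale entry too, and waits for their acknowledgements before relying on the new mapping.

    /* shootdown_model.c -- compile with: gcc shootdown_model.c */
    #include <stdio.h>

    #define NCPUS 4

    static void flush_local_tlb(int cpu)   { printf("cpu%d: flush TLB\n", cpu); }
    static void send_ipi(int cpu)          { printf("  -> IPI to cpu%d\n", cpu); }
    static void ack_received_from(int cpu) { printf("  <- ack from cpu%d\n", cpu); }

    static void tlb_shootdown(int initiator)
    {
        /* 1. Initiator updates the page-table entry, then invalidates its own TLB. */
        flush_local_tlb(initiator);

        /* 2. It sends an inter-processor interrupt to every other CPU. */
        for (int cpu = 0; cpu < NCPUS; cpu++)
            if (cpu != initiator)
                send_ipi(cpu);

        /* 3. Each interrupted CPU's handler flushes its TLB and acknowledges;
         *    the initiator waits until all acks arrive. */
        for (int cpu = 0; cpu < NCPUS; cpu++)
            if (cpu != initiator) {
                flush_local_tlb(cpu);
                ack_received_from(cpu);
            }
    }

    int main(void) { tlb_shootdown(0); return 0; }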

How to use the APIC to create IPIs to wake the APs for SMP in x86 assembly?

In a post-boot environment (no OS), how would one use the BSP (first core/processor) to create IPIs for the APs (all other cores/processors)? Essentially, how does one wake the other cores and set their instruction pointers when starting from one core?

WARNING: I've assumed 80x86 here. If it's not 80x86 then I don't know :-) First you need to find out how many other CPUs exist and what their APIC IDs are, and determine the physical address of the local APICs. To do this you parse the ACPI tables (see MADT/APIC in the ACPI specification). If you can't find valid ACPI tables (e.g. the computer is too old) …
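Once the APIC IDs are known, the usual wake-up sequence is an INIT IPI followed by two STARTUP IPIs written to the local APIC's ICR. Below is a rough C sketch I've added (not from the original answer); the register offsets and command words follow the Intel SDM's xAPIC description, but the APIC base address, the delay_us() helper, and the exact delays are assumptions that should be verified against the manual for your platform.

    /* start_ap.c -- freestanding sketch, assumes identity-mapped MMIO */
    #include <stdint.h>

    #define LAPIC_BASE     0xFEE00000UL  /* typical default; confirm via IA32_APIC_BASE MSR */
    #define LAPIC_ICR_LOW  0x300
    #define LAPIC_ICR_HIGH 0x310

    static volatile uint32_t *lapic = (volatile uint32_t *)LAPIC_BASE;

    static void lapic_write(uint32_t reg, uint32_t val) { lapic[reg / 4] = val; }

    static void lapic_wait_idle(void)
    {
        while (lapic[LAPIC_ICR_LOW / 4] & (1u << 12))  /* delivery-status bit */
            ;
    }

    extern void delay_us(unsigned us);  /* hypothetical platform-specific busy wait */

    /* Wake one AP whose real-mode startup code sits at physical address
     * (vector << 12), i.e. a 4 KiB-aligned page below 1 MiB. */
    void start_ap(uint8_t apic_id, uint8_t vector)
    {
        /* INIT IPI: put the AP into a known wait-for-SIPI state. */
        lapic_write(LAPIC_ICR_HIGH, (uint32_t)apic_id << 24);
        lapic_write(LAPIC_ICR_LOW,  0x00004500);            /* INIT, level assert */
        lapic_wait_idle();
        delay_us(10000);                                    /* ~10 ms */

        /* Two STARTUP IPIs, as the MP specification recommends. */
        for (int i = 0; i < 2; i++) {
            lapic_write(LAPIC_ICR_HIGH, (uint32_t)apic_id << 24);
            lapic_write(LAPIC_ICR_LOW,  0x00004600 | vector);  /* SIPI */
            delay_us(200);
            lapic_wait_idle();
        }
    }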