intel

How to use Intel C++ Compiler with CUDA nvcc?

谁都会走 submitted on 2019-12-18 08:57:42
Question: I'm using NVIDIA CUDA 4.1 with Microsoft Visual Studio 2008. I also have Intel Parallel Studio XE 2011 installed. By default, NVIDIA's compiler driver nvcc.exe uses Microsoft's C compiler cl.exe to compile the host C code. How can I change the settings so that nvcc uses Intel's C compiler icl.exe instead?

Answer 1: Unfortunately you cannot (or at least it's highly discouraged). The only host compiler supported on Windows is Visual Studio's cl.exe. Unless something has changed and they now support Intel's compilers, I wouldn't …
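
The knob involved is nvcc's -ccbin (long form --compiler-bindir) option, which tells nvcc which host compiler to invoke; on Windows it only accepts Visual Studio's cl.exe, but on Linux the same option is how you would point nvcc at a different host compiler. A minimal sketch, assuming a Linux machine and a CUDA release whose documentation actually lists the Intel compiler among the supported host compilers (not the Windows setup in the question):

    # hypothetical Linux invocation; whether icpc is accepted depends on the CUDA release
    nvcc -ccbin icpc -O2 -o app app.cu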

How to turn on C++0x support in Intel C++ Compiler 12.1.2

无人久伴 submitted on 2019-12-18 08:56:17
Question: I installed the latest version of the Intel C++ Compiler, v12.1.2, on Arch Linux 3.2.1. When I used icpc to compile my C++ file with

    icpc -O3 -DNDEBUG -std=gnu++0x -o obj/main.o src/main.cpp -c

or

    icpc -O3 -DNDEBUG -std=c++0x -o obj/main.o src/main.cpp -c

a warning popped up:

    Warning #2928: the __GXX_EXPERIMENTAL_CXX0X__ macro is disabled when using GNU version 4.6 with the c++0x option

My main.cpp contains many C++0x features such as rvalue references, auto, etc. But the Intel compiler did not work in …
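
A quick way to check whether the C++0x language features are actually enabled, regardless of the warning about __GXX_EXPERIMENTAL_CXX0X__, is to compile a tiny translation unit that uses them. A minimal sketch (the file name and contents are illustrative, not from the original post):

    // cxx0x_check.cpp -- probes auto and rvalue references
    #include <utility>
    #include <vector>

    int take(std::vector<int>&& v) { return static_cast<int>(v.size()); }  // rvalue reference

    int main() {
        std::vector<int> v(3, 42);
        auto n = take(std::move(v));   // auto type deduction + std::move
        return n;
    }

If icpc -std=c++0x -c cxx0x_check.cpp succeeds, the language features themselves are available even though the macro may not be defined.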

Core profile vs version string? Only getting GLSL 1.3/OGL 3.0 in mesa 10.0.1

℡╲_俬逩灬. submitted on 2019-12-18 04:49:08
Question: In theory, Mesa 10.0.1 should support OpenGL 3.3, but currently I'm only getting 3.0 support. glxinfo gives some confusing results...

    [pdel@architect build]$ glxinfo | grep -i opengl
    OpenGL vendor string: Intel Open Source Technology Center
    OpenGL renderer string: Mesa DRI Intel(R) Ivybridge Mobile
    OpenGL core profile version string: 3.3 (Core Profile) Mesa 10.0.1
    OpenGL core profile shading language version string: 3.30
    OpenGL core profile context flags: (none)
    OpenGL core profile profile …
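
The usual explanation for output like this is that Mesa only advertises GL 3.3 on a core profile context; a default (compatibility) context stays at 3.0 on this driver, which matches the 3.0 the poster is seeing. One way to see the 3.3 context from code is to request a core profile explicitly. A minimal sketch using GLFW (GLFW is my choice of helper library here, not something from the original post):

    // core_ctx.cpp -- request a 3.3 core profile context and print the version
    #include <GLFW/glfw3.h>
    #include <cstdio>

    int main() {
        if (!glfwInit()) return 1;
        glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
        glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);
        glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
        GLFWwindow* win = glfwCreateWindow(640, 480, "core", NULL, NULL);
        if (!win) { glfwTerminate(); return 1; }
        glfwMakeContextCurrent(win);
        // expect "3.3 (Core Profile) Mesa 10.0.1" here
        std::printf("GL_VERSION: %s\n",
                    reinterpret_cast<const char*>(glGetString(GL_VERSION)));
        glfwTerminate();
        return 0;
    }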

How to transpose a 16x16 matrix using SIMD instructions?

假装没事ソ submitted on 2019-12-17 23:47:18
Question: I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which support 512-bit operations. Now, assuming there's a matrix held in 16 SIMD registers, each containing 16 32-bit integers (one row per register), how can I transpose the matrix with SIMD instructions only? There are already solutions for transposing 4x4 and 8x8 matrices with SSE and AVX2 respectively, but I couldn't figure out how to extend them to 16x16 with AVX-512. Any ideas?

Answer 1: For two operand …
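
The answer excerpt above concerns two-operand shuffle instructions; the shuffle/permute route is the fast one. As a much simpler (though typically slower) illustration of a pure-SIMD 16x16 transpose, each output row can also be gathered as a strided column of the row-major input. This gather-based fallback is my own sketch, not the method from the answer:

    // transpose16_gather.cpp -- AVX-512F gather-based 16x16 transpose sketch
    #include <immintrin.h>
    #include <stdint.h>

    // 'in' and 'out' are 16x16 row-major int32 matrices; 'out' is 64-byte aligned.
    void transpose16_gather(const int32_t* in, int32_t* out) {
        // dword indices 0, 16, 32, ..., 240: one element per input row
        const __m512i idx = _mm512_set_epi32(240, 224, 208, 192, 176, 160, 144, 128,
                                             112,  96,  80,  64,  48,  32,  16,   0);
        for (int j = 0; j < 16; ++j) {
            // gather column j: in[0*16+j], in[1*16+j], ..., in[15*16+j]
            __m512i col = _mm512_i32gather_epi32(idx, in + j, 4);
            _mm512_store_si512(out + 16 * j, col);   // becomes row j of the transpose
        }
    }

Compile with -mavx512f (or the equivalent MSVC switch); a shuffle-based version along the lines of the answer should beat this once the gathers become the bottleneck.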

GCC compiles leading zero count poorly unless Haswell specified

跟風遠走 submitted on 2019-12-17 20:18:14
Question: GCC supports the __builtin_clz(int x) builtin, which counts the number of leading zeros (consecutive most-significant zero bits) in the argument. Among other things, this is great for efficiently implementing the lg(unsigned int x) function, which takes the base-2 logarithm of x, rounding down:

    /** return the base-2 log of x, where x > 0 */
    unsigned lg(unsigned x) {
        return 31U - (unsigned)__builtin_clz(x);
    }

This works in the straightforward way - in particular consider the case …
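
The title alludes to the code GCC emits for this: without a target that is known to have the lzcnt instruction, the builtin has to be synthesized from the older bsr instruction plus extra arithmetic, whereas -march=haswell (or -mlzcnt) lets it become a single lzcnt. A small sketch of the two variants, assuming an x86-64 target (the function names are illustrative):

    // lg.cpp -- two ways to compute floor(log2(x)) for x > 0
    #include <immintrin.h>

    // Portable builtin; the generated code depends on the -march / -m flags.
    unsigned lg_builtin(unsigned x) {
        return 31U - (unsigned)__builtin_clz(x);
    }

    // Explicit lzcnt intrinsic; requires -mlzcnt or -march=haswell (or newer).
    unsigned lg_lzcnt(unsigned x) {
        return 31U - _lzcnt_u32(x);
    }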

Do 128bit cross lane operations in AVX512 give better performance?

一曲冷凌霜 submitted on 2019-12-17 18:29:34
Question: In designing forward-looking algorithms for AVX256, AVX512, and one day AVX1024, and considering the potential implementation complexity/cost of fully generic permutes at large SIMD widths, I wondered whether it is better to generally keep to isolated 128-bit operations even within AVX512, especially given that some AVX hardware has executed 256-bit operations on 128-bit units. To that end I wanted to know if there was a performance difference between AVX512 permute-type operations across all of the 512-bit vector as …
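
For experiments along these lines, the two categories are easy to express side by side with intrinsics: shuffles whose element movement stays inside each 128-bit lane, versus permutes whose indices can reach across the whole 512-bit register. A minimal sketch (illustrative only; the actual latency/throughput comparison has to come from measurement or the optimization manuals):

    // lanes.cpp -- in-lane vs. cross-lane 32-bit element reordering in AVX-512
    #include <immintrin.h>

    // In-lane: a fixed reordering applied independently within each 128-bit lane.
    __m512i shuffle_within_lanes(__m512i v) {
        return _mm512_shuffle_epi32(v, _MM_PERM_BADC);
    }

    // Cross-lane: each destination dword may come from anywhere in the 512-bit vector.
    __m512i reverse_all_dwords(__m512i v) {
        const __m512i idx = _mm512_set_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                             8, 9, 10, 11, 12, 13, 14, 15);
        return _mm512_permutexvar_epi32(idx, v);   // full 16-element reversal
    }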

Why use _mm_malloc? (as opposed to _aligned_malloc, aligned_alloc, or posix_memalign)

那年仲夏 submitted on 2019-12-17 18:16:06
Question: There are a few options for acquiring an aligned block of memory, but they're very similar and the issue mostly boils down to what language standard and platforms you're targeting.

    C11:     void *aligned_alloc(size_t alignment, size_t size);
    POSIX:   int posix_memalign(void **memptr, size_t alignment, size_t size);
    Windows: void *_aligned_malloc(size_t size, size_t alignment);

And of course it's also always an option to align by hand. Intel offers another option:

    Intel:   void *_mm_malloc(int size, int …
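
Whichever allocator is chosen, each pairs with its own deallocation function, which in practice is the main reason to stay within one family. A small usage sketch (assuming a recent GCC, Clang, or ICC on x86; error handling omitted):

    // aligned.cpp -- two ways to get 64-byte aligned storage
    #include <immintrin.h>   // _mm_malloc / _mm_free
    #include <cstdlib>       // std::aligned_alloc / std::free (C++17, mirrors C11)

    int main() {
        // Intel intrinsics family: must be released with _mm_free, not free().
        float* a = static_cast<float*>(_mm_malloc(1024 * sizeof(float), 64));

        // C11/C++17: the size must be a multiple of the alignment; released with free().
        float* b = static_cast<float*>(std::aligned_alloc(64, 1024 * sizeof(float)));

        _mm_free(a);
        std::free(b);
        return 0;
    }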

How to control which core a process runs on?

可紊 submitted on 2019-12-17 15:15:37
Question: I can understand how one can write a program that uses multiple processes or threads: fork() a new process and use IPC, or create multiple threads and use those sorts of communication mechanisms. I also understand context switching. That is, with only one CPU, the operating system schedules time for each process (and there are tons of scheduling algorithms out there), and thereby we achieve running multiple processes simultaneously. And now that we have multi-core processors (or multi …
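
On Linux the standard answer to the question in the title is CPU affinity, set either from the shell with taskset or programmatically with sched_setaffinity. A minimal sketch of pinning the calling process to core 2 (the core number is just an example):

    // pin.cpp -- restrict the calling process to one CPU core on Linux
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);                                     // only core 2 is allowed
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {   // pid 0 = calling process
            std::perror("sched_setaffinity");
            return 1;
        }
        // CPU-bound work from here on stays on core 2.
        return 0;
    }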

How to calculate time for an asm delay loop on x86 linux?

送分小仙女□ submitted on 2019-12-17 14:57:05
Question: I was going through this link, delay in assembly, about adding a delay in assembly. I want to perform some experiments by adding different delay values. The useful code to generate the delay:

    ; start delay
    mov bp, 43690
    mov si, 43690
    delay2:
    dec bp
    nop
    jnz delay2
    dec si
    cmp si,0
    jnz delay2
    ; end delay

What I understood from the code is that the delay is proportional to the time spent executing the nop instructions (43690 x 43690 of them). So on a different system and a different OS version, the delay will be different. Am I right?
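
To put a number on the delay for a particular machine, the simplest approach is to time it with a monotonic clock and derive the per-iteration cost. A small measurement sketch in C++ (the loop is a plain stand-in for the asm above, kept alive with volatile):

    // time_delay.cpp -- measure a busy-wait loop on Linux
    #include <ctime>
    #include <cstdio>

    int main() {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        // Stand-in for the asm delay: 43690 * 43690 dummy iterations.
        for (volatile unsigned long i = 0; i < 43690UL * 43690UL; ++i) { }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        std::printf("delay: %.6f s (%.3f ns per iteration)\n",
                    secs, secs * 1e9 / (43690.0 * 43690.0));
        return 0;
    }

And yes, since such a loop counts instructions rather than time, the wall-clock delay will differ across CPUs, clock speeds, and systems.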