simd

Most recent processor without support for SSSE3 instructions? [closed]

Submitted by 半世苍凉 on 2019-12-10 17:18:53
Question: Are there any still-relevant CPUs (Intel/AMD/Atom) which don't support SSSE3 instructions? What's the most recent CPU without SSSE3? Answer 1: The most recent CPUs without SSSE3 are based on the AMD K10 microarchitecture: AMD Phenom II, the last generation of socketed K10 desktop CPUs before the Bulldozer family. They were …
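As a hedged aside, SSSE3 is old enough that a run-time check plus a scalar fallback is usually the pragmatic way to keep supporting those K10-era machines; a minimal sketch using GCC's __builtin_cpu_supports (the dispatch structure is illustrative, not taken from the answer):

    #include <stdio.h>

    int main(void)
    {
        /* GCC/clang builtin: run-time CPUID check for SSSE3 */
        if (__builtin_cpu_supports("ssse3"))
            puts("using the SSSE3 code path");
        else
            puts("scalar fallback (e.g. AMD Phenom II / other K10 parts)");
        return 0;
    }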

using restrict qualifier with C99 variable length arrays (VLAs)

Submitted by 混江龙づ霸主 on 2019-12-10 15:15:08
Question: I am exploring how different implementations of simple loops in C99 auto-vectorize based upon the function signature. Here is my code:

    /* #define PRAGMA_SIMD _Pragma("simd") */
    #define PRAGMA_SIMD
    #ifdef __INTEL_COMPILER
    #define ASSUME_ALIGNED(a) __assume_aligned(a,64)
    #else
    #define ASSUME_ALIGNED(a)
    #endif
    #ifndef ARRAY_RESTRICT
    #define ARRAY_RESTRICT
    #endif
    void foo1(double * restrict a, const double * restrict b, const double * restrict c)
    {
        ASSUME_ALIGNED(a);
        ASSUME_ALIGNED(b);
        ASSUME …
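For context, here is a self-contained sketch of the two signature styles such experiments typically compare (function names are hypothetical, not taken from the question): restrict-qualified pointer parameters versus C99 VLA parameters carrying the same restrict guarantee, either of which lets the compiler assume the arrays don't overlap and auto-vectorize the loop.

    #include <stddef.h>

    /* restrict-qualified pointer parameters */
    void add_ptr(size_t n, double * restrict a,
                 const double * restrict b, const double * restrict c)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

    /* C99 VLA parameters expressing the same non-aliasing promise */
    void add_vla(size_t n, double a[restrict n],
                 const double b[restrict n], const double c[restrict n])
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }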

sse/avx equivalent for neon vuzp

Submitted by 旧城冷巷雨未停 on 2019-12-10 14:44:19
Question: Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. the SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, they do this:

    inputs:       (A0 A1 A2 A3) (B0 B1 B2 B3)
    unpacklo/hi:  (A0 B0 A1 B1) (A2 B2 A3 B3)

The equivalent of unpack is vzip in ARM's NEON instruction set. However, NEON also provides the operation vuzp, which is the inverse of vzip. For 4 elements in a vector, it does this:

    inputs: (A0 A1 A2 A3 …
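A minimal sketch of the usual SSE answer for 32-bit floats, assuming _mm_shuffle_ps is acceptable (the helper name is hypothetical): two shuffles reproduce NEON's vuzp, deinterleaving (A0 B0 A1 B1) and (A2 B2 A3 B3) back into (A0 A1 A2 A3) and (B0 B1 B2 B3).

    #include <immintrin.h>

    static inline void unzip_ps(__m128 lo, __m128 hi, __m128 *even, __m128 *odd)
    {
        /* pick elements 0 and 2 of each input -> (A0 A1 A2 A3) */
        *even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));
        /* pick elements 1 and 3 of each input -> (B0 B1 B2 B3) */
        *odd  = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1));
    }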

Intel AVX: Why is there no 256-bit version of dot product for double-precision floating-point variables? [closed]

Submitted by 时光怂恿深爱的人放手 on 2019-12-10 14:24:00
Question: In another question on SO we tried (and succeeded) to find a way to replace the missing AVX instruction: __m256d _mm256_dp_pd(__m256d …
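The substitute the linked question converged on is not quoted here, but a commonly used sketch looks like this (helper name hypothetical): multiply, horizontal-add within each 128-bit lane, then fold the two lanes together; the scalar dot product ends up duplicated in both halves of the returned __m128d.

    #include <immintrin.h>

    static inline __m128d dot4_pd(__m256d x, __m256d y)
    {
        __m256d xy  = _mm256_mul_pd(x, y);             /* x0*y0 .. x3*y3       */
        __m256d sum = _mm256_hadd_pd(xy, xy);          /* (s01, s01, s23, s23) */
        __m128d hi  = _mm256_extractf128_pd(sum, 1);   /* (s23, s23)           */
        return _mm_add_pd(_mm256_castpd256_pd128(sum), hi);  /* (s01+s23, s01+s23) */
    }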

How to increment a vector in AVX/AVX2

Submitted by 本秂侑毒 on 2019-12-10 14:12:12
Question: I want to use intrinsics to increment the elements of a SIMD vector. The simplest way seems to be to add 1 to each element, like this (note: vec_inc has been set to 1 beforehand): vec = _mm256_add_epi16(vec, vec_inc); But is there any special instruction to increment a vector, like inc on this page? Or any other easier way? Answer 1: The INC instruction is not a SIMD-level instruction; it operates on integer scalars. As you and Paul already suggested, the simplest way is to add 1 to each vector …
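A minimal, self-contained version of the approach the answer describes, with the constant built via _mm256_set1_epi16 (in a loop the broadcast would normally be hoisted outside):

    #include <immintrin.h>

    static inline __m256i increment_epi16(__m256i vec)
    {
        const __m256i vec_inc = _mm256_set1_epi16(1);  /* broadcast 1 to all 16 lanes */
        return _mm256_add_epi16(vec, vec_inc);         /* there is no SIMD "inc"      */
    }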

For an SSE vector that has all the same components, generate on the fly or precompute?

Submitted by 为君一笑 on 2019-12-10 13:34:24
Question: When I need to do a vector operation whose operand is just a float broadcast to every component, should I precompute the __m256 or __m128 and load it when I need it, or broadcast the float to the register using _mm_set1_ps every time I need the vector? I have been precomputing the vectors that are very important and highly used, and generating on the fly the ones that are less important. But am I really gaining any speed with precomputing? Is it worth the trouble? Is the _mm …
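A hedged sketch of what "generate on the fly" usually compiles to in practice (kernel name hypothetical): _mm_set1_ps done once outside the loop is a single broadcast that then lives in a register, so precomputing the vector and loading it from memory rarely buys anything.

    #include <stddef.h>
    #include <immintrin.h>

    void scale(float *dst, const float *src, float s, size_t n)
    {
        __m128 vs = _mm_set1_ps(s);             /* one broadcast, reused every iteration */
        for (size_t i = 0; i + 4 <= n; i += 4) {
            __m128 v = _mm_loadu_ps(src + i);
            _mm_storeu_ps(dst + i, _mm_mul_ps(v, vs));
        }
    }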

does gcc's __builtin_cpu_supports check for OS support?

Submitted by 一曲冷凌霜 on 2019-12-10 13:18:52
Question: The GCC compiler provides a set of builtins to test some processor features, like the availability of certain instruction sets. But according to this thread, certain CPU features may not be enabled by the OS. So the question is: does __builtin_cpu_supports also check whether the OS has enabled a given processor feature? Answer 1: No. I disabled AVX on my Skylake system by adding noxsave to the Linux kernel boot options. When I do cat /proc/cpuinfo, AVX (and AVX2) no longer appear, and when I …
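Since the answer says the builtin (at least in that GCC version) did not consult the OS, a manual check is needed for AVX. A sketch of the standard CPUID/XGETBV sequence (helper names hypothetical): the CPU must report AVX and OSXSAVE, and XCR0 must show the OS saving both SSE and AVX state.

    #include <cpuid.h>
    #include <stdint.h>

    static uint64_t read_xcr0(void)
    {
        uint32_t eax, edx;
        __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
        return ((uint64_t)edx << 32) | eax;
    }

    int os_supports_avx(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 0;
        if (!(ecx & (1u << 28)))   /* CPU has no AVX                 */
            return 0;
        if (!(ecx & (1u << 27)))   /* OS did not set OSXSAVE         */
            return 0;
        /* XCR0 bits 1 (SSE state) and 2 (AVX state) must both be set */
        return (read_xcr0() & 0x6) == 0x6;
    }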

Does browser JavaScript allow for SIMD or Vectorized operations?

Submitted by ぃ、小莉子 on 2019-12-10 12:44:41
Question: I want to write applications in JavaScript that require a large amount of numerical computation. However, I'm very confused about the state of efficient linear-algebra-like computation in client-side JavaScript. There seem to be many approaches, but no clear indication of their readiness. Most of them seem to have restrictions on the size of vectors and matrices allowed for computation. WebGL obviously allows for vector and matrix computations on the GPU, but I'm not clear on the limitations …

error: inlining failed in call to always_inline

Submitted by 南笙酒味 on 2019-12-10 12:18:14
Question: I am trying to implement and compile some code across several files, some of which contain SIMD calls. This code compiles on a server running basically the same OS as my machine, yet I can't compile it on my own machine. This is the error:

    make
    g++ main.cpp -march=native -o main -fopenmp
    In file included from /usr/lib/gcc/x86_64-linux-gnu/7/include/immintrin.h:53:0,
                     from tensor.hpp:9,
                     from main.cpp:4:
    /usr/lib/gcc/x86_64-linux-gnu/7/include/avx512vlintrin.h: In function ‘_ZN6TensorIdE8add_avx2ERKS0_._omp_fn.5’:
    /usr/lib …
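No answer is excerpted above. As a hedged note, this error usually means the translation unit uses intrinsics from an ISA extension (here AVX-512VL, judging by avx512vlintrin.h in the message) that the effective -m flags do not enable, and -march=native will not enable it on a machine without AVX-512. A hypothetical sketch of the per-function workaround, enabling the extension with a target attribute instead of (or in addition to) building the whole file with -mavx512f -mavx512vl:

    #include <immintrin.h>

    /* Illustrative only: a masked 256-bit add, which is an AVX-512VL intrinsic */
    __attribute__((target("avx512f,avx512vl")))
    void masked_add(double *dst, const double *a, const double *b, __mmask8 k)
    {
        __m256d va = _mm256_loadu_pd(a);
        __m256d vb = _mm256_loadu_pd(b);
        _mm256_storeu_pd(dst, _mm256_mask_add_pd(va, k, va, vb));
    }

Note that enabling the flags only makes the build succeed; the resulting binary still needs a CPU with AVX-512VL to run that code path.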

Floating-point number vs fixed-point number: speed on Intel i5 CPU

Submitted by £可爱£侵袭症+ on 2019-12-10 09:22:25
Question: I have a C/C++ program which involves intensive 32-bit floating-point matrix math computations such as addition, subtraction, multiplication, division, etc. Can I speed up my program by converting the 32-bit floating-point numbers into 16-bit fixed-point numbers? How much of a speed gain can I get? Currently I'm working on an Intel i5 CPU. I'm using OpenBLAS to perform the matrix calculations. How should I re-implement OpenBLAS functions such as cblas_dgemm to perform fixed-point calculations? I …
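To make the trade-off concrete, here is a tiny sketch of what 16-bit fixed-point arithmetic means in practice (Q15 format; helper names hypothetical and unrelated to OpenBLAS): every multiply needs a widening to 32 bits and a shift back, so any speedup comes mostly from halving memory traffic and fitting twice as many elements per SIMD register, not from cheaper individual operations.

    #include <stdint.h>

    typedef int16_t q15_t;                 /* value = raw / 32768, range [-1, 1) */

    static inline q15_t q15_from_float(float x) { return (q15_t)(x * 32768.0f); }
    static inline float q15_to_float(q15_t x)   { return x / 32768.0f; }

    static inline q15_t q15_mul(q15_t a, q15_t b)
    {
        return (q15_t)(((int32_t)a * b) >> 15);   /* widen, multiply, rescale */
    }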