multiplication | 易学教程

Perform integer division using multiplication [duplicate]

This question already has an answer here: Why does GCC use multiplication by a strange number in implementing integer division?

Looking at x86 assembly produced by a compiler, I noticed that (unsigned) integer divisions are sometimes implemented as integer multiplications. These optimizations seem to follow the form

    value / n  =>  (value * ((0xFFFFFFFF / n) + 1)) / 0x100000000

For example, performing a division by 9:

    12345678 / 9 = (12345678 * 0x1C71C71D) / 0x100000000

A division by 3 would use multiplication by 0x55555555 + 1, and so on. Exploiting the fact that the mul instruction
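
Below is a minimal C sketch of the formula from the question. It assumes 32-bit unsigned operands, and the helper names `div_via_mul` and `div9_exact` are hypothetical. The sketch reproduces the 12345678 / 9 example. However, the constant `0xFFFFFFFF / n + 1` is not exact for every 32-bit input: for n = 3 it returns 1431655765 for 4294967294, one more than the true quotient. For comparison, `div9_exact` uses the constant 0x38E38E39 (the ceiling of 2^33 / 9) with a total right shift of 33. That pair is exact for all 32-bit inputs, and it is the kind of constant/shift pair the linked GCC question is about.

```c
#include <stdint.h>

/* The question's formula: x / n ~= (x * (0xFFFFFFFF / n + 1)) >> 32.
   Hypothetical helper; n must be non-zero. NOT exact for every x. */
static uint32_t div_via_mul(uint32_t x, uint32_t n) {
    /* 64-bit so that n == 1 gives 2^32 instead of wrapping to 0 */
    uint64_t magic = (uint64_t)(0xFFFFFFFFu / n) + 1;
    return (uint32_t)(((uint64_t)x * magic) >> 32);
}

/* Exact unsigned division by 9 for every 32-bit x:
   magic = ceil(2^33 / 9) = 0x38E38E39, then shift right by 33. */
static uint32_t div9_exact(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0x38E38E39u) >> 33);
}
```

The extra shift buys one more bit of precision in the constant. That bit is what closes the gap the naive formula leaves for large inputs.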

SSE multiplication of 4 32-bit integers

How do I multiply four 32-bit integers by another four integers? I didn't find any instruction that can do it.

If you need signed 32x32-bit integer multiplication, the following example at software.intel.com looks like it should do what you want:

    static inline __m128i muly(const __m128i &a, const __m128i &b)
    {
        __m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0*/
        __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */
        return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE (0,0,2,0))); /* shuffle results to [63..0]
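
Here is a plain-C version of the same SSE2 technique (the quoted code uses C++ references), assuming an x86 target with SSE2. The steps work as follows:

- `_mm_mul_epu32` multiplies lanes 0 and 2 into 64-bit products.
- Shifting both inputs right by 4 bytes moves lanes 1 and 3 into those positions.
- The shuffles and the unpack collect the low 32 bits of each product.

The low 32 bits of a product are the same for signed and unsigned operands, so this works for both. On CPUs with SSE4.1, the single intrinsic `_mm_mullo_epi32` does the same job. The names `mullo_epi32_sse2` and `mul4_i32` are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    /* 64-bit products of lanes 0 and 2 */
    __m128i even = _mm_mul_epu32(a, b);
    /* Shift right by 4 bytes so lanes 1 and 3 land in positions 0 and 2 */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* Keep the low 32 bits of each product: [p0, p2, .., ..] and [p1, p3, .., ..] */
    __m128i even_lo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i odd_lo = _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0));
    /* Interleave into [p0, p1, p2, p3] */
    return _mm_unpacklo_epi32(even_lo, odd_lo);
}

/* out[i] = low 32 bits of a[i] * b[i], for four lanes */
static void mul4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mullo_epi32_sse2(va, vb));
}
```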

Fast multiplication/division by 2 for floats and doubles (C/C++)

In the software I'm writing, I'm doing millions of multiplications or divisions by 2 (or powers of 2) of my values. I would really like these values to be ints so that I could use the bit-shift operators:

    int a = 1;
    int b = a<<24;

However, I cannot; I have to stick with doubles. My question is: since there is a standard representation of doubles (sign, exponent, mantissa), is there a way to play with the exponent to get fast multiplications/divisions by a power of 2? I can even assume that the number of bits is going to be fixed (the software will work on machines that will always have 64 bits
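
This is a sketch that assumes IEEE-754 binary64 doubles. Multiplying by 2^k only changes the 11-bit exponent field, which starts at bit 52. So for a normal, finite, non-zero x whose result stays in the normal range, adding k to that field gives the exact result. The portable, range-safe standard function for this is `ldexp(x, k)` from `<math.h>`. The helper name `scale_pow2_bits` is hypothetical. On most current hardware a plain `x * 2.0` is already cheap, so it is worth benchmarking before preferring either over ordinary multiplication.

```c
#include <stdint.h>
#include <string.h>

/* Returns x * 2^k by adding k to the binary64 exponent field.
   Assumes x is normal, finite, non-zero and the result does not
   overflow or underflow; otherwise use ldexp() instead. */
static double scale_pow2_bits(double x, int k) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);        /* reinterpret without aliasing issues */
    bits += (uint64_t)(int64_t)k << 52;    /* exponent field starts at bit 52; negative k wraps to a subtraction */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```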

Python and Powers Math

I've been learning Python, but I'm a little confused. Online instructors tell me to use the operator ** as opposed to ^ when I'm trying to raise a number to a power. Example:

    print 8^3

gives an output of 11. But what I'm looking for (I'm told) is more akin to:

    print 8**3

which gives the correct answer of 512. But why? Can someone explain this to me? Why is it that 8^3 does not equal 512, since that is the correct answer? In what case would 11 (the result of 8^3) be the answer? I did try to search SO, but I'm only seeing information about getting a modulus when dividing.

Operator ^ is a bitwise operator,
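
C has the same pitfall: `^` is bitwise XOR there too, and C has no exponentiation operator at all. In binary, 8 is 1000 and 3 is 0011, so 8 ^ 3 is 1011, which is 11. The sketch below pairs that with a small exponentiation-by-squaring helper, the integer counterpart of Python's `**`. The name `ipow` is hypothetical.

```c
#include <stdint.h>

/* base^exp by repeated squaring; wraps modulo 2^64 on overflow */
static uint64_t ipow(uint64_t base, unsigned exp) {
    uint64_t result = 1;
    while (exp != 0) {
        if (exp & 1u)       /* this bit of exp is set: include the current power */
            result *= base;
        base *= base;       /* next power: base^2, base^4, base^8, ... */
        exp >>= 1;
    }
    return result;
}
```

With this, `8 ^ 3` evaluates to 11 (XOR) while `ipow(8, 3)` evaluates to 512.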

How to multiply a register by 37 using only 2 consecutive leal instructions in x86?

Say %edi contains x and I want to end up with 37*x using only 2 consecutive leal instructions. How would I go about this? For example, to get 45x you would do

    leal (%edi, %edi, 8), %edi
    leal (%edi, %edi, 4), %eax    (to be returned)

I cannot for the life of me figure out what numbers to put in place of the 8 and 4 so that the result (%eax) will be 37x.

At -O3, gcc will emit (Godbolt compiler explorer):

    int mul37(int a) { return a*37; }

    leal (%rdi,%rdi,8), %eax    # eax = a * 9
    leal (%rdi,%rax,4), %eax    # eax = a + 4*(a*9)
    ret

That's using 37 = 9*4 + 1, not destroying the original a value with the
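
The same decompositions can be written in C with shifts, one statement per `lea`. The C uses unsigned arithmetic so overflow wraps as the hardware does. The function names are hypothetical.

```c
#include <stdint.h>

/* 37x = x + 4*(9x): mirrors gcc's two-lea sequence above */
static uint32_t mul37_lea_style(uint32_t x) {
    uint32_t t = x + (x << 3);   /* leal (%rdi,%rdi,8): t = x + 8x = 9x */
    return x + (t << 2);         /* leal (%rdi,%rax,4): x + 4*t = 37x */
}

/* 45x = 9x + 4*(9x): the question's 45x example */
static uint32_t mul45_lea_style(uint32_t x) {
    uint32_t t = x + (x << 3);   /* 9x */
    return t + (t << 2);         /* 9x * 5 = 45x */
}
```

The difference is the base register of the second step. For 37x it is the original x, which is why gcc writes the first result to %eax and keeps %rdi intact. For 45x it is the intermediate 9x, so overwriting %edi is harmless.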

Efficient outer product in python

Outer product in Python seems quite slow when we have to deal with vectors with dimensions on the order of 10k. Could someone please give me some idea of how I could speed up this operation in Python? Code is as follows:

    In [8]: a.shape
    Out[8]: (128,)
    In [9]: b.shape
    Out[9]: (32000,)
    In [10]: %timeit np.outer(b,a)
    100 loops, best of 3: 15.4 ms per loop

Since I have to do this operation several times, my code is getting slower.

It doesn't really get any faster than that. These are your options:

numpy.outer

    >>> %timeit np.outer(a,b)
    100 loops, best of 3: 9.79 ms per loop

numpy.einsum

    >>> %timeit np.einsum('i
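
This explains why it does not get much faster. `np.outer(b, a)` does one multiplication per output element. With the shapes above, the output has 32000 × 128 = 4,096,000 float64 entries (about 33 MB), so writing the result dominates the cost. The C loop below sketches that computation; it is not NumPy's implementation, and the function name `outer` is hypothetical.

```c
#include <stddef.h>

/* out is n_b x n_a, row-major: out[i * n_a + j] = b[i] * a[j],
   the same layout as np.outer(b, a). One multiply per stored element,
   so for large outputs the time goes mostly into memory writes. */
static void outer(const double *b, size_t n_b,
                  const double *a, size_t n_a, double *out) {
    for (size_t i = 0; i < n_b; ++i) {
        const double bi = b[i];
        double *row = out + i * n_a;
        for (size_t j = 0; j < n_a; ++j)
            row[j] = bi * a[j];
    }
}
```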

Fastest way to multiply an array of int64_t?

I want to vectorize the multiplication of two memory-aligned arrays. I didn't find any way to do a 64x64-bit multiply in AVX/AVX2, so I just did loop unrolling and AVX2 loads/stores. Is there a faster way to do this? Note: I don't want to save the high half of each multiplication result.

    void multiply_vex(long *Gi_vec, long q, long *Gj_vec){
        int i;
        __m256i data_j, data_i;
        __uint64_t *ptr_J = (__uint64_t*)&data_j;
        __uint64_t *ptr_I = (__uint64_t*)&data_i;
        for (i=0; i<BASE_VEX_STOP; i+=4) {
            data_i = _mm256_load_si256((__m256i*)&Gi_vec[i]);
            data_j = _mm256_load_si256((__m256i*)&Gj_vec[i]);
            ptr_I[0] -=
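
AVX2 indeed has no 64 × 64-bit per-lane multiply. `_mm256_mullo_epi64` requires AVX-512DQ together with AVX-512VL. A common workaround builds the low 64 bits from 32 × 32 → 64 multiplies, which `_mm256_mul_epu32` performs in each 64-bit lane. The scalar sketch below shows only that per-lane arithmetic; the vectorized version is not shown. Whether it beats the scalar loop on a given CPU is something to measure. The name `mullo64_from_32` is hypothetical.

```c
#include <stdint.h>

/* Low 64 bits of a*b using only 32x32->64 multiplies:
   a*b mod 2^64 = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 32),
   because a_hi*b_hi is multiplied by 2^64 and vanishes. */
static uint64_t mullo64_from_32(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;                /* full 64-bit product */
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;  /* only its low 32 bits survive the shift */
    return lo_lo + (cross << 32);
}
```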

Matlab - Multiplying a matrix with every matrix of a 3d matrix

I have two MATLAB questions that seem closely related. I want to find the most efficient way (no loop?) to multiply an (A x A) matrix with every single matrix of a 3-D matrix (A x A x N). Also, I would like to take the trace of each of those products.

http://en.wikipedia.org/wiki/Matrix_multiplication#Frobenius_product

This is the inner Frobenius product. In the crappy code I have below, I'm using its secondary definition, which is more efficient. I want to multiply each element of a vector (N x
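
The efficiency comes from the identity trace(M X) = Σ_i Σ_j M(i,j) X(j,i). Each trace therefore needs only A² multiplications, instead of the A³ of forming the full product first. The C sketch below assumes column-major storage, as MATLAB uses, so element (i,j) of slice k of the A×A×N array sits at index i + j·A + k·A². The function name `traces_of_products` is hypothetical.

```c
#include <stddef.h>

/* out[k] = trace(M * X(:,:,k)) for an A x A matrix M and an A x A x N array X,
   both column-major: element (i, j) of slice k is X[i + j*A + k*A*A].
   Uses trace(M X) = sum_i sum_j M(i,j) * X(j,i): A*A multiplies per slice. */
static void traces_of_products(const double *M, const double *X,
                               size_t A, size_t N, double *out) {
    for (size_t k = 0; k < N; ++k) {
        const double *Xk = X + k * A * A;
        double sum = 0.0;
        for (size_t j = 0; j < A; ++j)
            for (size_t i = 0; i < A; ++i)
                sum += M[i + j * A] * Xk[j + i * A];  /* M(i,j) * Xk(j,i) */
        out[k] = sum;
    }
}
```

For example, with M = [1 2; 3 4] and slice X = [5 6; 7 8], the product is [19 22; 43 50], whose trace is 69. The sum above gives the same value: 5 + 14 + 18 + 32 = 69.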

Why is division more expensive than multiplication?

I am not really trying to optimize anything, but I heard this from programmers so often that I took it as truth. After all, they are supposed to know this stuff. But I wonder why division is actually slower than multiplication. Isn't division just a glorified subtraction, and multiplication a glorified addition? Mathematically, I don't see why going one way or the other should have very different computational costs. Can anyone please clarify the reason for this, so that I actually know it, instead of the answer I got from the other programmers I asked before, which was: "because".

CPU's ALU

Bitwise Multiply and Add in Java

I have methods that do both multiplication and addition, but I'm just not able to get my head around them. Both of them are from external websites, not my own:

    public static void bitwiseMultiply(int n1, int n2) {
        int a = n1, b = n2, result = 0;
        while (b != 0) {               // Iterate the loop till b == 0
            if ((b & 01) != 0) {       // Bitwise AND of b with 01: is the lowest bit set?
                result = result + a;   // Update the result with the current value of a
            }
            a <<= 1;                   // Left-shift the value in 'a' by 1
            b >>>= 1;                  // Unsigned right shift of 'b' by 1 (with >>=, a negative b never reaches 0)
        }
        System.out.println(result);
    }
    public static
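
Here is the same shift-and-add idea in C, using unsigned 32-bit operands so the right shift is logical. Each set bit i of b adds a·2^i to the result. a is shifted left in step, so on iteration i it always holds the original a times 2^i. With unsigned types the result is the product modulo 2^32, which equals the low 32 bits of the signed two's-complement product. The name `shift_add_multiply` is hypothetical.

```c
#include <stdint.h>

/* a * b (mod 2^32) using only shifts, AND, and addition */
static uint32_t shift_add_multiply(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)       /* current lowest bit of b is set: add this power-of-two multiple of a */
            result += a;
        a <<= 1;          /* next multiple: original a * 2^(i+1) */
        b >>= 1;          /* logical shift on unsigned: b reaches 0 within 32 steps */
    }
    return result;
}
```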