Performance

Why is my for loop code slower than an iterator?

Submitted by 我只是一个虾纸丫 on 2021-02-08 20:01:27

Question: I am trying to solve the LeetCode problem distribute-candies. It is easy: just find the minimum of the number of candy kinds and half the total number of candies. Here is my solution (runs in 48 ms): use std::collections::HashSet; pub fn distribute_candies(candies: Vec<i32>) -> i32 { let sister_candies = (candies.len() / 2) as i32; let mut kind = 0; let mut candies_kinds = HashSet::new(); for candy in candies.into_iter() { if candies_kinds.insert(candy) { kind += 1; if kind > sister_candies { return sister…
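The counting-with-early-exit idea above can be sketched much more compactly. The following is an illustrative Python analogue of the approach, not the original Rust submission: since the answer is simply capped at half the number of candies, the set size and the cap are all that matter.

```python
def distribute_candies(candies):
    # The sister receives exactly half the candies, so the number of
    # distinct kinds she can get is capped at len(candies) // 2.
    return min(len(set(candies)), len(candies) // 2)
```

The explicit loop in the question exists only to stop building the set early once the cap is reached, which is a constant-factor optimisation over this one-liner.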

Why is vectorized numpy code slower than for loops?

Submitted by 让人想犯罪 __ on 2021-02-08 19:54:46

Question: I have two NumPy arrays, X and Y, with shapes (n,d) and (m,d), respectively. Assume we want to compute the Euclidean distance between each row of X and each row of Y and store the results in an array Z with shape (n,m). I have two implementations. The first uses two for loops: for i in range(n): for j in range(m): Z[i,j] = np.sqrt(np.sum(np.square(X[i] - Y[j]))) The second uses only one loop, via vectorization: for i in range(n): Z[i] = np…
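For reference, the two approaches from the question can be written out in full, together with a fully vectorized third version that uses broadcasting and no Python-level loops at all. The function names here are illustrative, not from the original post:

```python
import numpy as np

def pairwise_dist_loops(X, Y):
    # Baseline: the double-loop version from the question.
    n, m = X.shape[0], Y.shape[0]
    Z = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            Z[i, j] = np.sqrt(np.sum(np.square(X[i] - Y[j])))
    return Z

def pairwise_dist_vectorized(X, Y):
    # Fully vectorized via broadcasting: (n,1,d) - (1,m,d) -> (n,m,d),
    # then reduce over the last axis.
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt(np.sum(diff * diff, axis=-1))
```

Note that the broadcasted version materialises an (n,m,d) intermediate, so for large inputs it can lose to loopier variants on memory traffic, which is one common answer to the question's surprise.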

Performance optimisations of x86-64 assembly - Alignment and branch prediction

Submitted by 故事扮演 on 2021-02-08 19:50:37

Question: I am currently coding highly optimised versions of some C99 standard-library string functions, such as strlen(), memset(), etc., in x86-64 assembly with SSE-2 instructions. So far I have managed to get excellent performance results, but I sometimes see strange behaviour when I try to optimise further. For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps, completely degrades overall performance. And there is absolutely…

Is there a SQL server performance counter for average execution time?

Submitted by 我们两清 on 2021-02-08 15:46:52

Question: I want to tune a production SQL Server instance. After making adjustments (such as changing the degree of parallelism) I want to know whether they helped or hurt query execution times. This seems like an obvious performance counter, but for the last half hour I have been searching Google and the counter list in perfmon, and I have not been able to find a SQL Server performance counter that gives the average execution time for all queries hitting the server. The SQL Server equivalent of the ASP.NET Request…

Are memory orderings: consume, acq_rel and seq_cst ever needed on Intel x86?

Submitted by 北战南征 on 2021-02-08 14:36:43

Question: C++11 specifies six memory orderings: typedef enum memory_order { memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_seq_cst } memory_order; (https://en.cppreference.com/w/cpp/atomic/memory_order), where the default is seq_cst. Performance gains can be had by relaxing the memory ordering of operations, but this depends on what protections the architecture provides. For example, Intel x86 has a strong memory model and…

overloaded array subscript [] operator slow

Submitted by 寵の児 on 2021-02-08 14:02:23

Question: I have written my own Array class in C++ and overloaded the array subscript [] operator: inline dtype &operator[](const size_t i) { return _data[i]; } inline dtype operator[](const size_t i) const { return _data[i]; } where _data is a pointer to the memory block containing the array. Profiling shows that this overloaded operator alone takes about 10% of the overall computation time (in a long Monte Carlo simulation, compiled with g++ at maximum optimisation). This seems…

Dictionary with tuple key slower than nested dictionary. Why?

Submitted by 我只是一个虾纸丫 on 2021-02-08 13:45:25

Question: I have tested the speed of retrieving, updating and removing values in a dictionary keyed by an (int, int, string) tuple versus the same thing with a nested dictionary: Dictionary<int, Dictionary<int, Dictionary<string, …>>>. My tests show the tuple dictionary to be much slower (58% slower for retrieving, 69% for updating and 200% for removing). I did not expect that. The nested dictionary needs to do more lookups, so why is the tuple dictionary so much slower? My test code: public static object TupleDic_RemoveValue(object[] param) { var…
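The two data layouts being compared can be sketched as follows. This is a hypothetical Python analogue of the C# benchmark, not the original test code; the names and sizes are illustrative. A flat dictionary hashes the whole composite key once, while the nested layout performs three smaller lookups:

```python
# Build the same (int, int, str) key space in both layouts.
keys = [(a, b, f"s{c}") for a in range(20) for b in range(20) for c in range(5)]

flat = {key: 0 for key in keys}          # one dict keyed by the full tuple
nested = {}                              # dict of dicts of dicts
for a, b, c in keys:
    nested.setdefault(a, {}).setdefault(b, {})[c] = 0

def lookup_flat(a, b, c):
    # One hash computed over the whole tuple (which itself hashes 3 fields).
    return flat[(a, b, c)]

def lookup_nested(a, b, c):
    # Three separate, smaller hash lookups.
    return nested[a][b][c]
```

The relative cost of "hash one composite key" versus "chase three small tables" depends heavily on the runtime's tuple-hashing and equality paths, which is the crux of the C# question; `timeit` can be used to compare the two lookups on a given interpreter.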

C++ signed and unsigned int vs long long speed

Submitted by 隐身守侯 on 2021-02-08 13:42:29

Question: Today I noticed that the speed of several simple bitwise and arithmetic operations differs significantly between int, unsigned, long long and unsigned long long on my 64-bit PC. In particular, the following loop is about twice as fast for unsigned as for long long, which I did not expect. int k = 15; int N = 30; int mask = (1 << k) - 1; while (!(mask & 1 << N)) { int lo = mask & ~(mask - 1); int lz = (mask + lo) & ~mask; mask |= lz; mask &= ~(lz - 1); mask |= (lz / lo / 2) - 1; } (full…
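The loop in the question is Gosper's hack: it steps from one mask with k set bits to the next larger one, enumerating every k-element subset of bit positions below N. A Python sketch of what it computes (function name is illustrative; Python integers are arbitrary-precision, so the int-versus-long-long distinction the question is about disappears here, but the bit manipulation is the same):

```python
def subsets_of_size_k(k, N):
    # Enumerate all bit masks with exactly k bits set and value < 2**N,
    # in increasing order, using Gosper's hack from the question.
    mask = (1 << k) - 1          # smallest k-bit mask: k low bits set
    masks = []
    while not (mask & (1 << N)):
        masks.append(mask)
        lo = mask & -mask                # lowest set bit (== mask & ~(mask - 1))
        lz = (mask + lo) & ~mask         # lowest zero above the low run of ones
        mask |= lz                       # set that zero bit...
        mask &= ~(lz - 1)                # ...clear everything below it...
        mask |= (lz // lo // 2) - 1      # ...and refill the leftover ones at the bottom
    return masks
```

On the division `lz / lo / 2`: both operands are powers of two, and replacing the divisions with shifts is one of the standard answers to why the signed/unsigned and 32/64-bit variants of this loop differ in speed, since division is where the compiler's codegen diverges most between the types.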
