Time-efficient way to replace numpy entries

Question

I have multiple arrays of the following kind:

import numpy as np

orig_arr = np.full(shape=(5,10), fill_value=1) #only an example, actual entries different

Every entry in the array above is a key into a dictionary whose values are arrays holding further information:

toy_dict = {0:np.arange(13, 23, dtype=float), 1:np.arange(23, 33, dtype=float)}

My task is to replace each entry in orig_arr with the corresponding array stored in the dict (here, toy_dict).

My current approach is naive, and I am looking for something faster:

goal_arr = np.full(shape=(orig_arr.shape[0], orig_arr.shape[1], 10), fill_value=2, dtype=float)

for row in range(orig_arr.shape[0]):
  for col in range(orig_arr.shape[1]):
    goal_arr[row,col] = toy_dict[orig_arr[row, col]] # actual replacement happens here

As you can see, I use an intermediate step: I first create goal_arr with the desired shape and then fill it.

My question: How can I add the third dimension faster, and which parts can I improve? Thanks in advance!

(Related questions I have looked at: "Error: setting an array element with a sequence", Numpy append: Automatically cast an array of the wrong dimension, Append 2D array to 3D array, extending third dimension)

Edit: After mathfux' good answer, I compared the speed of his proposed code against mine on larger arrays (more realistic for my use case):

Imports and parameters:

import numpy as np
import time

first_dim = 50
second_dim = 20
depth_dim = 300
upper_count = 5000

toy_dict = {k:np.random.random_sample(size = depth_dim) for k in range(upper_count)}

My original version, after parameterization:

start = time.time()

orig_arr = np.random.randint(0, upper_count, size=(first_dim, second_dim))
goal_arr = np.empty(shape=(orig_arr.shape[0], orig_arr.shape[1], depth_dim), dtype=float)


for row in range(orig_arr.shape[0]):
  for col in range(orig_arr.shape[1]):
    goal_arr[row,col] = toy_dict[orig_arr[row, col]]

end = time.time()
print(end-start)

Time: 0.008016824722290039

Now mathfux' kindly provided answer:


start = time.time()
orig_arr = np.random.randint(0, upper_count, size=(first_dim,second_dim))
goal_arr = np.empty(shape=(orig_arr.shape[0], orig_arr.shape[1], depth_dim), dtype=float)

a = np.array(list(toy_dict.values())) #do not know if it can be optimized
idx = np.indices(orig_arr.shape)
goal_arr[idx[0], idx[1]] = a[orig_arr[idx[0], idx[1]]]
end = time.time()
print(end-start)

Time: 0.015697956085205078

Interestingly, the advanced-indexing version is slower. I think this is due to the dict -> list -> array conversion, which takes time.

Nevertheless, thank you for your answers.

Edit 2:

I reran the second code block with the dict-to-array conversion moved out of the timed section (it now happens beforehand):

Time: 0.002306699752807617

This supports my hypothesis. Since toy_dict is created only once, the conversion cost is paid only once, and the proposed solution is faster. Thanks.
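For anyone who wants to reproduce the comparison, the one-off conversion and the lookup can be timed separately. The following is only a sketch under my own assumptions: it uses time.perf_counter instead of time.time, the variable names (goal_fancy, goal_loop, t0 ... t3) are mine, and the timings will differ from machine to machine.

```python
import time
import numpy as np

first_dim, second_dim, depth_dim, upper_count = 50, 20, 300, 5000
toy_dict = {k: np.random.random_sample(size=depth_dim) for k in range(upper_count)}
orig_arr = np.random.randint(0, upper_count, size=(first_dim, second_dim))

# One-off cost: convert the dict values into a 2D array (row k is toy_dict[k]).
t0 = time.perf_counter()
a = np.array(list(toy_dict.values()))
t1 = time.perf_counter()

# Advanced-indexing lookup, written as in the answer.
goal_fancy = np.empty(shape=(first_dim, second_dim, depth_dim), dtype=float)
idx = np.indices(orig_arr.shape)
goal_fancy[idx[0], idx[1]] = a[orig_arr[idx[0], idx[1]]]
t2 = time.perf_counter()

# Python-level double loop, as in the original version.
goal_loop = np.empty(shape=(first_dim, second_dim, depth_dim), dtype=float)
for row in range(first_dim):
    for col in range(second_dim):
        goal_loop[row, col] = toy_dict[orig_arr[row, col]]
t3 = time.perf_counter()

print("conversion:", t1 - t0)
print("advanced indexing:", t2 - t1)
print("double loop:", t3 - t2)
```

Both variants should produce identical arrays, so the comparison is only about speed.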

Answer 1:

You need to avoid Python-level iteration and any iterable that is not itself a numpy array. So you might want to store the values of the dictionary in a separate array and then use fancy indexing:

goal_arr = np.empty(shape=(orig_arr.shape[0], orig_arr.shape[1], 10), dtype=float)
a = np.array(list(toy_dict.values())) #do not know if it can be optimized
idx = np.indices(orig_arr.shape)
goal_arr[idx[0], idx[1]] = a[orig_arr[idx[0], idx[1]]]

As you can see, goal_arr still has to be created, but I've used np.empty instead of np.full because it is more efficient: it allocates the array without initializing its entries.
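As a possible simplification, indexing the stacked array with orig_arr directly should give the same result without np.indices and without a preallocated goal_arr. This is a minimal sketch, assuming the keys are exactly 0, 1, ..., n-1; the name lookup and the small orig_arr are mine, chosen for illustration.

```python
import numpy as np

toy_dict = {0: np.arange(13, 23, dtype=float), 1: np.arange(23, 33, dtype=float)}
orig_arr = np.array([[0, 1, 1], [1, 0, 0]])

# Stack the dict values into a 2D lookup table; row k holds toy_dict[k].
# Assumes the keys are exactly 0, 1, ..., n-1.
lookup = np.stack([toy_dict[k] for k in range(len(toy_dict))])

# Indexing the first axis with an integer array keeps the shape of orig_arr
# and appends the remaining axis of lookup.
goal_arr = lookup[orig_arr]
print(goal_arr.shape)  # (2, 3, 10)
```

The result is a new array, so nothing needs to be allocated beforehand.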

Remark: this approach works only if list(toy_dict.keys()) has the form [0, 1, 2, ...]. Otherwise you need to work out how to apply a map toy_dict.keys() -> [0, 1, ...] to orig_arr. I found this task quite difficult, so I am leaving it out of scope.
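For integer keys that are not of the form [0, 1, 2, ...], one possible way to build such a map is to sort the keys and use np.searchsorted. This is only a sketch under assumptions of mine: the keys are sortable integers, every entry of orig_arr really is a key of the dict, and the example dictionary and the names sorted_keys, lookup and pos are made up for illustration.

```python
import numpy as np

# Hypothetical dictionary with non-contiguous integer keys.
toy_dict = {
    10: np.arange(0, 3, dtype=float),
    42: np.arange(3, 6, dtype=float),
    7: np.arange(6, 9, dtype=float),
}
orig_arr = np.array([[42, 7], [10, 42]])

# Sort the keys once; row i of lookup holds the array for sorted_keys[i].
sorted_keys = sorted(toy_dict)
lookup = np.stack([toy_dict[k] for k in sorted_keys])

# Position of each entry of orig_arr within the sorted keys.
# Only valid if every entry of orig_arr is actually a key of toy_dict.
pos = np.searchsorted(np.array(sorted_keys), orig_arr)

goal_arr = lookup[pos]
print(goal_arr.shape)  # (2, 2, 3)
```

If orig_arr may contain values that are not keys, the positions would have to be validated first, for example by comparing np.array(sorted_keys)[pos] with orig_arr.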

Usage

import numpy as np

toy_dict = {k:np.random.randint(10, size = 10) for k in range(9)}
orig_arr = np.random.randint(0, 8, size=(2,3))

# orig_arr must exist before goal_arr can take its shape from it
goal_arr = np.empty(shape=(orig_arr.shape[0], orig_arr.shape[1], 10), dtype=float)
a = np.array(list(toy_dict.values())) #do not know if it can be optimized
idx = np.indices(orig_arr.shape)
goal_arr[idx[0], idx[1]] = a[orig_arr[idx[0], idx[1]]]

Sample run:

print('orig_arr:\n', orig_arr)
print('toy_dict:\n', toy_dict)
print('goal arr:\n', goal_arr)
---------------------------------
orig_arr:
 [[7 3 0]
 [1 3 2]]
toy_dict:
 {0: array([8, 7, 3, 4, 8, 8, 6, 6, 5, 2]), 1: array([7, 2, 4, 7, 5, 5, 6, 8, 6, 5]), 2: array([5, 3, 4, 7, 6, 8, 6, 4, 4, 7]), 3: array([9, 2, 5, 1, 1, 8, 1, 1, 7, 0]), 4: array([9, 6, 7, 2, 7, 2, 4, 4, 5, 8]), 5: array([4, 9, 5, 2, 8, 3, 9, 4, 7, 9]), 6: array([6, 0, 7, 8, 5, 4, 7, 8, 8, 2]), 7: array([6, 5, 9, 3, 6, 2, 0, 2, 3, 2]), 8: array([5, 3, 9, 3, 2, 3, 0, 8, 3, 5])}
goal arr:
 [[[6. 5. 9. 3. 6. 2. 0. 2. 3. 2.]
  [9. 2. 5. 1. 1. 8. 1. 1. 7. 0.]
  [8. 7. 3. 4. 8. 8. 6. 6. 5. 2.]]

 [[7. 2. 4. 7. 5. 5. 6. 8. 6. 5.]
  [9. 2. 5. 1. 1. 8. 1. 1. 7. 0.]
  [5. 3. 4. 7. 6. 8. 6. 4. 4. 7.]]]

You might also find this excellent tutorial about advanced indexing helpful.

Source: https://stackoverflow.com/questions/63774297/time-efficient-way-to-replace-numpy-entries

Tags

python, arrays, numpy, multidimensional-array