Question
I am using a list of lists with varying sizes. For example, alternativesList can contain 4 lists in one iteration and 7 lists in another.
What I am trying to do is capture every combination of words across the different lists.
Let's say that
import itertools

alternativesList = []
a = [1, 2, 3]
alternativesList.append(a)
b = ["a", "b", "c"]
alternativesList.append(b)
productList = itertools.product(*alternativesList)
will produce, once iterated (for example with list(productList)):
[(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'b'), (2, 'c'), (3, 'a'), (3, 'b'), (3, 'c')]
One problem here is that my productList can be so large that it causes memory problems, so I keep productList as an iterator object and iterate over it later.
What I want to know is: is there a way to create the same object with numpy that works faster than itertools?
Answer 1:
Generally speaking, optimization is a balancing act with memory on one side of the scale and runtime on the other: improving one usually (though not always) costs you on the other. Now, regarding your question:
Is there a way to create the same object with numpy which works faster than itertools?
There definitely is, but keep in mind that abstraction buys you flexibility, and that is what itertools.product gives you and NumPy doesn't: it accepts any number of arbitrary iterables and yields tuples lazily, whereas NumPy builds the whole result in memory. If scalability is not an important factor in your case, you can do this with NumPy without giving up anything. Here is one way, using the column_stack, repeat and tile functions (a and b must be NumPy arrays here, since the code uses their .size attribute):
In [5]: np.column_stack((np.repeat(a, b.size),np.tile(b, a.size)))
Out[5]:
array([['1', 'a'],
       ['1', 'b'],
       ['1', 'c'],
       ['2', 'a'],
       ['2', 'b'],
       ['2', 'c'],
       ['3', 'a'],
       ['3', 'b'],
       ['3', 'c']], dtype='<U21')
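Since the question mentions anywhere from 4 to 7 lists, the two-array call above may need generalizing. The following is only a sketch of one way to do that with the same repeat and tile functions; the helper name cartesian_product_repeat_tile is made up for illustration, and it assumes every input is a non-empty 1-D sequence. It uses numeric inputs so that all columns share one dtype.

```python
import itertools
import numpy as np

def cartesian_product_repeat_tile(*arrays):
    """Stack the Cartesian product of 1-D arrays as rows of a 2-D array.

    Hypothetical helper; assumes every input is non-empty.
    """
    arrays = [np.asarray(a) for a in arrays]
    sizes = [a.size for a in arrays]
    total = int(np.prod(sizes))
    columns = []
    inner = total
    for a, n in zip(arrays, sizes):
        inner //= n                   # consecutive repeats of each element
        outer = total // (n * inner)  # how often the repeated block is tiled
        columns.append(np.tile(np.repeat(a, inner), outer))
    return np.column_stack(columns)

x = np.array([1, 2, 3])
y = np.array([10, 20])
z = np.array([7, 8])
result = cartesian_product_repeat_tile(x, y, z)
expected = list(itertools.product(x.tolist(), y.tolist(), z.tolist()))
print(result.shape)  # (12, 3)
```

The rows come out in the same order as itertools.product, with the last input varying fastest. Unlike the iterator, the full array is materialized at once, so this only helps when the product fits in memory.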
There are still ways to make the resulting string array occupy less memory, by using narrower dtypes such as U2, U1, etc.:
In [10]: np.column_stack((np.repeat(a, b.size),np.tile(b, a.size))).astype('U1')
Out[10]:
array([['1', 'a'],
       ['1', 'b'],
       ['1', 'c'],
       ['2', 'a'],
       ['2', 'b'],
       ['2', 'c'],
       ['3', 'a'],
       ['3', 'b'],
       ['3', 'c']], dtype='<U1')
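To see how much the narrower dtype actually saves, one can compare the nbytes attribute of both arrays. This is a small illustrative check, not part of the original answer; the variable names wide, narrow and truncated are invented here, and the numbers are converted to strings explicitly before stacking.

```python
import numpy as np

a = np.array([1, 2, 3]).astype(str)  # numbers stored as strings
b = np.array(["a", "b", "c"])

wide = np.column_stack((np.repeat(a, b.size), np.tile(b, a.size)))
narrow = wide.astype("U1")           # one character per element

print(wide.dtype, wide.nbytes)
print(narrow.dtype, narrow.nbytes)

# Caveat: casting to a narrower string dtype truncates longer values
# without raising an error.
truncated = np.array(["10", "200"]).astype("U1")
print(truncated)
```

The saving comes purely from the smaller per-element size, so it is only safe when every value is known to fit in the chosen width.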
Answer 2:
You can avoid some of the problems that arise from numpy trying to find a catch-all dtype by explicitly specifying a compound dtype:
Code + some timings:
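import numpy as np
import itertools

def cartesian_product_mixed_type(*arrays):
    arrays = *map(np.asanyarray, arrays),
    # one named field per input, so every column keeps its own dtype
    dtype = np.dtype([(f'f{i}', a.dtype) for i, a in enumerate(arrays)])
    # one axis per input; the result is flattened at the end
    out = np.empty((*map(len, arrays),), dtype)
    # (slice(None), None, ..., None): used to append trailing length-1 axes
    idx = slice(None), *itertools.repeat(None, len(arrays) - 1)
    for i, a in enumerate(arrays):
        # place array i along axis i and let broadcasting fill the other axes
        out[f'f{i}'] = a[idx[:len(arrays) - i]]
    return out.ravel()
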
a = np.arange(4)
b = np.arange(*map(ord, ('A', 'D')), dtype=np.int32).view('U1')
c = np.arange(2.)
np.set_printoptions(threshold=10)
print(f'a={a}')
print(f'b={b}')
print(f'c={c}')
print('itertools')
print(list(itertools.product(a,b,c)))
print('numpy')
print(cartesian_product_mixed_type(a,b,c))
a = np.arange(100)
b = np.arange(*map(ord, ('A', 'z')), dtype=np.int32).view('U1')
c = np.arange(20.)
import timeit
# number=1000 runs per measurement, so the total in seconds equals milliseconds per run
kwds = dict(globals=globals(), number=1000)
print()
print(f'a={a}')
print(f'b={b}')
print(f'c={c}')
print(f"itertools: {timeit.timeit('list(itertools.product(a,b,c))', **kwds):7.4f} ms")
print(f"numpy: {timeit.timeit('cartesian_product_mixed_type(a,b,c)', **kwds):7.4f} ms")
a = np.arange(1000)
b = np.arange(1000, dtype=np.int32).view('U1')
print()
print(f'a={a}')
print(f'b={b}')
print(f"itertools: {timeit.timeit('list(itertools.product(a,b))', **kwds):7.4f} ms")
print(f"numpy: {timeit.timeit('cartesian_product_mixed_type(a,b)', **kwds):7.4f} ms")
Sample output:
a=[0 1 2 3]
b=['A' 'B' 'C']
c=[0. 1.]
itertools
[(0, 'A', 0.0), (0, 'A', 1.0), (0, 'B', 0.0), (0, 'B', 1.0), (0, 'C', 0.0), (0, 'C', 1.0), (1, 'A', 0.0), (1, 'A', 1.0), (1, 'B', 0.0), (1, 'B', 1.0), (1, 'C', 0.0), (1, 'C', 1.0), (2, 'A', 0.0), (2, 'A', 1.0), (2, 'B', 0.0), (2, 'B', 1.0), (2, 'C', 0.0), (2, 'C', 1.0), (3, 'A', 0.0), (3, 'A', 1.0), (3, 'B', 0.0), (3, 'B', 1.0), (3, 'C', 0.0), (3, 'C', 1.0)]
numpy
[(0, 'A', 0.) (0, 'A', 1.) (0, 'B', 0.) ... (3, 'B', 1.) (3, 'C', 0.)
(3, 'C', 1.)]
a=[ 0 1 2 ... 97 98 99]
b=['A' 'B' 'C' ... 'w' 'x' 'y']
c=[ 0. 1. 2. ... 17. 18. 19.]
itertools: 7.4339 ms
numpy: 1.5701 ms
a=[ 0 1 2 ... 997 998 999]
b=['' '\x01' '\x02' ... 'ϥ' 'Ϧ' 'ϧ']
itertools: 62.6357 ms
numpy: 8.0249 ms
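The function above still allocates the whole product at once, which does not address the memory concern in the question. One possible middle ground, sketched here as an assumption rather than as part of the original answer, is to generate the product in fixed-size chunks with np.unravel_index. The generator name product_chunks and its chunk_size parameter are hypothetical, and non-empty inputs are assumed.

```python
import itertools
import numpy as np

def product_chunks(arrays, chunk_size):
    """Yield the Cartesian product piece by piece.

    Hypothetical helper: each yielded item is a tuple of column arrays
    (one per input, each keeping its own dtype) covering at most
    chunk_size consecutive rows of the product.
    """
    arrays = [np.asarray(a) for a in arrays]
    shape = tuple(len(a) for a in arrays)
    total = int(np.prod(shape))
    for start in range(0, total, chunk_size):
        flat = np.arange(start, min(start + chunk_size, total))
        # C-order unravelling matches itertools.product: last input varies fastest
        idx = np.unravel_index(flat, shape)
        yield tuple(a[i] for a, i in zip(arrays, idx))

a = np.arange(4)
b = np.array(['A', 'B', 'C'])
c = np.arange(2.)

rows = []
for cols in product_chunks((a, b, c), chunk_size=5):
    rows.extend(zip(*(col.tolist() for col in cols)))
```

Only chunk_size rows exist in memory at any time, at the cost of some index arithmetic per chunk. Whether this beats plain itertools.product depends on the chunk size and on what is done with each chunk, so it is worth timing on real data.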
Source: https://stackoverflow.com/questions/49475586/python-alternative-to-itertools-product-with-numpy