definition
factorize: Map each unique object into a unique integer. Typically, the range of integers mapped to is from zero to n - 1, where n is
@AChampion's use of setdefault got me wondering whether defaultdict could be used for this problem. So cribbing freely from AC's answer:
In [189]: tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
In [190]: import collections
In [191]: import itertools
In [192]: cnt = itertools.count()
In [193]: dd = collections.defaultdict(lambda : next(cnt))
In [194]: [dd[t] for t in tups]
Out[194]: [0, 1, 2, 3, 4, 1, 2]
Timings in other SO questions show that defaultdict is somewhat slower than the direct use of setdefault. Still, the brevity of this approach is attractive.
In [196]: dd
Out[196]:
defaultdict(<function __main__.<lambda>>,
{(1, 2): 0, (3, 4): 2, ('a', 'b'): 1, (6, 'd'): 4, ('c', 5): 3})
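To reuse the defaultdict idea above without leaving a shared counter lying around in the session, it could be wrapped in a function. This is only a sketch: the name factorize_with_defaultdict is made up for illustration, and returning a plain dict is my own choice so that later lookups of unseen keys raise KeyError instead of silently assigning new ids.

```python
import collections
import itertools


def factorize_with_defaultdict(items):
    """Return (codes, mapping) with ids assigned in order of first appearance.

    Hypothetical helper, not part of any library.
    """
    counter = itertools.count()  # fresh counter per call, starts at 0
    ids = collections.defaultdict(lambda: next(counter))
    # Indexing a defaultdict with an unseen key calls the factory and stores the result.
    codes = [ids[item] for item in items]
    # Convert to a plain dict so later lookups cannot add new ids by accident.
    return codes, dict(ids)


tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
codes, mapping = factorize_with_defaultdict(tups)
print(codes)  # [0, 1, 2, 3, 4, 1, 2]
```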
I don't know about timings, but a simple approach would be using numpy.unique along the respective axes.
import numpy as np
tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
res = np.unique(tups, return_inverse=True, axis=0)
print(res)
which yields
(array([['1', '2'],
       ['3', '4'],
       ['6', 'd'],
       ['a', 'b'],
       ['c', '5']],
      dtype='|S11'), array([0, 3, 1, 4, 2, 3, 1], dtype=int64))
The array is automatically sorted, but that should not be a problem.
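If ids in order of first appearance are wanted instead of ids in sorted order, the inverse array can be renumbered afterwards. A minimal pure-Python sketch, assuming the codes have been converted to a list of ints; relabel_by_first_appearance is a name invented here, not a NumPy function.

```python
def relabel_by_first_appearance(codes):
    """Renumber integer codes so that new ids follow order of first appearance."""
    remap = {}
    # len(remap) is evaluated before setdefault inserts, so unseen codes
    # receive 0, 1, 2, ... in the order they are first met.
    return [remap.setdefault(code, len(remap)) for code in codes]


sorted_codes = [0, 3, 1, 4, 2, 3, 1]  # the inverse array shown above
print(relabel_by_first_appearance(sorted_codes))  # [0, 1, 2, 3, 4, 1, 2]
```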
I was going to give this answer:
pd.factorize([str(x) for x in tups])
However, after running some tests, it did not turn out to be the fastest of them all. Since I already did the work, I will show it here for comparison:
@AChampion
%timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
# 1000000 loops, best of 3: 1.66 µs per loop
@Divakar
%timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
# 10000 loops, best of 3: 58.1 µs per loop
@self
%timeit pd.factorize([str(x) for x in tups])
# 10000 loops, best of 3: 65.6 µs per loop
@root
%timeit pd.Series(tups).factorize()[0]
# 1000 loops, best of 3: 199 µs per loop
EDIT
For large data with 100K entries, we have:
tups = [(np.random.randint(0, 10), np.random.randint(0, 10)) for i in range(100000)]
@root
%timeit pd.Series(tups).factorize()[0]
# 100 loops, best of 3: 10.9 ms per loop
@AChampion
%timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
# 10 loops, best of 3: 16.9 ms per loop
@Divakar
%timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
# 10 loops, best of 3: 81 ms per loop
@self
%timeit pd.factorize([str(x) for x in tups])
# 10 loops, best of 3: 87.5 ms per loop
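For anyone wanting to rerun the pure-Python comparisons, here is a rough sketch using the standard timeit module. It assumes that d and c in the dict-based snippets were created once before timing, in which case every loop after the first only performs lookups; to avoid that, each variant below builds a fresh dict and counter on every call. The function names and the choice of number=1000 are mine, and the results will differ by machine and Python version.

```python
import itertools
import timeit
from collections import defaultdict


def with_setdefault(tups):
    # Fresh dict and counter on every call, so each timed run starts cold.
    d, c = {}, itertools.count()
    return [d[t] if t in d else d.setdefault(t, next(c)) for t in tups]


def with_defaultdict(tups):
    c = itertools.count()
    dd = defaultdict(lambda: next(c))
    return [dd[t] for t in tups]


def compare(tups, number=1000):
    """Return total seconds for `number` cold runs of each variant."""
    return {
        fn.__name__: timeit.timeit(lambda: fn(tups), number=number)
        for fn in (with_setdefault, with_defaultdict)
    }


tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
for name, seconds in compare(tups).items():
    print(name, seconds)
```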
Approach #1
Convert each tuple to a row of a 2D array, view each of those rows as one scalar using the views concept of NumPy ndarray and finally use np.unique(... return_inverse=True) to factorize -
np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
get_row_view is taken from here.
Sample run -
In [23]: tups
Out[23]: [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
In [24]: np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
Out[24]: array([0, 3, 1, 4, 2, 3, 1])
Approach #2
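import numpy as np

def argsort_unique(idx):
    # Original idea : https://stackoverflow.com/a/41242285/3293881
    # Inverts a permutation: sidx[idx[k]] = k
    n = idx.size
    sidx = np.empty(n,dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

def unique_return_inverse_tuples(tups):
    a = np.array(tups)
    sidx = np.lexsort(a.T)  # sort rows, last column is the primary key
    b = a[sidx]
    # True where a sorted row differs from the row before it
    mask0 = ~((b[1:,0] == b[:-1,0]) & (b[1:,1] == b[:-1,1]))
    ids = np.concatenate(([0], mask0 ))
    np.cumsum(ids, out=ids)  # running count of distinct rows
    return ids[argsort_unique(sidx)]  # back to input order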
Sample run -
In [69]: tups
Out[69]: [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
In [70]: unique_return_inverse_tuples(tups)
Out[70]: array([0, 3, 1, 2, 4, 3, 1])
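The sort-based idea in Approach #2 can also be followed step by step without NumPy. The sketch below is only meant to illustrate the logic, not to compete on speed. It assumes every element can be compared after conversion with str (mirroring how np.array casts the mixed tuples to strings, so 1 and '1' would collide), and factorize_by_sorting is a made-up name.

```python
def factorize_by_sorting(rows):
    """Assign ids by sorting rows, flagging changes, and scattering back."""
    # Cast everything to str so ints and strs can be ordered together.
    keyed = [tuple(str(x) for x in row) for row in rows]
    # Reverse each key so the last column is the primary sort key,
    # which is the ordering np.lexsort(a.T) uses.
    order = sorted(range(len(keyed)), key=lambda i: keyed[i][::-1])
    ids = [0] * len(rows)
    current = 0
    for pos, idx in enumerate(order):
        # Bump the id whenever the sorted row differs from its predecessor.
        if pos > 0 and keyed[idx] != keyed[order[pos - 1]]:
            current += 1
        ids[idx] = current  # write the id back at the original position
    return ids


tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
print(factorize_by_sorting(tups))  # [0, 3, 1, 2, 4, 3, 1]
```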
Initialize your list of tuples as a Series, then call factorize:
pd.Series(tups).factorize()[0]
[0 1 2 3 4 1 2]
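The [0] above picks the codes out of the (codes, uniques) pair that factorize returns. For readers without pandas, a rough pure-Python stand-in that produces both parts might look like the following; factorize_with_uniques is a hypothetical name, and it returns plain lists rather than arrays.

```python
def factorize_with_uniques(items):
    """Return (codes, uniques), with uniques in order of first appearance."""
    codes, uniques, seen = [], [], {}
    for item in items:
        if item not in seen:
            seen[item] = len(uniques)  # next free id
            uniques.append(item)
        codes.append(seen[item])
    return codes, uniques


tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
codes, uniques = factorize_with_uniques(tups)
print(codes)    # [0, 1, 2, 3, 4, 1, 2]
print(uniques)  # [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd')]
```

Indexing uniques with the codes gives back the original list, which is a handy sanity check.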
A simple way to do it is to use a dict to hold the tuples seen so far:
>>> d = {}
>>> [d.setdefault(tup, i) for i, tup in enumerate(tups)]
[0, 1, 2, 3, 4, 1, 2]
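With enumerate, a tuple that first appears after a duplicate gets its list position as its id, which leaves gaps. If you need to keep the numbers sequential then a slight change, starting from a fresh dict:
>>> from itertools import count
>>> d = {}
>>> c = count()
>>> [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
[0, 1, 2, 3, 4, 1, 2]
Or alternatively written (note that d.get(tup) returns 0 for the first tuple, which is falsy, so next(c) is consumed again whenever that tuple recurs and later ids can skip a value):
>>> [d.get(tup) or d.setdefault(tup, next(c)) for tup in tups]
[0, 1, 2, 3, 4, 1, 2]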
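The differences between these variants only show up on inputs other than the sample list. The sketch below wraps each one in a function (names invented here) and runs them on two assumed inputs: the sample list with a hypothetical extra tuple ('x', 'y') appended, and a short list in which the first tuple repeats.

```python
from itertools import count


def ids_by_position(tups):
    d = {}
    # A new tuple gets the list position where it was first seen.
    return [d.setdefault(t, i) for i, t in enumerate(tups)]


def ids_sequential(tups):
    d, c = {}, count()
    # The counter only advances when a tuple is genuinely new.
    return [d[t] if t in d else d.setdefault(t, next(c)) for t in tups]


def ids_with_or(tups):
    d, c = {}, count()
    # d.get(t) is 0 (falsy) for the first tuple, so next(c) runs again
    # each time that tuple recurs, even though its id stays 0.
    return [d.get(t) or d.setdefault(t, next(c)) for t in tups]


tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
extended = tups + [('x', 'y')]  # hypothetical extra tuple, not in the original data

print(ids_by_position(extended))  # [0, 1, 2, 3, 4, 1, 2, 7]
print(ids_sequential(extended))   # [0, 1, 2, 3, 4, 1, 2, 5]
print(ids_with_or([(1, 2), (1, 2), ('a', 'b')]))  # [0, 0, 2]
```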