How do I factorize a list of tuples?


definition
factorize: Map each unique object to a unique integer. Typically, the integers range from zero to n - 1, where n is the number of unique objects.
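As a concrete reading of this definition, here is a minimal pure-Python sketch that numbers objects in order of first appearance. The function name `factorize` and its `(codes, uniques)` return shape are illustrative choices for this sketch, not something the question prescribes.

```python
def factorize(items):
    """Map each distinct hashable item to an integer in 0..n-1.

    Codes are assigned in order of first appearance. The name and the
    (codes, uniques) return shape are assumptions made for this sketch.
    """
    code_of = {}   # item -> integer code
    uniques = []   # uniques[code] is the item that received that code
    codes = []
    for item in items:
        if item not in code_of:
            code_of[item] = len(uniques)
            uniques.append(item)
        codes.append(code_of[item])
    return codes, uniques


tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
codes, uniques = factorize(tups)
print(codes)    # [0, 1, 2, 3, 4, 1, 2]
print(uniques)  # [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd')]
```

Tuples work as dictionary keys here because they are hashable; a list of lists would need converting to tuples first.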


6 answers

既然无缘

2020-12-06 10:53

@AChampion's use of setdefault got me wondering whether defaultdict could be used for this problem. So cribbing freely from AC's answer:

    In [189]: tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
    In [190]: import collections
    In [191]: import itertools
    In [192]: cnt = itertools.count()
    In [193]: dd = collections.defaultdict(lambda : next(cnt))
    In [194]: [dd[t] for t in tups]
    Out[194]: [0, 1, 2, 3, 4, 1, 2]

Timings in other SO questions show that defaultdict is somewhat slower than using setdefault directly. Still, the brevity of this approach is attractive.

    In [196]: dd
    Out[196]: defaultdict(<function __main__.<lambda>>, {(1, 2): 0, (3, 4): 2, ('a', 'b'): 1, (6, 'd'): 4, ('c', 5): 3})
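The timing remark can be checked locally with a small harness along these lines. This is only a sketch: the wrapper names `with_defaultdict` and `with_setdefault` are made up for the comparison, and the numbers will vary with the machine and Python version, so no expected timings are shown.

```python
import collections
import itertools
import timeit

tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]


def with_defaultdict(items):
    # A missing key triggers the factory, which hands out the next integer.
    counter = itertools.count()
    codes = collections.defaultdict(lambda: next(counter))
    return [codes[item] for item in items]


def with_setdefault(items):
    # Only call next(counter) when the item has not been seen before.
    counter = itertools.count()
    codes = {}
    return [codes[item] if item in codes else codes.setdefault(item, next(counter))
            for item in items]


# Both variants should agree before their speed is compared.
assert with_defaultdict(tups) == with_setdefault(tups)

for func in (with_defaultdict, with_setdefault):
    seconds = timeit.timeit(lambda: func(tups), number=10_000)
    print(f"{func.__name__}: {seconds:.4f} s for 10,000 calls")
```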


暖寄归人

2020-12-06 10:55

I don't know about timings, but a simple approach would be using numpy.unique along the respective axes.

    import numpy as np

    tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
    res = np.unique(tups, return_inverse=True, axis=0)
    print(res)

which yields

    (array([['1', '2'],
            ['3', '4'],
            ['6', 'd'],
            ['a', 'b'],
            ['c', '5']], dtype='|S11'),
     array([0, 3, 1, 4, 2, 3, 1], dtype=int64))

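The unique rows come back sorted, and because the tuples mix integers and strings, every element is coerced to a string (the exact dtype shown depends on the Python and NumPy versions), but that should not be a problem.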
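For comparison, the same sorted numbering can be sketched without NumPy. This is not how np.unique is implemented; the helper names `factorize_sorted` and `as_strings` are hypothetical, and the string key is an assumption that imitates the string coercion seen in the output above so that mixed int/str tuples can be ordered at all.

```python
def factorize_sorted(items, key):
    """Return (uniques, inverse), with uniques sorted by key(item)."""
    uniques = sorted(set(items), key=key)
    position = {item: i for i, item in enumerate(uniques)}
    inverse = [position[item] for item in items]
    return uniques, inverse


def as_strings(tup):
    # Python 3 cannot compare int with str, so order by string form instead.
    return tuple(str(x) for x in tup)


tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
uniques, inverse = factorize_sorted(tups, key=as_strings)
print(uniques)  # [(1, 2), (3, 4), (6, 'd'), ('a', 'b'), ('c', 5)]
print(inverse)  # [0, 3, 1, 4, 2, 3, 1]
```

Unlike the NumPy version, this keeps the original element types, so `(1, 2)` and `('1', '2')` would remain two distinct entries.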


感情败类

2020-12-06 10:57

I was going to give this answer:

    pd.factorize([str(x) for x in tups])

However, after running some tests, it did not turn out to be the fastest of them all. Since I already did the work, I will show it here for comparison:

@AChampion

    %timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    # 1000000 loops, best of 3: 1.66 µs per loop

@Divakar

    %timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
    # 10000 loops, best of 3: 58.1 µs per loop

@self

    %timeit pd.factorize([str(x) for x in tups])
    # 10000 loops, best of 3: 65.6 µs per loop

@root

    %timeit pd.Series(tups).factorize()[0]
    # 1000 loops, best of 3: 199 µs per loop

EDIT

For large data with 100K entries, we have:

tups = [(np.random.randint(0, 10), np.random.randint(0, 10)) for i in range(100000)]

@root

    %timeit pd.Series(tups).factorize()[0]
    # 100 loops, best of 3: 10.9 ms per loop

@AChampion

    %timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    # 10 loops, best of 3: 16.9 ms per loop

@Divakar

    %timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
    # 10 loops, best of 3: 81 ms per loop

@self

    %timeit pd.factorize([str(x) for x in tups])
    # 10 loops, best of 3: 87.5 ms per loop


忘掉有多难

2020-12-06 10:58

Approach #1

Convert each tuple to a row of a 2D array, view each row as a single scalar using NumPy's ndarray views, and finally use np.unique(..., return_inverse=True) to factorize:

np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]

get_row_view is taken from here.

Sample run -

    In [23]: tups
    Out[23]: [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]

    In [24]: np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
    Out[24]: array([0, 3, 1, 4, 2, 3, 1])

Approach #2

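    import numpy as np

    def argsort_unique(idx):
        # Original idea : https://stackoverflow.com/a/41242285/3293881
        # Invert the permutation idx without a second argsort.
        n = idx.size
        sidx = np.empty(n, dtype=int)
        sidx[idx] = np.arange(n)
        return sidx

    def unique_return_inverse_tuples(tups):
        a = np.array(tups)
        sidx = np.lexsort(a.T)  # sort rows; the last column is the primary key
        b = a[sidx]
        # True wherever a sorted row differs from the row before it (2-column rows only)
        mask0 = ~((b[1:,0] == b[:-1,0]) & (b[1:,1] == b[:-1,1]))
        ids = np.concatenate(([0], mask0))
        np.cumsum(ids, out=ids)  # running count of boundaries gives the group id
        return ids[argsort_unique(sidx)]  # map ids back to the original order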

Sample run -

    In [69]: tups
    Out[69]: [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]

    In [70]: unique_return_inverse_tuples(tups)
    Out[70]: array([0, 3, 1, 2, 4, 3, 1])
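The idea behind Approach #2 (sort, mark where a sorted row differs from its predecessor, count the marks, then undo the sort) can be traced in plain Python. The sketch below is an illustration rather than a translation of the NumPy code: `factorize_by_sorting` and `last_column_first` are hypothetical names, and the key converts elements to strings and compares the last element first, on the assumption that this mirrors what np.array(tups) followed by np.lexsort does for this sample.

```python
def factorize_by_sorting(items, key):
    """Assign group ids by sorting, then scatter them back to input order."""
    # Indices of items in sorted order (the role played by lexsort above).
    order = sorted(range(len(items)), key=lambda i: key(items[i]))
    codes = [0] * len(items)
    current = -1
    previous = None
    for rank, i in enumerate(order):
        k = key(items[i])
        # A new group starts whenever the key differs from the previous one.
        if rank == 0 or k != previous:
            current += 1
            previous = k
        # Writing at position i undoes the sort (the role of argsort_unique).
        codes[i] = current
    return codes


def last_column_first(tup):
    # String form, last element first, to imitate lexsort on a string array.
    return tuple(str(x) for x in reversed(tup))


tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
print(factorize_by_sorting(tups, last_column_first))  # [0, 3, 1, 2, 4, 3, 1]
```

Because groups are detected by comparing keys, two tuples with the same string form would share an id, which is also what happens once NumPy has coerced everything to strings.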


醉梦人生

2020-12-06 10:59

Initialize your list of tuples as a Series, then call factorize:

    pd.Series(tups).factorize()[0]
    # [0 1 2 3 4 1 2]


一个人的身影

2020-12-06 11:00

A simple way to do it is to use a dict to hold previous visits:

    >>> d = {}
    >>> [d.setdefault(tup, i) for i, tup in enumerate(tups)]
    [0, 1, 2, 3, 4, 1, 2]

If you need to keep the numbers sequential (the enumerate version leaves gaps whenever a new tuple appears after a repeated one), a slight change does it:

    >>> from itertools import count
    >>> d = {}
    >>> c = count()
    >>> [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    [0, 1, 2, 3, 4, 1, 2]

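Or, written alternatively (note that a stored code of 0 is falsy, so this form calls next(c) again whenever the first tuple repeats and can skip a number):

    >>> d, c = {}, count()
    >>> [d.get(tup) or d.setdefault(tup, next(c)) for tup in tups]
    [0, 1, 2, 3, 4, 1, 2]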
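One caveat with the `d.get(tup) or ...` form is that the code 0 is falsy, so a repeat of the first tuple still consumes a value from the counter. The sketch below shows the effect on a small made-up list and compares it with a `setdefault(item, len(d))` variant, which is a swapped-in alternative and not part of the answer above; the function names are hypothetical.

```python
from itertools import count


def codes_with_or(items):
    # The form from the answer: falls through to next(c) when the code is 0.
    d, c = {}, count()
    return [d.get(item) or d.setdefault(item, next(c)) for item in items]


def codes_with_len(items):
    # len(d) is evaluated before insertion, so new items get 0, 1, 2, ...
    # and existing items simply return their stored code.
    d = {}
    return [d.setdefault(item, len(d)) for item in items]


# Made-up input in which the first tuple repeats before a new one appears.
sample = [(1, 2), ('a', 'b'), (1, 2), (3, 4)]
print(codes_with_or(sample))   # [0, 1, 0, 3]  (2 was consumed and skipped)
print(codes_with_len(sample))  # [0, 1, 0, 2]
```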
