How do I factorize a list of tuples?

春和景丽 2020-12-06 10:43

definition
factorize: Map each unique object to a unique integer. Typically, the integers range from zero to n - 1, where n is the number of unique objects.

6 answers
  • 2020-12-06 10:53

    @AChampion's use of setdefault got me wondering whether defaultdict could be used for this problem. So cribbing freely from AC's answer:

    In [189]: tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
    
    In [190]: import collections
    In [191]: import itertools
    In [192]: cnt = itertools.count()
    In [193]: dd = collections.defaultdict(lambda : next(cnt))
    
    In [194]: [dd[t] for t in tups]
    Out[194]: [0, 1, 2, 3, 4, 1, 2]
    

    Timings in other SO questions show that defaultdict is somewhat slower than calling setdefault directly. Still, the brevity of this approach is attractive.

    In [196]: dd
    Out[196]: 
    defaultdict(<function __main__.<lambda>>,
                {(1, 2): 0, (3, 4): 2, ('a', 'b'): 1, (6, 'd'): 4, ('c', 5): 3})
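The same idea can be packaged as a reusable function. This is a minimal sketch using only the standard library; the name `factorize_first_seen` is hypothetical, and it swaps the `itertools.count` counter for the dict's own length as the default factory, which is a variation on the answer above rather than what the answer itself does.

```python
from collections import defaultdict


def factorize_first_seen(items):
    """Map each distinct hashable item to 0, 1, 2, ... in order of first appearance."""
    labels = defaultdict()
    # The factory runs before the missing key is stored, so len(labels) at that
    # moment equals the number of distinct items seen so far.
    labels.default_factory = labels.__len__
    codes = [labels[item] for item in items]
    return codes, dict(labels)


tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
codes, mapping = factorize_first_seen(tups)
print(codes)  # [0, 1, 2, 3, 4, 1, 2]
```

Returning a plain `dict` copy keeps later lookups of unknown keys from silently adding new labels.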
    
  • 2020-12-06 10:55

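    I don't know about timings, but a simple approach would be to use numpy.unique with axis=0, so that each tuple is treated as one row.

    import numpy as np

    tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
    # np.array converts the mixed tuples to a 2-D array of strings
    res = np.unique(tups, return_inverse=True, axis=0)
    print(res)
    

    which yields (output from a Python 2 session; under Python 3 the string dtype is a Unicode one rather than '|S11')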

    (array([['1', '2'],
           ['3', '4'],
           ['6', 'd'],
           ['a', 'b'],
           ['c', '5']],
          dtype='|S11'), array([0, 3, 1, 4, 2, 3, 1], dtype=int64))
    

    The unique rows come back sorted, so the labels follow sort order rather than order of first appearance, but that should not be a problem.
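If labels in order of first appearance are wanted instead, the sorted labels can be remapped. The following is a sketch, assuming a NumPy version whose `np.unique` accepts `axis` and tuples of equal length; the function name `factorize_rows_first_seen` is hypothetical. As with the answer above, `np.array` turns mixed tuples into strings, so `(1, 2)` and `('1', '2')` would end up with the same label.

```python
import numpy as np


def factorize_rows_first_seen(rows):
    """Label equal rows with 0, 1, 2, ... in order of first appearance."""
    arr = np.array(rows)
    _, first_idx, inverse = np.unique(
        arr, axis=0, return_index=True, return_inverse=True
    )
    # Flatten defensively in case a NumPy version returns a non-1-D inverse
    # when axis is given.
    inverse = np.ravel(inverse)
    # Rank the sorted unique rows by where each one first occurs in the input.
    order = np.argsort(first_idx)
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[inverse]
```

Calling it on `[(3, 4), (1, 2), (3, 4)]` labels the first row 0 even though it sorts after `(1, 2)`.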

  • 2020-12-06 10:57

    I was going to give this answer:

    pd.factorize([str(x) for x in tups])
    

    However, after running some tests, it did not turn out to be the fastest. Since I had already done the work, I will show the timings here for comparison:

    @AChampion

    %timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    # 1000000 loops, best of 3: 1.66 µs per loop
    

    @Divakar

    %timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
    # 10000 loops, best of 3: 58.1 µs per loop
    

    @self

    %timeit pd.factorize([str(x) for x in tups])
    # 10000 loops, best of 3: 65.6 µs per loop
    

    @root

    %timeit pd.Series(tups).factorize()[0] 
    # 1000 loops, best of 3: 199 µs per loop
    

    EDIT

    For large data with 100K entries, we have:

    tups = [(np.random.randint(0, 10), np.random.randint(0, 10)) for i in range(100000)]
    

    @root

    %timeit pd.Series(tups).factorize()[0] 
    # 100 loops, best of 3: 10.9 ms per loop
    

    @AChampion

    %timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    # 10 loops, best of 3: 16.9 ms per loop
    

    @Divakar

    %timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
    # 10 loops, best of 3: 81 ms per loop
    

    @self

    %timeit pd.factorize([str(x) for x in tups])
    # 10 loops, best of 3: 87.5 ms per loop
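One caveat when reading the dict-based timings: assuming `d` and `c` were created once before the `%timeit` call, they persist between runs, so every run after the first mostly measures lookups in an already-filled dict. Below is a sketch of a harness that rebuilds the dict on each run, using the standard `timeit` module; the function name and the data size are illustrative assumptions, and no timings are claimed here.

```python
import random
import timeit
from itertools import count


def factorize_setdefault(tups):
    """Dict-based factorize that starts from an empty dict on every call."""
    d, c = {}, count()
    return [d[t] if t in d else d.setdefault(t, next(c)) for t in tups]


random.seed(0)
tups = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(10_000)]

# Each timed call rebuilds the mapping, so insertions are measured too.
runs = 20
elapsed = timeit.timeit(lambda: factorize_setdefault(tups), number=runs)
print(f"{elapsed / runs * 1e3:.3f} ms per call")
```

The same wrapper pattern can be applied to the other candidates so that all of them start from the same state.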
    
  • 2020-12-06 10:58

    Approach #1

    Convert each tuple to a row of a 2D array, view each row as a single scalar using NumPy's ndarray views, and finally use np.unique(..., return_inverse=True) to factorize -

    np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
    

    get_row_view is taken from here.

    Sample run -

    In [23]: tups
    Out[23]: [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
    
    In [24]: np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
    Out[24]: array([0, 3, 1, 4, 2, 3, 1])
    

    Approach #2

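    import numpy as np

    def argsort_unique(idx):
        # Original idea : https://stackoverflow.com/a/41242285/3293881
        # Invert the permutation idx without a second sort.
        n = idx.size
        sidx = np.empty(n,dtype=int)
        sidx[idx] = np.arange(n)
        return sidx
    
    def unique_return_inverse_tuples(tups):
        # Assumes every tuple has exactly two elements.
        a = np.array(tups)
        sidx = np.lexsort(a.T)  # row order that sorts by column 1, then column 0
        b = a[sidx]
        # True where a sorted row differs from the row before it
        mask0 = ~((b[1:,0] == b[:-1,0]) & (b[1:,1] == b[:-1,1]))
        ids = np.concatenate(([0], mask0  ))
        np.cumsum(ids, out=ids)  # running count of new rows = label in sorted order
        return ids[argsort_unique(sidx)]  # map the labels back to input order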
    

    Sample run -

    In [69]: tups
    Out[69]: [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
    
    In [70]: unique_return_inverse_tuples(tups)
    Out[70]: array([0, 3, 1, 2, 4, 3, 1])
    
  • 2020-12-06 10:59

    Initialize your list of tuples as a Series, then call factorize:

    pd.Series(tups).factorize()[0]
    
    [0 1 2 3 4 1 2]
    
  • 2020-12-06 11:00

    A simple way to do it is to use a dict to record the tuples already seen:

    >>> d = {}
    >>> [d.setdefault(tup, i) for i, tup in enumerate(tups)]
    [0, 1, 2, 3, 4, 1, 2]
    

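    The labels above are indices of first occurrence, so they can skip values. If you need consecutive numbers, start from an empty dict and take the labels from a counter:

    >>> from itertools import count
    >>> d, c = {}, count()
    >>> [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    [0, 1, 2, 3, 4, 1, 2]
    

    Or alternatively written with d.get. The result has to be compared against None: the label 0 is falsy, so a plain `d.get(tup) or ...` would consume a counter value every time the first tuple recurs and leave gaps in the labels.

    >>> d, c = {}, count()
    >>> [d[tup] if d.get(tup) is not None else d.setdefault(tup, next(c)) for tup in tups]
    [0, 1, 2, 3, 4, 1, 2]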
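A compact variant of the same idea, offered as a sketch rather than as part of the answer above: passing `len(seen)` as the default gives consecutive labels without a separate counter, because the argument is evaluated before the key is inserted. The function name `factorize` and its `(codes, uniques)` return shape are assumptions made for illustration.

```python
def factorize(items):
    """Return (codes, uniques): consecutive integer codes in first-seen order."""
    seen = {}
    # len(seen) is evaluated before setdefault stores the key, so a new item
    # gets the next unused integer and a known item keeps its existing code.
    codes = [seen.setdefault(item, len(seen)) for item in items]
    # Dicts preserve insertion order (Python 3.7+), so the keys line up with the codes.
    return codes, list(seen)


tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
codes, uniques = factorize(tups)
print(codes)  # [0, 1, 2, 3, 4, 1, 2]
```

The original items can be recovered with `[uniques[c] for c in codes]`.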
    