How do I factorize a list of tuples?

春和景丽 2020-12-06 10:43

Definition
factorize: map each unique object to a unique integer. Typically, the integers mapped to range from zero to n - 1, where n is

6 answers
  •  感情败类
    2020-12-06 10:57

    I was going to give this answer:

    pd.factorize([str(x) for x in tups])
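
    To make the return value concrete, here is a minimal sketch of the string-based approach. The sample `tups` list is made up for illustration, and wrapping the strings in a NumPy object array is my own choice to keep the input type explicit; the original call passes a plain list.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original post.
tups = [(1, 2), (3, 4), (1, 2), (5, 6)]

# Stringify each tuple so pandas sees flat scalar values.
keys = np.array([str(t) for t in tups], dtype=object)

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(keys)

print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['(1, 2)', '(3, 4)', '(5, 6)']
```

    Note that `uniques` holds strings such as `'(1, 2)'`, not the original tuples, so mapping codes back to tuples needs an extra step.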
    

    However, after running some tests, it did not turn out to be the fastest of the approaches. Since I had already done the work, I will show the timings here for comparison:

    @AChampion

    %timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    # 1000000 loops, best of 3: 1.66 µs per loop
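
    The one-liner above uses `d` and `c` without defining them. A minimal sketch, assuming `d` starts as an empty dict and `c` is an `itertools.count()` counter (the function name `factorize_tuples` is hypothetical):

```python
from itertools import count

def factorize_tuples(tups):
    """Map each distinct tuple to an integer, in order of first appearance."""
    d = {}        # tuple -> integer code
    c = count()   # yields 0, 1, 2, ...
    # Look up known tuples; assign the next counter value to new ones.
    codes = [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    return codes, d

codes, mapping = factorize_tuples([(1, 2), (3, 4), (1, 2), (5, 6)])
print(codes)     # [0, 1, 0, 2]
print(mapping)   # {(1, 2): 0, (3, 4): 1, (5, 6): 2}
```

    The `tup in d` check matters: calling `d.setdefault(tup, next(c))` unconditionally would advance the counter even for tuples already seen, leaving gaps in the codes.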
    

    @Divakar

    %timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
    # 10000 loops, best of 3: 58.1 µs per loop
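
    `get_row_view` is not defined in this answer. The sketch below is one plausible version, assuming it reinterprets each row of a 2-D array as a single opaque (void) element so that `np.unique` can compare whole rows; it is not necessarily the helper that was timed.

```python
import numpy as np

def get_row_view(a):
    """View each row of a 2-D array as one opaque void element (assumed helper)."""
    a = np.ascontiguousarray(a)
    # One void item spans all the bytes of a single row.
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

# Hypothetical sample data, not taken from the original post.
tups = [(3, 4), (1, 2), (3, 4), (5, 6)]

# Index [1] selects the inverse array: one integer code per input row.
codes = np.unique(get_row_view(np.array(tups)), return_inverse=True)[1]
print(codes)
```

    Unlike the dict and pandas approaches, the codes here index into the sorted unique rows, so they do not follow the order of first appearance.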
    

    @self

    %timeit pd.factorize([str(x) for x in tups])
    # 10000 loops, best of 3: 65.6 µs per loop
    

    @root

    %timeit pd.Series(tups).factorize()[0] 
    # 1000 loops, best of 3: 199 µs per loop
    

    EDIT

    For large data with 100K entries, we have:

    tups = [(np.random.randint(0, 10), np.random.randint(0, 10)) for i in range(100000)]
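
    `%timeit` is IPython magic. To reproduce a measurement like this in a plain script, a sketch with the standard `timeit` module could look as follows. The seed, the function name `factorize_dict`, and the repeat counts are my assumptions, and `random.randint(0, 9)` stands in for `np.random.randint(0, 10)`, which draws from the same range.

```python
import random
import timeit
from itertools import count

random.seed(0)  # fixed seed for repeatable data; not in the original
tups = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(100_000)]

def factorize_dict(tups):
    """Dict-based factorization that starts from a fresh dict on every call."""
    d, c = {}, count()
    return [d[t] if t in d else d.setdefault(t, next(c)) for t in tups]

codes = factorize_dict(tups)

# Best of 3 repeats, 10 calls each, reported per call.
best = min(timeit.repeat(lambda: factorize_dict(tups), number=10, repeat=3)) / 10
print(f"{best * 1e3:.2f} ms per call")
```

    Building the dict inside the function means every timed call does the full work. If `d` were shared across loops, every loop after the first would only perform lookups.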
    

    @root

    %timeit pd.Series(tups).factorize()[0] 
    # 100 loops, best of 3: 10.9 ms per loop
    

    @AChampion

    %timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    # 10 loops, best of 3: 16.9 ms per loop
    

    @Divakar

    %timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
    # 10 loops, best of 3: 81 ms per loop
    

    @self

    %timeit pd.factorize([str(x) for x in tups])
    # 10 loops, best of 3: 87.5 ms per loop
    
