Fastest way to create a pandas column conditionally

佛祖请我去吃肉 2020-12-30 03:13

In a pandas DataFrame, I want to create a new column conditionally based on the value of another column. In my application, the DataFrame typically has a few million rows, so I am looking for the fastest way to do it.

1 Answer
  • 2020-12-30 04:04

    I'd consider .map (#3) the idiomatic way to do this, but don't pass it the .get method. Pass the dictionary itself and you should see a pretty significant improvement.

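    import numpy as np
    import pandas as pd

    # The lower bound of randint was lost in the source; 1 is assumed here so that
    # labels run 1..3, matching the `label - 1` offset used further down.
    # lookup_dict is the label -> value dict from the question (not shown here).
    df = pd.DataFrame({'label': np.random.randint(1, 4, size=1000000, dtype='i8')})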
    
    %timeit df['output'] = df.label.map(lookup_dict.get)
    261 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit df['output'] = df.label.map(lookup_dict)
    69.6 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
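
To make the comparison above easier to reproduce, here is a minimal sketch showing that both calls produce the same values when every label has an entry in the dictionary. The contents of `lookup_dict` are hypothetical, since the question's actual dictionary is not reproduced here, and the frame is kept small on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table; the real one from the question is not shown here.
lookup_dict = {1: 100, 2: 200, 3: 300}

rng = np.random.default_rng(0)
df = pd.DataFrame({'label': rng.integers(1, 4, size=1000)})

# Passing the bound method: pandas calls a Python function for each element.
via_get = df['label'].map(lookup_dict.get)

# Passing the dict itself: pandas handles the lookup internally.
via_dict = df['label'].map(lookup_dict)

# Every label (1, 2 or 3) is a key, so the two results hold the same values.
same_values = bool((via_get == via_dict).all())
```

Timing both lines with `%timeit` on your own data is the only reliable way to confirm the speedup for your case.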
    

    If the number of conditions is small and the comparison is cheap (e.g. ints and your lookup table), direct comparison of the values (#4 and especially #5) is faster than .map, but this won't always be true, for instance if you had a set of strings.
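
Approaches #4 and #5 come from the question and are not reproduced here, so the following is only a sketch of what "direct comparison" can look like, not necessarily what those approaches did. It builds one boolean mask per key and picks the matching value with `np.select`; `lookup_dict` and the sample labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table and data, for illustration only.
lookup_dict = {1: 100, 2: 200, 3: 300}
df = pd.DataFrame({'label': [1, 3, 2, 2, 1]})

labels = df['label'].to_numpy()

# One vectorized comparison per key: cheap for a handful of integer keys,
# but the work grows with the number of keys.
conditions = [labels == key for key in lookup_dict]
choices = list(lookup_dict.values())

# Labels that match no key get the default value (-1 is an arbitrary choice).
df['output'] = np.select(conditions, choices, default=-1)
```

Each extra key adds another full pass over the column, which is why this only pays off when the number of conditions is small.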

    If your lookup labels really are contiguous integers, you can exploit this and do the lookup with a take, which should be about as fast as numba. I think this is basically as fast as this can go: you could write the equivalent in cython, but it won't be quicker.

    %%timeit
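    # Relies on lookup_dict being in ascending key order (dicts keep insertion order)
    lookup_arr = np.array(list(lookup_dict.values()))
    # Labels start at 1, so subtract 1 to get zero-based positions
    df['output'] = lookup_arr.take(df['label'] - 1)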
    8.68 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
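
The take-based lookup depends on the dictionary being in ascending key order and on the labels starting at 1. As a hedged variant, the sketch below places each value at the position given by its key, so the dictionary order no longer matters. It still assumes contiguous integer keys; `lookup_dict` and the sample labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table, deliberately out of order.
lookup_dict = {3: 300, 1: 100, 2: 200}
df = pd.DataFrame({'label': [1, 3, 2, 2, 1]})

keys = np.fromiter(lookup_dict.keys(), dtype=np.int64)
values = np.fromiter(lookup_dict.values(), dtype=np.int64)

# Shift keys so the smallest one maps to index 0, then store each value
# at the position of its key.
offset = keys.min()
lookup_arr = np.zeros(keys.max() - offset + 1, dtype=np.int64)
lookup_arr[keys - offset] = values

# Apply the same shift to the labels before taking.
df['output'] = lookup_arr.take(df['label'].to_numpy() - offset)
```

Labels outside the key range are not handled by this sketch, so validate the column first if that can happen in your data.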
    