First non-null value per row from a list of Pandas columns

难免孤独 2020-11-27 19:23

If I've got a DataFrame in pandas which looks something like:

    A   B   C
0   1 NaN   2
1 NaN   3 NaN
2 NaN   4   5
3 NaN NaN NaN

How ca

9 answers
  •  春和景丽
    2020-11-27 19:54

    I'm going to weigh in here, as I think this is a good deal faster than any of the proposed methods. The hard part is done in a vectorized way: np.isnan builds a boolean mask, and argmin along axis=1 gives the index of the first False value in each row of that mask, i.e. the first non-null column. It still relies on a Python loop to extract the values, but the lookup is very quick:

    import numpy as np

    def get_first_non_null(df):
        a = df.values
        # Column of the first non-NaN entry in each row (0 if the row is all NaN)
        col_index = np.isnan(a).argmin(axis=1)
        return [a[row, col] for row, col in enumerate(col_index)]
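
To see why argmin finds the right column, it can help to run the steps on the small frame from the question. This is a minimal sketch, assuming the frame holds only float columns (np.isnan raises a TypeError on object-dtype arrays). It uses df.to_numpy() in place of df.values, and the names mask and first are illustrative only.

```python
import numpy as np
import pandas as pd

# The DataFrame from the question.
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

a = df.to_numpy()
mask = np.isnan(a)  # True where a value is missing

# argmin returns the position of the first minimum. For booleans the
# minimum is False, so this is the first non-null column in each row.
# A row that is entirely True has no False, and argmin falls back to 0.
col_index = mask.argmin(axis=1)
print(col_index)  # [0 1 1 0]

# Plain Python loop that picks one value per row.
first = [a[row, col] for row, col in enumerate(col_index)]
print(first)
```

The last row is all NaN, so its column index is 0 and the value picked for it is NaN.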
    

    EDIT: Here's a fully vectorized solution which can be a good deal faster again, depending on the shape of the input. Updated benchmarking below.

    def get_first_non_null_vec(df):
        a = df.values
        n_rows, n_cols = a.shape
        col_index = np.isnan(a).argmin(axis=1)
        # Position of (row, col_index[row]) in the row-major flattened array
        flat_index = n_cols * np.arange(n_rows) + col_index
        return a.ravel()[flat_index]
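
The flat-index arithmetic can be checked by hand on the question's frame. The sketch below is only an illustration under the same float-only assumption. The second selection, which uses paired row and column index arrays, is an alternative spelling that is not part of the answer. It should pick the same elements.

```python
import numpy as np
import pandas as pd

# The DataFrame from the question.
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [np.nan, 3, 4, np.nan],
    "C": [2, np.nan, 5, np.nan],
})

a = df.to_numpy()
n_rows, n_cols = a.shape
col_index = np.isnan(a).argmin(axis=1)

# ravel() flattens in row-major order, so element (row, col) ends up at
# position row * n_cols + col of the flattened array.
flat_index = n_cols * np.arange(n_rows) + col_index
print(flat_index)  # [0 4 7 9]
via_ravel = a.ravel()[flat_index]

# Alternative (not from the answer): index with paired row/column arrays.
via_pairs = a[np.arange(n_rows), col_index]

# assert_array_equal treats NaN in matching positions as equal.
np.testing.assert_array_equal(via_ravel, via_pairs)
```

Both selections give 1, 3, 4 and NaN for the four rows; the NaN comes from the all-null last row.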
    

    If a row is completely null, argmin returns 0 for it, so the value picked is the row's first entry, which is itself null. Here's some benchmarking against unutbu's solution:

    df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
    #%timeit df.stack().groupby(level=0).first().reindex(df.index)
    %timeit get_first_non_null(df)
    %timeit get_first_non_null_vec(df)
    1 loops, best of 3: 220 ms per loop
    100 loops, best of 3: 16.2 ms per loop
    100 loops, best of 3: 12.6 ms per loop
    
    
    df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
    #%timeit df.stack().groupby(level=0).first().reindex(df.index)
    %timeit get_first_non_null(df)
    %timeit get_first_non_null_vec(df)
    1 loops, best of 3: 246 ms per loop
    10 loops, best of 3: 48.2 ms per loop
    100 loops, best of 3: 15.7 ms per loop
    
    
    df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
    %timeit df.stack().groupby(level=0).first().reindex(df.index)
    %timeit get_first_non_null(df)
    %timeit get_first_non_null_vec(df)
    1 loops, best of 3: 326 ms per loop
    1 loops, best of 3: 326 ms per loop
    10 loops, best of 3: 35.7 ms per loop
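
The %timeit lines above only work inside IPython. As a rough sketch of how the comparison could be repeated in a plain script, the block below uses the standard-library timeit module on a much smaller frame. The shape, seed and repeat counts are arbitrary assumptions, so the timings will not match the figures quoted above. It also checks that the three approaches return the same values before timing anything.

```python
import timeit

import numpy as np
import pandas as pd


def get_first_non_null(df):
    a = df.to_numpy()
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]


def get_first_non_null_vec(df):
    a = df.to_numpy()
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    return a.ravel()[n_cols * np.arange(n_rows) + col_index]


# Smaller than the frames benchmarked above so the script finishes quickly.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.choice([1, np.nan], size=(2000, 100), p=(0.01, 0.99)))

loop_result = np.array(get_first_non_null(df))
vec_result = get_first_non_null_vec(df)
stack_result = (
    df.stack().groupby(level=0).first().reindex(df.index).to_numpy()
)

# assert_array_equal treats NaN in matching positions as equal.
np.testing.assert_array_equal(loop_result, vec_result)
np.testing.assert_array_equal(vec_result, stack_result)

for fn in (get_first_non_null, get_first_non_null_vec):
    # Best of 3 repeats, 10 calls each, reported per call.
    seconds = min(timeit.repeat(lambda: fn(df), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Depending on the pandas version, df.stack() may emit a FutureWarning about its changing implementation; the result of the chained expression should be the same either way.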
    
