How to apply a function to two columns of Pandas dataframe

名媛妹妹 2020-11-22 06:17

Suppose I have a df with the columns 'ID', 'col_1' and 'col_2', and I define a function:

f = lambda x, y : my_function_expres

12 answers
  •  予麋鹿 (original poster)
    2020-11-22 06:24

    Returning a list from apply is risky: depending on the length of the list, the result may be a Series or a DataFrame, and in some cases an exception is raised. Let's walk through a simple example:

    import numpy as np
    import pandas as pd

    # 5 rows x 3 columns of random integers in [0, 5)
    df = pd.DataFrame(data=np.random.randint(0, 5, (5, 3)),
                      columns=['a', 'b', 'c'])
    df
       a  b  c
    0  4  0  0
    1  2  0  1
    2  2  2  2
    3  1  2  2
    4  3  0  0
    

    There are three possible outcomes when returning a list from apply:

    1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.

    df.apply(lambda x: list(range(2)), axis=1)  # returns a Series
    0    [0, 1]
    1    [0, 1]
    2    [0, 1]
    3    [0, 1]
    4    [0, 1]
    dtype: object
    

    2) If the length of the returned list equals the number of columns, then a DataFrame is returned and each column gets the corresponding value from the list.

    df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
       a  b  c
    0  0  1  2
    1  0  1  2
    2  0  1  2
    3  0  1  2
    4  0  1  2
    

    3) If the length of the returned list equals the number of columns for the first row, but at least one later row returns a list of a different length, a ValueError is raised.

    i = 0
    def f(x):
        global i
        if i == 0:
            i += 1
            return list(range(3))
        return list(range(4))
    
    df.apply(f, axis=1) 
    ValueError: Shape of passed values is (5, 4), indices imply (5, 3)
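
    The three outcomes above depend on the pandas version. In pandas 0.23 and later, apply accepts a result_type argument, and as far as I know current releases return a Series of lists by default, even when the list length matches the number of columns. Here is a minimal sketch of requesting the output shape explicitly instead of relying on the list length. The small frame and the lambdas are made up for illustration.

```python
import pandas as pd

# Hypothetical three-column frame, similar in shape to the example above.
df = pd.DataFrame({"a": [4, 2, 2], "b": [0, 0, 2], "c": [0, 1, 2]})

# result_type="expand": list-like results are turned into columns,
# labelled 0, 1, ... rather than with the original column names.
expanded = df.apply(lambda row: [row["a"] + row["b"], row["c"]],
                    axis=1, result_type="expand")

# result_type="reduce": keep one object per row (a Series of lists),
# even though each list has as many elements as there are columns.
reduced = df.apply(lambda row: list(range(3)), axis=1, result_type="reduce")

print(type(expanded).__name__, list(expanded.columns))
print(type(reduced).__name__, reduced.iloc[0])
```

    Being explicit this way keeps the return type independent of how many elements each row happens to produce.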
    

    Answering the problem without apply

    Using apply with axis=1 is very slow because the function is called once per row in Python. It is possible to get much better performance (especially on larger datasets) by iterating over the columns directly.

    Create larger dataframe

    df1 = df.sample(100000, replace=True).reset_index(drop=True)
    

    Timings

    # apply is slow with axis=1
    %timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
    2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # zip - similar to @Thomas
    %timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]  
    29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    @Thomas's answer

    %timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
    34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
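
    The timings above rely on mylist and get_sublist, which are not defined in this answer. The sketch below fills them in with assumed stand-ins (a short list of letters and an inclusive slice helper) just to show that the three approaches produce the same lists. The small frame is illustrative too, not the data that was timed.

```python
import pandas as pd

# Assumed stand-ins: neither name is defined in the answer itself.
mylist = ["a", "b", "c", "d", "e", "f"]

def get_sublist(start, end):
    """Return the inclusive slice mylist[start..end]."""
    return mylist[start:end + 1]

# Small illustrative frame with the column names used in the timings.
df = pd.DataFrame({"ID": ["1", "2", "3"],
                   "col_1": [0, 2, 3],
                   "col_2": [1, 4, 5]})

# Row-wise apply: one Python-level call per row, with a Series built each time.
via_apply = df.apply(lambda x: mylist[x["col_1"]:x["col_2"] + 1], axis=1).tolist()

# zip over the two columns: no per-row Series construction.
via_zip = [mylist[v1:v2 + 1] for v1, v2 in zip(df["col_1"], df["col_2"])]

# map with a two-argument function: pairs up the columns element by element.
via_map = list(map(get_sublist, df["col_1"], df["col_2"]))

print(via_zip)
print(via_apply == via_zip == via_map)
```

    To keep the result next to the inputs, it can be assigned back as a new column, for example df["sublist"] = via_zip.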
    
