Return multiple columns from pandas apply()

Asked by 灰色年华 on 2020-11-30 18:43

I have a pandas DataFrame, df_test. It contains a column 'size' which represents size in bytes. I've calculated KB, MB, and GB using the following code:
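
A minimal sketch of that kind of per-column calculation (the original code is not reproduced here; the locale-based formatting mirrors the answers below):

    import locale
    import pandas as pd

    locale.setlocale(locale.LC_ALL, '')  # enable locale-aware digit grouping

    df_test = pd.DataFrame([
        {'dir': '/Users/uname1', 'size': 994933},
        {'dir': '/Users/uname2', 'size': 109338711},
    ])

    # One apply() per unit; the answers below collapse this into a single pass.
    df_test['size_kb'] = df_test['size'].apply(
        lambda x: locale.format_string("%.1f", x / 1024.0, grouping=True) + ' KB')
    df_test['size_mb'] = df_test['size'].apply(
        lambda x: locale.format_string("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
    df_test['size_gb'] = df_test['size'].apply(
        lambda x: locale.format_string("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')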

9 Answers
  • 2020-11-30 19:22

    Some of the current replies work fine, but I want to offer another, maybe more "pandified" option. This works for me with pandas 0.23 (I am not sure whether it works in earlier versions):

    import locale
    import pandas as pd

    locale.setlocale(locale.LC_ALL, '')  # needed for grouping to use the user's locale

    df_test = pd.DataFrame([
        {'dir': '/Users/uname1', 'size': 994933},
        {'dir': '/Users/uname2', 'size': 109338711},
    ])

    def sizes(s):
        # locale.format() is deprecated/removed in newer Python; format_string() is the replacement
        a = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
        b = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
        c = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
        return a, b, c

    df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")
    

    Notice that the trick is in the result_type parameter of apply, which expands the result into a DataFrame that can be assigned directly to new or existing columns.
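
    If it helps to see what the expand step produces, the intermediate result can be inspected before it is assigned back (a small illustration reusing the sizes function above):

    expanded = df_test.apply(sizes, axis=1, result_type="expand")
    print(expanded)         # a DataFrame with integer columns 0, 1, 2
    print(expanded.shape)   # (2, 3): one column per element of the returned tuple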

  • 2020-11-30 19:25

    Really cool answers! Thanks Jesse and jaumebonet! Just a few observations regarding:

    • zip(* ...
    • ... result_type="expand")

    Although expand is somewhat more elegant (more "pandified"), zip is at least 2x faster. In the simple example below, I got a 4x speedup.

    import pandas as pd

    dat = [[i, 10 * i] for i in range(1000)]

    df = pd.DataFrame(dat, columns=["a", "b"])
    
    def add_and_sub(row):
        add = row["a"] + row["b"]
        sub = row["a"] - row["b"]
        return add, sub
    
    df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
    # versus
    df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))
    
  • 2020-11-30 19:26

    Performance varies significantly between the top answers, and Jesse & famaral42 have already discussed this, but it is worth sharing a fair comparison between them and elaborating on a subtle but important detail of Jesse's answer: the argument passed into the function also affects performance.

    (Python 3.7.4, Pandas 1.0.3)

    import pandas as pd
    import locale
    import timeit
    
    
    def create_new_df_test():
        df_test = pd.DataFrame([
          {'dir': '/Users/uname1', 'size': 994933},
          {'dir': '/Users/uname2', 'size': 109338711},
        ])
        return df_test
    
    
    def sizes_pass_series_return_series(series):
        series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
        series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
        series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
        return series
    
    
    def sizes_pass_series_return_tuple(series):
        a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
        b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
        c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
        return a, b, c
    
    
    def sizes_pass_value_return_tuple(value):
        a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
        b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
        c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
        return a, b, c
    

    Here are the results:

    # 1 - Accepted (Nels11 Answer) - (pass series, return series):
    9.82 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
    2.34 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # 3 - Tuples (pass series, return tuple then zip):
    1.36 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    # 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
    752 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    

    Notice how returning tuples is the fastest method, but what is passed in as an argument also affects performance. The difference in the code is subtle, but the performance improvement is significant.

    Test #4 (passing in a single value) is twice as fast as test #3 (passing in a series), even though the operation performed is ostensibly identical.
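
    The only code difference between the two is what apply hands to the function on each call (illustrated with the functions defined above):

    df_test = create_new_df_test()

    # Test 3: DataFrame.apply(axis=1) builds a Series for every row, so the
    # function receives the whole row and has to index into it.
    df_test.apply(sizes_pass_series_return_tuple, axis=1)

    # Test 4: Series.apply passes each scalar value directly, skipping the
    # per-row Series construction.
    df_test['size'].apply(sizes_pass_value_return_tuple)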

    But there's more...

    # 1a - Accepted (Nels11 Answer) - (pass series, return series, new columns exist):
    3.23 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
    2.31 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # 3a - Tuples (pass series, return tuple then zip, new columns exist):
    1.36 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    # 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
    694 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    

    In some cases (#1a and #4a), applying the function to a DataFrame in which the output columns already exist is faster than creating them from the function.

    Here is the code for running the tests:

    # Paste and run the following in an IPython console; the %timeit magics will not work from a plain .py file.
    print('\nAccepted Answer (pass series, return series, new columns dont exist):')
    df_test = create_new_df_test()
    %timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
    print('Accepted Answer (pass series, return series, new columns exist):')
    df_test = create_new_df_test()
    df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
    %timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
    
    print('\nPandafied (pass series, return tuple, new columns dont exist):')
    df_test = create_new_df_test()
    %timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
    print('Pandafied (pass series, return tuple, new columns exist):')
    df_test = create_new_df_test()
    df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
    %timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
    
    print('\nTuples (pass series, return tuple then zip, new columns dont exist):')
    df_test = create_new_df_test()
    %timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
    print('Tuples (pass series, return tuple then zip, new columns exist):')
    df_test = create_new_df_test()
    df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
    %timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
    
    print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
    df_test = create_new_df_test()
    %timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
    print('Tuples (pass value, return tuple then zip, new columns exist):')
    df_test = create_new_df_test()
    df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
    %timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
    
  • 2020-11-30 19:27

    Using apply and zip is about 3 times faster than the return-a-Series approach.

    import locale

    def sizes(s):
        # locale.format_string replaces the deprecated locale.format
        return locale.format_string("%.1f", s / 1024.0, grouping=True) + ' KB', \
            locale.format_string("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB', \
            locale.format_string("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'

    df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))
    

    Test results:

    Separate df.apply(): 
    
        100 loops, best of 3: 1.43 ms per loop
    
    Return Series: 
    
        100 loops, best of 3: 2.61 ms per loop
    
    Return tuple:
    
        1000 loops, best of 3: 819 µs per loop
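
    For reference, the other two variants being compared look roughly like this (a sketch reusing the sizes function above, not the exact code that produced these numbers; assumes import pandas as pd):

    # "Separate df.apply()": one apply() per output column (three passes over the data)
    df_test['size_kb'] = df_test['size'].apply(
        lambda x: locale.format_string("%.1f", x / 1024.0, grouping=True) + ' KB')
    # ...and likewise for 'size_mb' and 'size_gb'

    # "Return Series": a single apply() whose function returns a pd.Series,
    # which pandas expands into the three columns in one pass
    df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(
        lambda s: pd.Series(sizes(s['size']), index=['size_kb', 'size_mb', 'size_gb']),
        axis=1)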
    
  • 2020-11-30 19:29

    This gives a new DataFrame with two columns taken from the original one.

    import pandas as pd

    df = ...
    df_with_two_columns = df.apply(
        lambda row: pd.Series([row['column_1'], row['column_2']], index=['column_1', 'column_2']),
        axis=1)
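
    A quick way to try it (the column names column_1/column_2 and the sample data are just placeholders):

    df = pd.DataFrame({'column_1': [1, 2], 'column_2': ['x', 'y'], 'other': [0.1, 0.2]})
    print(df.apply(lambda row: pd.Series([row['column_1'], row['column_2']],
                                         index=['column_1', 'column_2']), axis=1))
    # only column_1 and column_2 remain in the result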
    
  • 2020-11-30 19:30

    Just another readable way. This code adds three new columns and their values by returning a Series, without passing extra parameters to the apply function.

    import locale
    import pandas as pd

    def sizes(s):
        # format each size and return all three columns as a labelled Series
        val_kb = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
        val_mb = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
        val_gb = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
        return pd.Series([val_kb, val_mb, val_gb], index=['size_kb', 'size_mb', 'size_gb'])

    df[['size_kb', 'size_mb', 'size_gb']] = df.apply(lambda x: sizes(x), axis=1)
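
    Since sizes already takes the whole row, the lambda wrapper is optional here; df.apply(sizes, axis=1) produces the same result.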
    

    A general example from: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

    df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
    
    #    foo  bar
    # 0    1    2
    # 1    1    2
    # 2    1    2
    