How to avoid excessive lambda functions in pandas DataFrame assign and apply method chains

Question

I am trying to translate a pipeline of manipulations on a dataframe in R over to its Python equivalent. A basic example of the pipeline is as follows, incorporating a few mutate and filter calls:

library(tidyverse)

calc_circle_area <- function(diam) pi / 4 * diam^2
calc_cylinder_vol <- function(area, length) area * length

raw_data <- tibble(cylinder_name=c('a', 'b', 'c'), length=c(3, 5, 9), diam=c(1, 2, 4))

new_table <- raw_data %>% 
  mutate(area = calc_circle_area(diam)) %>% 
  mutate(vol = calc_cylinder_vol(area, length)) %>% 
  mutate(is_small_vol = vol < 100) %>% 
  filter(is_small_vol)

I can replicate this in pandas without too much trouble but find that it involves some nested lambda calls when using assign to do an apply (first where the dataframe caller is an argument, and subsequently with dataframe rows as the argument). This tends to obscure the meaning of the assign call, where I would like to specify something more to the point (like the R version) if at all possible.

import pandas as pd
import math

calc_circle_area = lambda diam: math.pi / 4 * diam**2
calc_cylinder_vol = lambda area, length: area * length

raw_data = pd.DataFrame({'cylinder_name': ['a', 'b', 'c'], 'length': [3, 5, 9], 'diam': [1, 2, 4]})

new_table = (
    raw_data
        .assign(area=lambda df: df.diam.apply(calc_circle_area))
        .assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))
        .assign(is_small_vol=lambda df: df.vol < 100)
        .loc[lambda df: df.is_small_vol]
)

I am aware that the .assign(area=lambda df: df.diam.apply(calc_circle_area)) could be written as .assign(area=raw_data.diam.apply(calc_circle_area)) but only because the diam column already exists in the original dataframe, which may not always be the case.

I also realize that the calc_... functions here are vectorizable, meaning I could also do things like

.assign(area=lambda df: calc_circle_area(df.diam))
.assign(vol=lambda df: calc_cylinder_vol(df.area, df.length))

but again, since most functions aren't vectorizable, this wouldn't work in most cases.

TL;DR I am wondering if there is a cleaner way to "mutate" columns on a dataframe that doesn't involve double-nesting lambda statements, like in something like:

.assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))

Are there best practices for this type of application or is this the best one can do within the context of method chaining?

Answer 1:

The best practice is to vectorize operations.

The reason is performance: apply calls the Python function once per row (or once per element on a Series), which is very slow compared with a single operation over whole columns. You are already taking advantage of vectorization in the R code, and you should continue to do so in Python. Once you write with this in mind, you will find that most of the functions you need actually are vectorizable.
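
As one illustration of that last point, a function that looks scalar-only because it branches on a value can often be rewritten with NumPy. The sketch below is an assumption for illustration and not part of the question's pipeline: size_label_scalar and size_label_vec are hypothetical names, and the threshold of 100 is borrowed from the question.

```python
import numpy as np
import pandas as pd

def size_label_scalar(vol):
    # Plain Python branch: fine for one number, but raises if given a whole Series
    return 'small' if vol < 100 else 'large'

def size_label_vec(vol):
    # Same rule via np.where, which evaluates the condition for the whole column at once
    return np.where(vol < 100, 'small', 'large')

df = pd.DataFrame({'vol': [2.5, 15.7, 113.1]})

# Row-wise version: one Python call per element
row_wise = df.vol.apply(size_label_scalar)

# Vectorized version: only the outer lambda over the DataFrame remains
vectorized = df.assign(label=lambda d: size_label_vec(d.vol)).label

assert list(row_wise) == list(vectorized)
```

Both versions produce the same labels; only the second avoids the inner lambda and the per-row call.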

That will get rid of your inner lambdas. For the outer lambdas over the df, I think what you have is the cleanest pattern. The alternative is to repeatedly reassign to the raw_data variable, or to some other intermediate variable(s), but this doesn't fit the method chaining style you are asking for.
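
One small tidy-up that stays within the chaining style: assign accepts several keyword arguments in a single call, and in pandas 0.23 or later (on Python 3.6+) a later keyword can refer to a column created by an earlier one. A sketch under that assumption, reusing the vectorized functions from the question:

```python
import math
import pandas as pd

def calc_circle_area(diam):
    return math.pi / 4 * diam**2

def calc_cylinder_vol(area, length):
    return area * length

raw_data = pd.DataFrame({'cylinder_name': ['a', 'b', 'c'], 'length': [3, 5, 9], 'diam': [1, 2, 4]})

new_table = (
    raw_data
        .assign(
            # Keywords are evaluated in order, so vol can use the new area column
            area=lambda df: calc_circle_area(df.diam),
            vol=lambda df: calc_cylinder_vol(df.area, df.length),
            is_small_vol=lambda df: df.vol < 100,
        )
        .loc[lambda df: df.is_small_vol]
)
```

The outer lambdas are still there, but the repeated .assign calls collapse into one.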

There are also Python packages like dfply that aim to mimic the dplyr feel in Python. These do not receive the same level of support as core pandas will, so keep that in mind if you want to go this route.

Or, if you want to just save a bit of typing, and all the functions will be only over columns, you can create a glue function that unpacks the columns for you and passes them along.

def df_apply(col_fn, *col_names):
    def inner_fn(df):
        cols = [df[col] for col in col_names]
        return col_fn(*cols)
    return inner_fn

Then usage ends up looking something like this:

new_table = (
    raw_data
        .assign(area=df_apply(calc_circle_area, 'diam'))
        .assign(vol=df_apply(calc_cylinder_vol, 'area', 'length'))
        .assign(is_small_vol=lambda df: df.vol < 100)
        .loc[lambda df: df.is_small_vol]
)

It is also possible to write this without taking advantage of vectorization, in case that does come up.

def df_apply_unvec(fn, *col_names):
    def inner_fn(df):
        def row_fn(row):
            vals = [row[col] for col in col_names]
            return fn(*vals)
        return df.apply(row_fn, axis=1)
    return inner_fn

I used named functions for extra clarity. But it can be condensed with lambdas into something that looks much like your original format, just generic.
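
A sketch of what that condensed form might look like, together with a usage example. The name df_apply_unvec_lambda is used here only to distinguish it from the named version, and describe_cylinder is a hypothetical scalar-only function made up for illustration.

```python
import pandas as pd

# Lambda-only equivalent of df_apply_unvec: fn is called once per row
df_apply_unvec_lambda = lambda fn, *col_names: (
    lambda df: df.apply(lambda row: fn(*[row[col] for col in col_names]), axis=1)
)

def describe_cylinder(name, length):
    # Hypothetical non-vectorizable function: branches on a single scalar value
    return f"{name}:{'long' if length > 4 else 'short'}"

raw_data = pd.DataFrame({'cylinder_name': ['a', 'b', 'c'], 'length': [3, 5, 9], 'diam': [1, 2, 4]})

labelled = raw_data.assign(
    label=df_apply_unvec_lambda(describe_cylinder, 'cylinder_name', 'length')
)
```

The nesting has not gone away; it has only moved into one generic helper, so each assign call states just the function and the columns it needs.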

Answer 2:

As @mcskinner has pointed out, vectorized operations are much better and faster. If, however, your operation cannot be vectorized and you still want to apply a function, you could use the pipe method, which should allow for cleaner method chaining:

import math

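def area(df):
    # assign returns a new DataFrame, so raw_data is not modified in place
    return df.assign(area=math.pi / 4 * df['diam']**2)

def vol(df):
    # Relies on the area column added by the previous step in the chain
    return df.assign(vol=df['area'] * df['length'])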

new_table = (raw_data
             .pipe(area)
             .pipe(vol)
             .assign(is_small_vol = lambda df: df.vol < 100)
             .loc[lambda df: df.is_small_vol]
             )

new_table

  cylinder_name  length  diam      area        vol  is_small_vol
0             a       3     1  0.785398   2.356194          True
1             b       5     2  3.141593  15.707963          True
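
The example above happens to be vectorized. For a function that only works on scalars, pipe can still keep the chain flat, because it forwards extra positional and keyword arguments to the piped function. A sketch, where add_rowwise is a hypothetical helper (not a pandas API) and math.pow is used only to make the area function scalar-only:

```python
import math
import pandas as pd

def add_rowwise(df, new_col, fn, *col_names):
    # Hypothetical helper: call fn once per row with the named columns as scalars
    values = df.apply(lambda row: fn(*[row[c] for c in col_names]), axis=1)
    # assign returns a new DataFrame, so the caller's frame is left untouched
    return df.assign(**{new_col: values})

def calc_circle_area(diam):
    # math.pow only accepts scalars, so this cannot take a whole column
    return math.pi / 4 * math.pow(diam, 2)

def calc_cylinder_vol(area, length):
    return area * length

raw_data = pd.DataFrame({'cylinder_name': ['a', 'b', 'c'], 'length': [3, 5, 9], 'diam': [1, 2, 4]})

new_table = (
    raw_data
        .pipe(add_rowwise, 'area', calc_circle_area, 'diam')
        .pipe(add_rowwise, 'vol', calc_cylinder_vol, 'area', 'length')
        .loc[lambda df: df.vol < 100]
)
```

This is still a row-by-row apply underneath, so it carries the same performance cost; it only removes the nested lambdas from the chain itself.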

Source: https://stackoverflow.com/questions/61243071/how-to-avoid-excessive-lambda-functions-in-pandas-dataframe-assign-and-apply-met

Tags

python

python-3.x

pandas

tidyverse