How to dynamically chain when conditions in Pyspark?

Submitted by 两盒软妹~` on 2020-12-29 08:42:00

Question


Context

A dataframe should get a category column whose value is derived from a set of fixed rules. The set of rules is becoming quite large.

Question

Is there a way to use a list of tuples (see the example below) to dynamically chain the when conditions and achieve the same result as the hard-coded solution at the bottom?

# Potential list of rule definitions
category_rules = [
    ('A', 8, 'small'),
    ('A', 30, 'large'),
    ('B', 5, 'small'),
    # format: (group, size upper bound) --> category
    # and so on ...
]

Example

Here is a toy example for reproducibility. A dataframe consisting of groups, ids and sizes should get a category column whose value depends on the group and size columns. The list of rules is shown in the section above.

Input data
data = [('A', '45345', 5), ('C', '55345', 5), ('A', '35345', 10), ('B', '65345', 4)]
df = spark.createDataFrame(data, ['group', 'id', 'size'])
+-----+-----+-----+
|group|   id| size|
+-----+-----+-----+
|    A|45345|    5|
|    C|55345|    5|
|    A|35345|   10|
|    B|65345|    4|
+-----+-----+-----+
Hard-coded solution
df = df.withColumn(
    'category',
    F.when(
        (F.col('group') == 'A')
        & (F.col('size') < 8),
        F.lit('small')
    ).when(
        (F.col('group') == 'A')
        & (F.col('size') < 30),
        F.lit('large')
    ).when(
        (F.col('group') == 'B')
        & (F.col('size') < 5),
        F.lit('small')
    ).otherwise(
        F.lit('unknown')
    )
)
+-----+-----+----+--------+
|group|   id|size|category|
+-----+-----+----+--------+
|    A|45345|   5|   small|
|    C|55345|   5| unknown|
|    A|35345|  10|   large|
|    B|65345|   4|   small|
+-----+-----+----+--------+

[Edit 1] Added more complex conditions to explain why chaining is needed.


Answer 1:


A solution based on the DataFrame API. Note that the question's rules are (group, size upper bound, category) tuples, so each when needs both the group and size conditions:

from pyspark.sql import functions as F

# Seed the chain with the first rule, then append one .when() per remaining rule
g, s, c = category_rules[0]
cond = F.when((F.col('group') == g) & (F.col('size') < s), F.lit(c))
for g, s, c in category_rules[1:]:
    cond = cond.when((F.col('group') == g) & (F.col('size') < s), F.lit(c))
cond = cond.otherwise(F.lit('unknown'))

df.withColumn("category", cond).show()



Answer 2:


You can use string interpolation to create an expression such as:

CASE 
   WHEN (group = 'A') THEN 'small' 
   WHEN (group = 'B') THEN 'large'
   ELSE 'unknown'
END

And then use it in a Spark expression:

from pyspark.sql.functions import expr

data = [('A', '45345'), ('C', '55345'), ('A', '35345'), ('B', '65345')]
df = spark.createDataFrame(data, ['group', 'id'])

category_rules = [('A', 'small'), ('B', 'large')]

when_cases = [f"WHEN (group = '{r[0]}') THEN '{r[1]}'" for r in category_rules]

rules_expr = "CASE " + " ".join(when_cases) + " ELSE 'unkown' END"
# CASE WHEN (group = 'A') THEN 'small' WHEN (group = 'B') THEN 'large' ELSE 'unkown' END

df.withColumn('category', expr(rules_expr)).show()

# +-----+-----+--------+
# |group|   id|category|
# +-----+-----+--------+
# |    A|45345|   small|
# |    C|55345| unknown|
# |    A|35345|   small|
# |    B|65345|   large|
# +-----+-----+--------+
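If the size bound from the question's three-element rules is needed, the same string interpolation extends naturally. A minimal sketch, assuming the question's (group, size upper bound, category) tuples and a df that has a size column (reusing expr from above):

category_rules = [('A', 8, 'small'), ('A', 30, 'large'), ('B', 5, 'small')]

# Each rule becomes one WHEN clause that checks both group and size
when_cases = [
    f"WHEN (group = '{g}' AND size < {s}) THEN '{c}'"
    for g, s, c in category_rules
]
rules_expr = "CASE " + " ".join(when_cases) + " ELSE 'unknown' END"

df.withColumn('category', expr(rules_expr)).show()

Keep in mind that interpolating rule values into SQL text is only safe when the rules come from trusted code, not from user input.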



Answer 3:


I hope this solution works for you:

Create a new dataframe from the list of tuples you define, with the columns 'group' and 'category': category_rules = [('A', 'small'), ('B', 'large'), ...]. This will be your lookup dataframe:

lookup_df = spark.createDataFrame(category_rules, ['group', 'category'])

Then you can left join both dataframes on the column 'group', so every row of df gets the matching category value from lookup_df.

df = df.join(lookup_df, ['group'], 'left')

Because it is a left join, any group value in your df that is not included in lookup_df, like 'C', ends up with a null category.
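A minimal end-to-end sketch of this approach, assuming the simplified (group, category) rules (the size bounds from the question are not handled here) and using fillna to supply the 'unknown' default:

category_rules = [('A', 'small'), ('B', 'large')]
lookup_df = spark.createDataFrame(category_rules, ['group', 'category'])

# The left join keeps every row of df; groups without a matching rule get a
# null category, which fillna replaces with the 'unknown' default
df = df.join(lookup_df, ['group'], 'left').fillna('unknown', subset=['category'])
df.show()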



Source: https://stackoverflow.com/questions/64375061/how-to-dynamically-chain-when-conditions-in-pyspark
