How to dynamically chain when conditions in Pyspark?

Submitted by 两盒软妹~` on 2020-12-29 08:42:00

Question


Context

A dataframe should get a category column whose value is derived from a set of fixed rules. The set of rules is becoming quite large.

Question

Is there a way to use a list of tuples (see the example below) to dynamically chain the when conditions and achieve the same result as the hard-coded solution at the bottom?

# Potential list of rule definitions
category_rules = [
    ('A', 8, 'small'),
    ('A', 30, 'large'),
    ('B', 5, 'small'),
    # format: (group, size upper bound) --> category
    # and so on ...
]

Example

Here is a toy example for reproducibility. A dataframe consisting of groups, ids and sizes should get a category column whose value depends on the group and size columns. The list of rules is shown in the section above.

Input data
data = [('A', '45345', 5), ('C', '55345', 5), ('A', '35345', 10), ('B', '65345', 4)]
df = spark.createDataFrame(data, ['group', 'id', 'size'])
+-----+-----+-----+
|group|   id| size|
+-----+-----+-----+
|    A|45345|    5|
|    C|55345|    5|
|    A|35345|   10|
|    B|65345|    4|
+-----+-----+-----+
Hard-coded solution
df = df.withColumn(
    'category',
    F.when(
        (F.col('group') == 'A')
        & (F.col('size') < 8),
        F.lit('small')
    ).when(
        (F.col('group') == 'A')
        & (F.col('size') < 30),
        F.lit('large')
    ).when(
        (F.col('group') == 'B')
        & (F.col('size') < 5),
        F.lit('small')
    ).otherwise(
        F.lit('unknown')
    )
)
+-----+-----+----+--------+
|group|   id|size|category|
+-----+-----+----+--------+
|    A|45345|   5|   small|
|    C|55345|   5| unknown|
|    A|35345|  10|   large|
|    B|65345|   4|   small|
+-----+-----+----+--------+

[Edit 1] Added more complex conditions to explain why chaining is needed.


Answer 1:


A solution based on the DataFrame API. Note that the question's rules are (group, size upper bound, category) tuples, so each when needs both the group and size conditions:

from pyspark.sql import functions as F

# Seed the chain with the first rule, then append one .when() per remaining rule
g, s, c = category_rules[0]
cond = F.when((F.col('group') == g) & (F.col('size') < s), F.lit(c))
for g, s, c in category_rules[1:]:
    cond = cond.when((F.col('group') == g) & (F.col('size') < s), F.lit(c))
cond = cond.otherwise(F.lit('unknown'))

df.withColumn("category", cond).show()



Answer 2:


You can use string interpolation to create an expression such as:

CASE 
   WHEN (group = 'A') THEN 'small' 
   WHEN (group = 'B') THEN 'large'
   ELSE 'unknown'
END

And then use it in a Spark expression:

from pyspark.sql.functions import expr

data = [('A', '45345'), ('C', '55345'), ('A', '35345'), ('B', '65345')]
df = spark.createDataFrame(data, ['group', 'id'])

category_rules = [('A', 'small'), ('B', 'large')]

when_cases = [f"WHEN (group = '{r[0]}') THEN '{r[1]}'" for r in category_rules]

rules_expr = "CASE " + " ".join(when_cases) + " ELSE 'unkown' END"
# CASE WHEN (group = 'A') THEN 'small' WHEN (group = 'B') THEN 'large' ELSE 'unkown' END

df.withColumn('category', expr(rules_expr)).show()

# +-----+-----+--------+
# |group|   id|category|
# +-----+-----+--------+
# |    A|45345|   small|
# |    C|55345| unknown|
# |    A|35345|   small|
# |    B|65345|   large|
# +-----+-----+--------+
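If the size bound from the question's three-element rules is needed, the same string interpolation extends naturally. A minimal sketch, assuming the question's (group, size upper bound, category) tuples and a df that has a size column (reusing expr from above):

category_rules = [('A', 8, 'small'), ('A', 30, 'large'), ('B', 5, 'small')]

# Each rule becomes one WHEN clause that checks both group and size
when_cases = [
    f"WHEN (group = '{g}' AND size < {s}) THEN '{c}'"
    for g, s, c in category_rules
]
rules_expr = "CASE " + " ".join(when_cases) + " ELSE 'unknown' END"

df.withColumn('category', expr(rules_expr)).show()

Keep in mind that interpolating rule values into SQL text is only safe when the rules come from trusted code, not from user input.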



Answer 3:


I hope this solution works for you:

Create a new dataframe from the list of tuples you define, with the columns 'group' and 'category': category_rules = [('A', 'small'), ('B', 'large'), ...]. This will be your lookup dataframe:

lookup_df = spark.createDataFrame(category_rules, ['group', 'category'])

Then you can left join both dataframes on the column 'group', so every row of df gets the matching category value from lookup_df.

df = df.join(lookup_df, ['group'], 'left')

Because it is a left join, any group value in your df that is not included in lookup_df, like 'C', ends up with a null category.
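A minimal end-to-end sketch of this approach, assuming the simplified (group, category) rules (the size bounds from the question are not handled here) and using fillna to supply the 'unknown' default:

category_rules = [('A', 'small'), ('B', 'large')]
lookup_df = spark.createDataFrame(category_rules, ['group', 'category'])

# The left join keeps every row of df; groups without a matching rule get a
# null category, which fillna replaces with the 'unknown' default
df = df.join(lookup_df, ['group'], 'left').fillna('unknown', subset=['category'])
df.show()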



Source: https://stackoverflow.com/questions/64375061/how-to-dynamically-chain-when-conditions-in-pyspark
