Add a column to a DataFrame conditionally in PySpark

Question

I have a DataFrame in PySpark and would like to add a column to it conditionally.

If the DataFrame doesn't have the column, add it with null values. If the column is already present, do nothing and return the same DataFrame.

How do I express this condition in PySpark?

Answer 1:

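It is not hard, but you'll need a bit more than a column name to do it right: you also need the data type of the new column, because a bare null literal carries no useful type. Required imports: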

from pyspark.sql import types as t
from pyspark.sql.functions import lit
from pyspark.sql import DataFrame

Example data (sc is assumed to be an active SparkContext):

df = sc.parallelize([("a", 1, [1, 2, 3])]).toDF(["x", "y", "z"])

A helper function (strip the type annotations if you are on a legacy Python version):

def add_if_not_present(df: DataFrame, name: str, dtype: t.DataType) -> DataFrame:
    # Return df unchanged if the column exists; otherwise append a typed null column.
    return (df if name in df.columns
        else df.withColumn(name, lit(None).cast(dtype)))

Example usage:

add_if_not_present(df, "foo", t.IntegerType())

DataFrame[x: string, y: bigint, z: array<bigint>, foo: int]

add_if_not_present(df, "x", t.IntegerType())

DataFrame[x: string, y: bigint, z: array<bigint>]

add_if_not_present(df, "foobar", 
  t.StructType([
      t.StructField("foo", t.IntegerType()), 
      t.StructField("bar", t.IntegerType())]))

DataFrame[x: string, y: bigint, z: array<bigint>, foobar: struct<foo:int,bar:int>]
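The same idea can be stretched to several columns at once. The following is only a sketch, not part of the original answer: the names missing_columns, add_missing_columns, and the case_sensitive flag are made up for illustration. It also compares names case-insensitively by default, on the assumption that Spark is running with its default spark.sql.caseSensitive=false, in which case withColumn("X", ...) would replace an existing column x rather than add a new one.

```python
from typing import Dict, Iterable, List, Tuple


def missing_columns(existing: Iterable[str], wanted: Dict[str, object],
                    case_sensitive: bool = False) -> List[Tuple[str, object]]:
    """Return the (name, dtype) pairs from `wanted` that are absent from `existing`."""
    def norm(name: str) -> str:
        # Assumption: Spark resolves column names case-insensitively by default.
        return name if case_sensitive else name.lower()

    present = {norm(c) for c in existing}
    return [(name, dtype) for name, dtype in wanted.items()
            if norm(name) not in present]


def add_missing_columns(df, wanted: Dict[str, object], case_sensitive: bool = False):
    """Append a typed null column for every wanted column that df lacks."""
    # Imported here so that missing_columns() stays usable without Spark installed.
    from pyspark.sql.functions import lit

    for name, dtype in missing_columns(df.columns, wanted, case_sensitive):
        # dtype may be a DataType instance or a type string such as "int".
        df = df.withColumn(name, lit(None).cast(dtype))
    return df
```

With the example data above, add_missing_columns(df, {"foo": "int", "X": "string"}) would add only foo, since X matches the existing column x under the case-insensitive comparison.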

Source: https://stackoverflow.com/questions/41755003/add-column-to-data-frame-conditionally-in-pyspark

Tags

python

apache-spark

dataframe

pyspark

multiple-columns
