I would like to create a function in PySpark that gets a DataFrame and a list of parameters (codes/categorical features) and returns the data frame with additiona
The first step is to make a DataFrame from your CSV file.
See "Get CSV to Spark dataframe"; the first answer gives a line-by-line example.
Then you can add the columns. Assume you have a DataFrame object called df, and the columns are: [ID, TYPE, CODE].
The rest can be done with DataFrame.withColumn() and pyspark.sql.functions.when:
from pyspark.sql.functions import when

# Each withColumn call adds one 0/1 indicator column for a single TYPE value.
df_with_extra_columns = (
    df.withColumn("e_TYPE_A", when(df.TYPE == "A", 1).otherwise(0))
      .withColumn("e_TYPE_B", when(df.TYPE == "B", 1).otherwise(0))
)
(This adds the first two columns; the remaining ones follow the same pattern.)