I would like to create a function in PYSPARK that get Dataframe and list of parameters (codes/categorical features) and return the data frame with additiona
First you need to collect distinct values of TYPES and CODE. Then either select add column with name of each value using withColumn or use select fro each column.
Here is sample code using select statement:-
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([
(1, "A", "X1"),
(2, "B", "X2"),
(3, "B", "X3"),
(1, "B", "X3"),
(2, "C", "X2"),
(3, "C", "X2"),
(1, "C", "X1"),
(1, "B", "X1"),
], ["ID", "TYPE", "CODE"])
types = df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect()
codes = df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect()
types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]
df = df.select("ID", "TYPE", "CODE", *types_expr+codes_expr)
df.show()
OUTPUT
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+