Apply a function to a single column of a csv in Spark

Submitted by 99封情书 on 2019-12-04 17:54:10

Question


Using Spark, I'm reading a CSV file and want to apply a function to one of its columns. I have some code that works, but it's very hacky. What is the proper way to do this?

My code

import sys
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession

SparkContext().addPyFile("myfile.py")  # ship myfile.py to the executors
spark = SparkSession\
    .builder\
    .appName("myApp")\
    .getOrCreate()
from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True,
    mode="DROPMALFORMED")
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1],
                                message_id=line[2],
                                message=myFunction(line[3]))).toDF()

I would like to call the function on the column by name, instead of mapping each row to line and then calling the function on line[index].

I'm using Spark version 2.0.1


Answer 1:


You can simply use a user-defined function (udf) combined with withColumn:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

udf_myFunction = udf(myFunction, IntegerType())  # if the function returns an int
df = df.withColumn("message", udf_myFunction("message"))  # pass the column's name; since the CSV was read with header=True, columns keep their header names (without a header they default to "_c0", "_c1", ...)

This replaces the message column of the dataframe df with the result of applying myFunction to each of its values (withColumn overwrites a column of the same name; pick a new name to keep the original).



Source: https://stackoverflow.com/questions/40977625/apply-a-function-to-a-single-column-of-a-csv-in-spark
