Question:
Problem:
a) Read a local file into a pandas DataFrame, say PD_DF.
b) Manipulate/massage PD_DF and add columns to the DataFrame.
c) Write PD_DF to HDFS using Spark.
How do I do it?
Answer 1:
You can use the SQLContext object to invoke the createDataFrame method, which accepts input data that can optionally be a pandas DataFrame object.
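For completeness, here is a minimal sketch of that suggestion covering steps (a)-(c) from the question, assuming a Spark 1.x-style setup; the file name, app name, added column, and HDFS path are placeholders, not from the answer.

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext(appName="pandas-to-spark")   # placeholder app name
sqlContext = SQLContext(sc)

pd_df = pd.read_csv("local_file.csv")          # (a) read a local file into pandas (placeholder file)
pd_df["row_id"] = range(len(pd_df))            # (b) example of adding a column
spark_df = sqlContext.createDataFrame(pd_df)   # pandas DataFrame -> Spark DataFrame
spark_df.write.parquet("hdfs:///tmp/pd_df.parquet")  # (c) write to HDFS (placeholder path)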
Answer 2:
Let's say dataframe is of type pandas.core.frame.DataFrame. In Spark 2.1 (PySpark) I did this:
rdd_data = spark.createDataFrame(dataframe).rdd
If you want to rename any columns or select only a few columns, do that before the use of .rdd.
Hope it works for you too.
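A small sketch of what "rename or select before .rdd" can look like, assuming a Spark 2.x SparkSession; the column names and sample data here are hypothetical, not from the answer.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("example").getOrCreate()  # placeholder app name
dataframe = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})       # hypothetical data

rdd_data = (spark.createDataFrame(dataframe)
            .withColumnRenamed("a", "id")   # rename before .rdd
            .select("id")                   # keep only some columns
            .rdd)
print(rdd_data.collect())                   # [Row(id=1), Row(id=2)]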
Answer 3:
I use Spark 1.6.0. First transform the pandas DataFrame into a Spark DataFrame, then the Spark DataFrame into a Spark RDD:
sparkDF = sqlContext.createDataFrame(pandasDF)
sparkRDD = sparkDF.rdd.map(list)
type(sparkRDD)
# pyspark.rdd.PipelinedRDD
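If the end goal is step (c) from the question, one way to land that RDD on HDFS is saveAsTextFile; this is only a sketch, the output path is a placeholder, and writing the Spark DataFrame directly with sparkDF.write is usually simpler.

# sparkRDD is an RDD of lists, as produced above
sparkRDD.map(lambda row: ",".join(str(v) for v in row)) \
        .saveAsTextFile("hdfs:///tmp/pd_df_out")  # placeholder HDFS path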
Source: https://stackoverflow.com/questions/29635776/can-i-convert-pandas-dataframe-to-spark-rdd