pyspark type error on reading a pandas dataframe

可紊 submitted on 2020-06-12 07:15:24

Question


I read some CSV file into pandas, nicely preprocessed it and set dtypes to desired values of float, int, category. However, when trying to import it into spark I get the following error:

Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>

After tracing it for a while, I found the source of my trouble in the CSV file itself:

"myColumns"
""
"A"

Read into pandas like: small = pd.read_csv(os.path.expanduser('myCsv.csv'))

And failing to import it to spark with:

sparkDF = spark.createDataFrame(small)

Currently I am using Spark 2.0.0.

Possibly multiple columns are affected. How can I deal with this problem?
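The failure can be reproduced without a file on disk. In the sketch below (reading the same CSV content from a string via io.StringIO, an assumption for illustration), pandas parses the empty quoted field as NaN, which is a float, so the column mixes float and str values; when Spark samples rows to infer a schema it sees DoubleType for one row and StringType for the other:

```python
import io
import pandas as pd

# The same CSV content as above, read from an in-memory string.
csv = '"myColumns"\n""\n"A"\n'
small = pd.read_csv(io.StringIO(csv))

# The empty quoted field becomes NaN (a float), so the object column
# holds a float and a str. Spark's schema inference then fails to
# merge DoubleType with StringType, producing the error above.
print([type(v).__name__ for v in small["myColumns"]])  # ['float', 'str']
```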


Answer 1:


You'll need to define the Spark DataFrame schema explicitly and pass it to the createDataFrame function:

from pyspark.sql.types import StructType, StructField, StringType
import pandas as pd

small = pd.read_csv("data.csv")
small.head()
#  myColumns
# 0       NaN
# 1         A
sch = StructType([StructField("myColumns", StringType(), True)])

df = spark.createDataFrame(small, sch)
df.show()
# +---------+
# |myColumns|
# +---------+
# |      NaN|
# |        A|
# +---------+

df.printSchema()
# root
# |-- myColumns: string (nullable = true)
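If many columns are affected and spelling out a full schema is tedious, an alternative (a sketch, not part of the original answer; the column names here are hypothetical) is to replace NaN with None on the pandas side, so every affected column holds only strings or None and Spark infers a nullable StringType on its own:

```python
import pandas as pd

# Hypothetical frame standing in for the preprocessed CSV; both
# columns contain NaN alongside strings.
small = pd.DataFrame({"myColumns": [float("nan"), "A"],
                      "other": [float("nan"), "B"]})

# Cast to object and swap NaN for None: each column then holds only
# str or None, so Spark can infer StringType (nullable) without
# trying to merge DoubleType and StringType.
cleaned = small.astype(object).where(small.notna(), None)

# sparkDF = spark.createDataFrame(cleaned)  # no explicit schema needed
```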


Source: https://stackoverflow.com/questions/39888188/pyspark-type-error-on-reading-a-pandas-dataframe
