TypeError converting a Pandas Dataframe to Spark Dataframe in Pyspark


Question


I did my research but didn't find anything on this. I want to convert a simple pandas.DataFrame to a Spark DataFrame, like this:

df = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [1, 2, 3]})
sc_sql.createDataFrame(df, schema=df.columns.tolist()) 

The error I get is:

TypeError: Can not infer schema for type: <class 'str'>

I tried something even simpler:

df = pd.DataFrame([1, 2, 3])
sc_sql.createDataFrame(df)

And I get:

TypeError: Can not infer schema for type: <class 'numpy.int64'>

Any help? Do I need to specify a schema manually, or something like that?

sc_sql is a pyspark.sql.SQLContext; I am in a Jupyter notebook on Python 3.4 and Spark 1.6.

Thanks!


Answer 1:


This is related to your Spark version; more recent Spark releases have smarter type inference. You can fix it on Spark 1.6 by specifying the schema explicitly, like this:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

mySchema = StructType([StructField("col1", StringType(), True),
                       StructField("col2", IntegerType(), True)])
sc_sql.createDataFrame(df, schema=mySchema)
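
If you'd rather not spell out the types, another option is to convert every cell to a native Python type first, since the error comes from Spark 1.6 not recognizing numpy scalars like numpy.int64. A minimal sketch of that workaround (the rows and spark_df names are just for illustration):

# Turn numpy scalars into native Python types before handing rows to Spark.
# numpy scalars expose .item(); plain Python values (e.g. str) pass through.
rows = [[cell.item() if hasattr(cell, 'item') else cell for cell in row]
        for row in df.itertuples(index=False)]
spark_df = sc_sql.createDataFrame(rows, schema=df.columns.tolist())

On newer Spark versions this dance should be unnecessary, since createDataFrame accepts a pandas DataFrame with numpy dtypes directly.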


Source: https://stackoverflow.com/questions/37409920/typeerror-converting-a-pandas-dataframe-to-spark-dataframe-in-pyspark
