Converting rdd to dataframe: AttributeError: 'RDD' object has no attribute 'toDF' [duplicate]

Posted by 半世苍凉 on 2020-05-27 09:17:46

Question


from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("myApp").setMaster("local")
sc = SparkContext(conf=conf)

a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(['ind', "state"])

a.show()

Results in:

Traceback (most recent call last):
  File "/Users/ktemlyakov/messing_around/SparkStuff/mock_maersk_data.py", line 7, in <module>
    a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(['ind', "state"])
AttributeError: 'RDD' object has no attribute 'toDF'

What am I missing?


Answer 1:


The SQLContext is missing; it must be created before toDF can be called on an RDD. The following code works:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
# Instantiating SQLContext is what makes toDF available on RDDs.
sqlContext = SQLContext(sc)

a = sc.parallelize([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]]).toDF(['ind', "state"])

a.show()
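
Why creating a context fixes the error: PySpark attaches toDF to the RDD class as a side effect of constructing a SQLContext or SparkSession. The pure-Python sketch below imitates that mechanism without needing Spark. FakeRDD and FakeSQLContext are hypothetical stand-ins invented for illustration, not PySpark classes, and the list-of-dicts "DataFrame" is only a placeholder.

```python
class FakeRDD:
    """Hypothetical stand-in for pyspark's RDD; it starts out without toDF."""

    def __init__(self, rows):
        self.rows = rows


class FakeSQLContext:
    """Hypothetical stand-in for SQLContext / SparkSession."""

    def __init__(self):
        # Constructing the context attaches toDF to the RDD class itself,
        # so every RDD instance, existing or future, gains the method.
        def to_df(rdd, names):
            return [dict(zip(names, row)) for row in rdd.rows]

        FakeRDD.toDF = to_df


rdd = FakeRDD([[1, "a"], [2, "b"]])

# Before any context exists, the call fails just like in the question.
try:
    rdd.toDF(["ind", "state"])
    error_before = None
except AttributeError as exc:
    error_before = str(exc)

FakeSQLContext()  # side effect: FakeRDD now has toDF
records = rdd.toDF(["ind", "state"])

print(error_before)
print(records)
```

The same RDD object fails before the context is built and succeeds afterwards, which is why the only change needed in the question's script is the extra line that constructs the context.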

Edit:

In Spark 2.0 and later, the same result can be achieved with SparkSession:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").config(conf=SparkConf()).getOrCreate()

a = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]], ['ind', "state"])
a.show()



Answer 2:


You can call toDF() directly on the RDD (here a is the RDD, and a SQLContext or SparkSession must already exist):

a_df = a.toDF()
type(a_df)


Source: https://stackoverflow.com/questions/47341048/converting-rdd-to-dataframe-attributeerror-rdd-object-has-no-attribute-todf
