How can I write a parquet file using Spark (pyspark)?

Anonymous (unverified) · submitted 2019-12-03 08:44:33

Question:

I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file, but so far without success. The documentation says that I can use the write.parquet function to create the file. However, when I run the script below, it shows me: AttributeError: 'RDD' object has no attribute 'write'

from pyspark import SparkContext

sc = SparkContext("local", "Protob Conversion to Parquet")

# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")

# Write the DataFrame out as a parquet file
df.write.parquet("/output/proto.parquet")

Do you know how to make this work?

The Spark version I'm using is Spark 2.0.1, built for Hadoop 2.7.3.

Answer 1:

The error was due to the fact that the textFile method from SparkContext returns an RDD, and what I needed was a DataFrame.

SparkSession has a SQLContext under the hood. So I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.
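
For context, if you do start from an RDD, it has to be converted to a DataFrame before .write is available. A minimal sketch, assuming comma-separated input with exactly two fields per line; the column names are hypothetical, not from the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()

# textFile gives an RDD of raw strings, with no schema attached
rdd = spark.sparkContext.textFile("/temp/proto_temp.csv")

# Split each line and attach hypothetical column names; every column
# stays a string, since textFile carries no type information
df = rdd.map(lambda line: line.split(",")).toDF(["col1", "col2"])

That said, it is simpler to skip the RDD entirely and read the CSV through the DataFrameReader: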

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
df = spark.read.csv("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.show()

df.write.parquet("output/proto.parquet")
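
As a side note, spark.read.csv with no options treats every column as a string and reads the first line as data. If the file has a header row (an assumption about the input, not something stated in the question), the standard DataFrameReader options handle both:

# header=True uses the first line as column names;
# inferSchema=True makes a second pass to detect column types
df = spark.read.csv("/temp/proto_temp.csv", header=True, inferSchema=True)

# Overwrite output from previous runs instead of failing on an existing path
df.write.mode("overwrite").parquet("output/proto.parquet")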

