How can I write a parquet file using Spark (pyspark)?

Anonymous (unverified) · submitted 2019-12-03 08:44:33

Question:

I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file, but so far without success. The documentation says that I can use the write.parquet function to create the file. However, when I run the script below, it shows me: AttributeError: 'RDD' object has no attribute 'write'

from pyspark import SparkContext

sc = SparkContext("local", "Protob Conversion to Parquet")

# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")

# Write the DataFrame out as a parquet file
df.write.parquet("/output/proto.parquet")

Do you know how to make this work?

The Spark version I'm using is Spark 2.0.1, built for Hadoop 2.7.3.

Answer 1:

The error was due to the fact that the textFile method from SparkContext returns an RDD, and what I needed was a DataFrame.

SparkSession has a SQLContext under the hood. So I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.
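
For context, if you do start from an RDD, it has to be converted to a DataFrame before .write is available. A minimal sketch, assuming comma-separated input with exactly two fields per line; the column names are hypothetical, not from the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()

# textFile gives an RDD of raw strings, with no schema attached
rdd = spark.sparkContext.textFile("/temp/proto_temp.csv")

# Split each line and attach hypothetical column names; every column
# stays a string, since textFile carries no type information
df = rdd.map(lambda line: line.split(",")).toDF(["col1", "col2"])

That said, it is simpler to skip the RDD entirely and read the CSV through the DataFrameReader: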

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
df = spark.read.csv("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.show()

df.write.parquet("output/proto.parquet")
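
As a side note, spark.read.csv with no options treats every column as a string and reads the first line as data. If the file has a header row (an assumption about the input, not something stated in the question), the standard DataFrameReader options handle both:

# header=True uses the first line as column names;
# inferSchema=True makes a second pass to detect column types
df = spark.read.csv("/temp/proto_temp.csv", header=True, inferSchema=True)

# Overwrite output from previous runs instead of failing on an existing path
df.write.mode("overwrite").parquet("output/proto.parquet")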

