Question
I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file in Spark, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'
from pyspark import SparkContext
sc = SparkContext("local", "Protob Conversion to Parquet ")
# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")
# Displays the content of the DataFrame to stdout
df.write.parquet("/output/proto.parquet")
Do you know how to make this work?
The Spark version I'm using is Spark 2.0.1 built for Hadoop 2.7.3.
Answer 1:
The error was due to the fact that the textFile method of SparkContext returns an RDD, and what I needed was a DataFrame.
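For completeness: an existing RDD can also be converted to a DataFrame directly. This is a minimal sketch, assuming comma-delimited lines and hypothetical column names; the cleaner fix, though, is to read the file with the DataFrameReader, as described next.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# textFile returns an RDD[str]; split each line into its fields
rdd = spark.sparkContext.textFile("/temp/proto_temp.csv")

# toDF converts the RDD to a DataFrame; the column names here are hypothetical
df = rdd.map(lambda line: line.split(",")).toDF(["col1", "col2"])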
SparkSession has a SQLContext under the hood, so I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read the CSV file into a DataFrame
df = spark.read.csv("/temp/proto_temp.csv")

# display the content of the DataFrame on stdout
df.show()

# write the DataFrame out as a parquet file
df.write.parquet("output/proto.parquet")
Source: https://stackoverflow.com/questions/42022890/how-can-i-write-a-parquet-file-using-spark-pyspark