Question
I'm using this tweets dataset with PySpark to process it and extract trends by tweet location, but I'm having a problem creating the DataFrame. I'm using
spark.read.options(header="True").csv("hashtag_donaldtrump.csv")
to create the DataFrame, but when I look at the tweets column, this is the result I get:
Do you know how I can clean the CSV file so that Spark can process it? Thank you in advance!
Answer 1:
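It looks like a multiline CSV: some quoted fields (such as the tweet text) contain line breaks, which Spark by default reads as the start of a new row. Try: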
df = spark.read.csv("hashtag_donaldtrump.csv", header=True, multiLine=True)
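One way to check whether the file really contains line breaks inside quoted fields, before involving Spark, is to compare the number of physical lines with the number of records that Python's standard csv module parses. The sketch below assumes the file is UTF-8 encoded and uses standard double-quote quoting; the helper name count_lines_and_records is made up for illustration.

```python
import csv


def count_lines_and_records(path, encoding="utf-8"):
    """Return (physical_lines, csv_records) for the CSV file at `path`.

    Hypothetical helper; assumes double-quoted fields and the given encoding.
    """
    # Physical lines: what a naive line-by-line reader sees.
    with open(path, encoding=encoding, newline="") as f:
        physical_lines = sum(1 for _ in f)

    # Logical records: csv.reader keeps newlines that sit inside quoted fields,
    # so a multi-line tweet still counts as a single record.
    with open(path, encoding=encoding, newline="") as f:
        csv_records = sum(1 for _ in csv.reader(f))

    return physical_lines, csv_records
```

If the record count comes out lower than the line count for hashtag_donaldtrump.csv, at least one field spans several lines, which is the case multiLine=True is meant to handle.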
Source: https://stackoverflow.com/questions/65723811/how-to-read-multiline-csv-file-in-pyspark