Reading csv data into SparkR after writing it out from a DataFrame

Anonymous (unverified), submitted 2019-12-03 01:08:02

Question:

I followed the example in this post to write out a DataFrame as a CSV to an AWS S3 bucket. The result was not a single file but a folder containing many .csv files. I'm now having trouble reading this folder back in as a DataFrame in SparkR. Below is what I've tried; neither attempt produces the same DataFrame that I wrote out.

    write.df(df, 's3a://bucket/df', source="csv")  # Creates a folder named df in S3 bucket

    df_in1 <- read.df("s3a://bucket/df", source="csv")
    df_in2 <- read.df("s3a://bucket/df/*.csv", source="csv")
    # Neither df_in1 or df_in2 result in DataFrames that are the same as df

Answer 1:

    # Spark 1.4 is used in this example
    #
    # Download the nyc flights dataset as a CSV from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv

    # Launch SparkR using
    # ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3

    # The SparkSQL context should already be created for you as sqlContext
    sqlContext
    # Java ref type org.apache.spark.sql.SQLContext id 1

    # Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
    flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")

    # Print the first few rows
    head(flights)

Hope this example helps.
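
The code in the question passes source="csv" and no sqlContext, which suggests Spark 2.x with the built-in CSV source rather than the Spark 1.4 setup above. For that case, a round trip might be sketched as below. This is only a sketch under assumptions the question does not confirm: that the mismatch comes from the header row not being written and from every column being read back as a string. The path s3a://bucket/df is the one from the question, and faithful is just a stand-in dataset.

```r
library(SparkR)

# Spark 2.x entry point; assumes S3 credentials and the s3a connector are already configured
sparkR.session()

# Stand-in for the DataFrame in the question (assumption: any small R data.frame will do)
df <- createDataFrame(faithful)

# Write a header row into each part file. The output is still a folder of part files.
write.df(df, "s3a://bucket/df", source = "csv", header = "true", mode = "overwrite")

# Read the folder itself; no glob is needed.
# header = "true" restores the column names, and inferSchema = "true" asks Spark
# to infer column types instead of reading every column as a string.
df_in <- read.df("s3a://bucket/df", source = "csv", header = "true", inferSchema = "true")

# Compare the schemas of what was written and what was read back
printSchema(df)
printSchema(df_in)
head(df_in)
```

Even when the schemas match, the row order of df_in is not guaranteed to match df, since the rows are spread across part files. Sorting both DataFrames on a key column before comparing them avoids a false mismatch.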


