Merge Spark output CSV files with a single header

Backend · Unresolved

天命终不由人 2021-01-01 11:40

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.

I have a Scala script that takes raw data from S3, proc

6 answers

不知归路 (original poster)

2021-01-01 11:45
We had a similar issue and used the following approach to get a single output file:
1. Write the DataFrame to HDFS with headers, after the transformations, without using coalesce or repartition.
```
dataframe.write.format("csv").option("header", "true").save(hdfs_path_for_multiple_files)
```
2. Read the files from the previous step and write them back to a different location on HDFS with coalesce(1).
```
dataframe = spark.read.option('header', 'true').csv(hdfs_path_for_multiple_files)

dataframe.coalesce(1).write.format('csv').option('header', 'true').save(hdfs_path_for_single_file)
```
This way, you avoid the performance cost of running coalesce or repartition as part of the transformations (step 1), and the second step produces a single output file with one header line.
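If the multi-file output from step 1 has been copied to a local filesystem, the same "one header" result can also be had without a second Spark job. The sketch below is a plain-Python alternative, not part of the approach above. It assumes the part files are named `part-*.csv`, that each one starts with the same single-line header, and that the data fits the local disk. The function name `merge_part_files` and its arguments are made up for illustration.

```python
import glob
import os
import shutil


def merge_part_files(parts_dir, merged_path):
    """Concatenate part-*.csv files from parts_dir into merged_path with one header.

    Assumes every non-empty part file begins with the same single-line header,
    which is what Spark writes when option("header", "true") is set.
    """
    part_paths = sorted(glob.glob(os.path.join(parts_dir, "part-*.csv")))
    if not part_paths:
        raise FileNotFoundError(f"no part-*.csv files found in {parts_dir}")

    header_written = False
    with open(merged_path, "w", newline="", encoding="utf-8") as out:
        for path in part_paths:
            with open(path, "r", newline="", encoding="utf-8") as part:
                header = part.readline()
                if not header:
                    continue  # skip empty part files
                if not header_written:
                    out.write(header)  # keep the header from the first part only
                    header_written = True
                shutil.copyfileobj(part, out)  # copy the remaining data rows
    return merged_path
```

Note that even with coalesce(1), Spark's `save` writes a directory containing a single `part-*` file rather than a file at the given path, so a helper like this can also be pointed at that directory to get a file with a name of your choosing.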