Merge Spark output CSV files with a single header

Backend, unresolved

天命终不由人 2021-01-01 11:40

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.

I have a Scala script that takes raw data from S3, proc

6 answers

轮回少年 (original poster)

2021-01-01 11:55
1. Build the header line from the DataFrame schema: val header = dataDF.schema.fieldNames.reduce(_ + "," + _)
2. Create a file on DSEFS that contains only that header line.
3. Append all of the (headerless) partition files to the file from step 2 using the Hadoop FileSystem API.
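
The three steps above could be sketched roughly as follows. This is a minimal sketch rather than the answerer's actual code: the object name MergeCsvWithHeader, the method merge, and the parameters outputDir and mergedFile are hypothetical. It assumes the part files and the merged file live on the same Hadoop-compatible filesystem (DSEFS, HDFS, S3A, or local), and that the column names contain no commas or quotes, since the header is joined without any CSV escaping.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils
import org.apache.spark.sql.DataFrame

object MergeCsvWithHeader {

  // outputDir and mergedFile are hypothetical paths chosen by the caller.
  def merge(dataDF: DataFrame, outputDir: String, mergedFile: String): Unit = {
    // Step 1: build the header line from the schema
    // (mkString gives the same result as the reduce in the answer).
    val header = dataDF.schema.fieldNames.mkString(",")

    // Write the partition files without a header row.
    dataDF.write.option("header", "false").csv(outputDir)

    val conf = dataDF.sparkSession.sparkContext.hadoopConfiguration
    val srcDir = new Path(outputDir)
    val fs = srcDir.getFileSystem(conf)

    // Step 2: create the merged file and write the header first.
    val out = fs.create(new Path(mergedFile), true) // true = overwrite
    try {
      out.write((header + "\n").getBytes("UTF-8"))

      // Step 3: append every part file in name order,
      // skipping markers such as _SUCCESS.
      val parts = fs.listStatus(srcDir)
        .map(_.getPath)
        .filter(_.getName.startsWith("part-"))
        .sortBy(_.getName)

      parts.foreach { part =>
        val in = fs.open(part)
        try IOUtils.copyBytes(in, out, conf, false) // false = keep streams open
        finally in.close()
      }
    } finally {
      out.close()
    }
  }
}
```

Calling MergeCsvWithHeader.merge(dataDF, outputDir, mergedFile) leaves a single CSV whose first line is the header. The copy runs on the driver as a plain stream copy, so nothing is collected into memory, but very large outputs will still take a while to concatenate.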