Merge Spark output CSV files with a single header

天命终不由人 2021-01-01 11:40

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.

I have a Scala script that takes raw data from S3, proc

6 answers
  •  轮回少年
    2021-01-01 11:55

    1. Build the header line from the DataFrame schema ( val header = dataDF.schema.fieldNames.mkString(",") ).
    2. Create a file containing only that header on DSEFS.
    3. Append all the (headerless) partition files to the file from step 2 using the Hadoop FileSystem API.
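
    The three steps above could be sketched roughly as follows. This is only an illustration, not a tested production job: the object name MergeCsvWithHeader, the method merge, and the "part-*" file pattern are assumptions made for this example. It also assumes the part files were written without headers and that each one ends with a newline.

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical helper, named for this example only.
object MergeCsvWithHeader {

  /** Writes `header` to `target`, then appends every part file found in `partsDir`. */
  def merge(fs: FileSystem, header: String, partsDir: Path, target: Path): Unit = {
    // Step 2: create the target file (overwriting any old copy) and write the header line.
    val out = fs.create(target, true)
    try {
      out.write((header + "\n").getBytes(StandardCharsets.UTF_8))

      // Step 3: list the headerless part files in name order.
      // The "part-*" pattern is an assumption about how the job names its output.
      val parts = Option(fs.globStatus(new Path(partsDir, "part-*")))
        .getOrElse(Array.empty[FileStatus])
        .filter(_.isFile)
        .map(_.getPath)
        .sortBy(_.getName)

      // Copy each part into the target, leaving the output stream open between parts.
      parts.foreach { part =>
        val in = fs.open(part)
        try IOUtils.copyBytes(in, out, 4096, false)
        finally in.close()
      }
    } finally out.close()
  }
}
```

    In a Spark job, fs would typically come from something like partsDir.getFileSystem(spark.sparkContext.hadoopConfiguration), and header from step 1. The target file should not itself match the part-file pattern, otherwise it would be picked up as an input.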
