Merge Spark output CSV files with a single header

天命终不由人 2021-01-01 11:40

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.

I have a Scala script that takes raw data from S3 and processes it with Spark. The output is written as CSV and ends up split across multiple part files, and I want to merge them into a single CSV file with a single header line.

6 Answers
  •  不知归路
    2021-01-01 11:45

We had a similar issue and followed the approach below to get a single output file:

    1. Write the dataframe to HDFS with headers, without using coalesce or repartition (after the transformations):
    dataframe.write.format("csv").option("header", "true").save(hdfs_path_for_multiple_files)
    
    2. Read the files from the previous step and write them back to a different location on HDFS with coalesce(1):
    dataframe = spark.read.option('header', 'true').csv(hdfs_path_for_multiple_files)
    
    dataframe.coalesce(1).write.format('csv').option('header', 'true').save(hdfs_path_for_single_file)
    

    This way, you avoid the performance issues that coalesce or repartition would cause during the execution of the transformations (step 1), and the second step produces a single output file with one header line.
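
    To make the two steps concrete, here is a minimal, runnable PySpark sketch of the same idea. The sample dataframe and the HDFS paths are placeholders I added for illustration; substitute the result of your own transformations and your own output locations.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-csv-single-header").getOrCreate()

    # Stand-in for the dataframe produced by your transformations (assumption).
    dataframe = spark.createDataFrame(
        [(1, "alice"), (2, "bob"), (3, "carol")],
        ["id", "name"],
    )

    # Placeholder output locations (assumptions).
    hdfs_path_for_multiple_files = "hdfs:///tmp/example_multi"
    hdfs_path_for_single_file = "hdfs:///tmp/example_single"

    # Step 1: write with full parallelism; this produces many part-*.csv files,
    # each with its own header line.
    dataframe.write.format("csv").option("header", "true").save(hdfs_path_for_multiple_files)

    # Step 2: read the part files back (with header=true, Spark treats each
    # file's first line as a header rather than data) and rewrite through
    # coalesce(1) to get one part file with a single header line.
    merged = spark.read.option("header", "true").csv(hdfs_path_for_multiple_files)
    merged.coalesce(1).write.format("csv").option("header", "true").save(hdfs_path_for_single_file)

    spark.stop()

    Note that coalesce(1) forces the final write through a single task, which is fine here because it only happens on the already-transformed data rather than inside the heavy transformation job.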
