I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.
I have a Scala script that takes raw data from S3, proc
We had a similar issue and used the approach below to get a single output file:
1. Write the dataframe with headers, without using coalesce or repartition (after the transformations):
    dataframe.write.format("csv").option("header", "true").save(hdfs_path_for_multiple_files)
2. Read the files from the previous step, apply coalesce(1), and write again with headers:
    dataframe = spark.read.option('header', 'true').csv(hdfs_path_for_multiple_files)
    dataframe.coalesce(1).write.format('csv').option('header', 'true').save(hdfs_path_for_single_file)
This way, you avoid the performance problems that coalesce or repartition cause while the transformations execute (Step 1), and the second step produces a single output file with one header line.
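For reference, the two steps can be wrapped in one small helper. This is only a sketch, assuming PySpark and an already created SparkSession: the function name `write_single_csv` and its parameter names are made up for illustration and are not part of Spark.

```python
def write_single_csv(spark, dataframe, multi_path, single_path):
    """Two-pass CSV write (hypothetical helper, not a Spark API).

    spark       -- an existing SparkSession (assumed)
    dataframe   -- the DataFrame left after all transformations
    multi_path  -- directory for the intermediate multi-file output
    single_path -- directory for the final single-file output
    """
    # Step 1: write with whatever partitioning the transformations produced,
    # so the expensive work still runs in parallel.
    dataframe.write.format("csv").option("header", "true").save(multi_path)

    # Step 2: read the intermediate files back and collapse them into one
    # partition; only this cheap read/write pass runs as a single task.
    reread = spark.read.option("header", "true").csv(multi_path)
    reread.coalesce(1).write.format("csv").option("header", "true").save(single_path)

    return single_path


# Example call (paths are placeholders):
# write_single_csv(spark, dataframe,
#                  hdfs_path_for_multiple_files,
#                  hdfs_path_for_single_file)
```

Note that `single_path` is still a directory: Spark puts one part file (plus marker files such as `_SUCCESS`) inside it. Because the second pass reads the CSV without a schema, every column comes back as a string, which is harmless when the data is written straight back out as CSV.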