Merge Spark output CSV files with a single header

天命终不由人 2021-01-01 11:40

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.

I have a Scala script that takes raw data from S3, processes it with Spark, and writes the result out as CSV. The write produces a folder of part files, and I want to merge them into a single CSV file with just one header.

6 Answers
  •  Happy的楠姐
    2021-01-01 11:58

    To merge files in a folder into one file:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs._

    // Concatenate every file under srcPath into a single file at dstPath.
    // The `false` argument keeps the source files; the trailing `null` means
    // no separator string is appended after each file.
    def merge(srcPath: String, dstPath: String): Unit = {
      val hadoopConfig = new Configuration()
      val hdfs = FileSystem.get(hadoopConfig)
      FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
    }
    

    If you want to merge all files into one output file while still writing into a folder (note that this forces all of the data through a single partition on one executor, so it will not scale to very large datasets):

    dataFrame
          .coalesce(1)
          .write
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .save(out)
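
    With Spark 2.0 and later the CSV writer is built in, so the com.databricks.spark.csv package is no longer needed and the same write can be expressed as (a minimal equivalent sketch, with out standing in for the output directory):

    dataFrame
          .coalesce(1)
          .write
          .option("header", "true")
          .csv(out)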
    

    Another option is to use the single-partition approach above and then move the one part file out of the output folder to another path, under the name you actually want for the CSV, as in the helper below.

    import java.io.File

    import org.apache.spark.sql.DataFrame

    // Write df as a single CSV file at fileName.
    // Note: this uses java.io.File, so it assumes tmpDir ends up on the local filesystem.
    def df2csv(df: DataFrame, fileName: String, sep: String = ",", header: Boolean = false): Unit = {
        val tmpDir = "tmpDir"

        // Force a single partition so only one part file is produced
        df.repartition(1)
          .write
          .format("com.databricks.spark.csv")
          .option("header", header.toString)
          .option("delimiter", sep)
          .save(tmpDir)

        // Find the single part file (newer Spark versions append a UUID to its name),
        // rename it to the requested file name, then clean up the temporary folder
        val dir = new File(tmpDir)
        val tmpCsvFile = dir.listFiles.find(_.getName.startsWith("part-")).get
        tmpCsvFile.renameTo(new File(fileName))

        dir.listFiles.foreach(f => f.delete)
        dir.delete()
    }
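
    A minimal usage sketch (the DataFrame and file name are placeholders); after this call there should be a single processed.csv on the driver's local filesystem:

    df2csv(dataFrame, "processed.csv", header = true)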
    
