How to save RDD data into json files, not folders

Backend · Open · 3 answers · 1058 views
挽巷 2020-12-07 04:46

I am receiving streaming data myDStream (DStream[String]) that I want to save in S3 (basically, for this question, it doesn't matter where exactly).
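For context, the usual way to persist a DStream batch by batch is through foreachRDD. A minimal sketch (the stream name `myDStream` comes from the question; the path prefix and timestamped layout are assumptions for illustration):

```scala
import org.apache.spark.streaming.dstream.DStream

// Sketch: write each non-empty micro-batch to its own timestamped directory.
// `prefix` is a placeholder, e.g. "s3a://my-bucket/stream-output".
def saveBatches(myDStream: DStream[String], prefix: String): Unit = {
  myDStream.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty())
      rdd.saveAsTextFile(s"$prefix/batch-${time.milliseconds}")
  }
}
```

Note that saveAsTextFile always produces a directory of part files, which is exactly the problem the answers below address.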

3 Answers
  •  温柔的废话
    2020-12-07 05:13

    As an alternative to rdd.collect.mkString("\n"), you can use the Hadoop FileSystem library to clean up the output by moving the part-00000 file into its place. The code below works perfectly on the local filesystem and HDFS, but I'm unable to test it with S3:

    val outputPath = "path/to/some/file.json"
    // Write the RDD to a temporary directory of part files
    rdd.saveAsTextFile(outputPath + "-tmp")
    
    import org.apache.hadoop.fs.Path
    val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
    // Move the part file into place, then remove the temporary directory
    fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
    fs.delete(new Path(outputPath + "-tmp"), true)
    
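    One caveat: saveAsTextFile emits one part file per partition, so renaming only part-00000 loses data unless the RDD has exactly one partition. A self-contained sketch of the same idea with coalesce(1) added (the local SparkSession, sample records, and output path are placeholders for illustration):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.hadoop.fs.Path

    val spark = SparkSession.builder().master("local[*]").appName("single-json").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq("""{"a":1}""", """{"b":2}"""))

    val outputPath = "out/file.json"
    // coalesce(1) forces a single partition, so exactly one part-00000 is written
    rdd.coalesce(1).saveAsTextFile(outputPath + "-tmp")

    val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
    fs.delete(new Path(outputPath + "-tmp"), true)
    ```

    Coalescing to one partition funnels all data through a single task, so this is only sensible for modest batch sizes.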
    
