How to save RDD data into json files, not folders

Backend · Open · 3 answers · 1058 views
挽巷 2020-12-07 04:46

I am receiving streaming data myDStream (DStream[String]) that I want to save in S3 (basically, for this question, it doesn't matter where exactly).
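For context, the usual way to persist a DStream batch by batch is through foreachRDD. A minimal sketch (the stream name `myDStream` comes from the question; the path prefix and timestamped layout are assumptions for illustration):

```scala
import org.apache.spark.streaming.dstream.DStream

// Sketch: write each non-empty micro-batch to its own timestamped directory.
// `prefix` is a placeholder, e.g. "s3a://my-bucket/stream-output".
def saveBatches(myDStream: DStream[String], prefix: String): Unit = {
  myDStream.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty())
      rdd.saveAsTextFile(s"$prefix/batch-${time.milliseconds}")
  }
}
```

Note that saveAsTextFile always produces a directory of part files, which is exactly the problem the answers below address.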

3 Answers
  •  温柔的废话
    2020-12-07 05:13

    As an alternative to rdd.collect.mkString("\n"), you can use the Hadoop FileSystem library to clean up the output by moving the part-00000 file into its place. The code below works perfectly on the local filesystem and HDFS, but I'm unable to test it with S3:

    val outputPath = "path/to/some/file.json"
    // Write the RDD to a temporary directory of part files
    rdd.saveAsTextFile(outputPath + "-tmp")
    
    import org.apache.hadoop.fs.Path
    val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
    // Move the part file into place, then remove the temporary directory
    fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
    fs.delete(new Path(outputPath + "-tmp"), true)
    
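    One caveat: saveAsTextFile emits one part file per partition, so renaming only part-00000 loses data unless the RDD has exactly one partition. A self-contained sketch of the same idea with coalesce(1) added (the local SparkSession, sample records, and output path are placeholders for illustration):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.hadoop.fs.Path

    val spark = SparkSession.builder().master("local[*]").appName("single-json").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq("""{"a":1}""", """{"b":2}"""))

    val outputPath = "out/file.json"
    // coalesce(1) forces a single partition, so exactly one part-00000 is written
    rdd.coalesce(1).saveAsTextFile(outputPath + "-tmp")

    val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
    fs.delete(new Path(outputPath + "-tmp"), true)
    ```

    Coalescing to one partition funnels all data through a single task, so this is only sensible for modest batch sizes.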
    
