How to save RDD data into JSON files, not folders

挽巷 2020-12-07 04:46

I am receiving the streaming data myDStream (DStream[String]) that I want to save in S3 (basically, for this question, it doesn't matter where exactly). How can I save each RDD as a single JSON file rather than as a folder of part- files?
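
For reference, the stock streaming API produces folders rather than files; a minimal sketch of that default behavior (the bucket and prefix below are placeholders):

    // saveAsTextFiles creates one DIRECTORY per batch, named
    // <prefix>-<batchTimeMs>.<suffix>, each holding part-0000N files.
    myDStream.saveAsTextFiles("s3a://my-bucket/output/records", "json")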

3 Answers
  • 2020-12-07 05:13

    As an alternative to rdd.collect.mkString("\n"), you can use the Hadoop FileSystem library to clean up the output by moving the part-00000 file into its place. The code below works perfectly on the local filesystem and HDFS, but I'm unable to test it with S3:

    val outputPath = "path/to/some/file.json"
    // Spark always writes a directory, so write to a temporary one first.
    rdd.saveAsTextFile(outputPath + "-tmp")

    import org.apache.hadoop.fs.Path
    // `spark` is the active SparkSession.
    val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
    // Promote the single part file to the target name, then drop the temp directory.
    fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
    fs.delete(new Path(outputPath + "-tmp"), true)
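
    Since the question involves a DStream, here is a minimal sketch of wiring the same rename into each batch (the output path is a placeholder, and the RDD is coalesced to one partition so there is exactly one part file to move):

        import org.apache.hadoop.fs.{FileSystem, Path}

        // For each batch, write to a temp folder, then promote the single
        // part file to a time-stamped JSON file name.
        myDStream.foreachRDD { (rdd, time) =>
          val outputPath = s"path/to/output/batch-${time.milliseconds}.json"
          rdd.coalesce(1).saveAsTextFile(outputPath + "-tmp") // exactly one part file
          val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
          fs.rename(new Path(outputPath + "-tmp/part-00000"), new Path(outputPath))
          fs.delete(new Path(outputPath + "-tmp"), true)
        }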
    
  • 2020-12-07 05:32

    I implemented this one in Java. Hope it helps:

        import java.io.File;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        // Find the part file Spark wrote, move it to its final name, drop the folder.
        File dir = new File(System.getProperty("user.dir") + "/my.csv/");
        File[] files = dir.listFiles((d, name) -> name.endsWith(".csv"));
        fs.rename(new Path(files[0].toURI()), new Path(System.getProperty("user.dir") + "/csvDirectory/newData.csv"));
        fs.delete(new Path(System.getProperty("user.dir") + "/my.csv/"), true);
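
    Note that java.io.File only lists files on the local filesystem, so this variant is for local output; against HDFS or S3 you would enumerate the part files with the Hadoop FileSystem API (e.g. fs.listStatus) instead.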
    
  • 2020-12-07 05:37

    AFAIK there is no option to save it as a single file. Spark is a distributed processing framework, and writing to a single file is not good practice; instead, each partition writes its own file under the specified path.

    We can only pass the output directory where we want the data saved. The OutputWriter creates one or more files inside that path (one per partition), with a part- file-name prefix.
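
    If a single file is acceptable despite funnelling everything through one task, a sketch of the usual workaround (the path is a placeholder): coalesce to one partition, so the output folder contains exactly one part- file, which can then be renamed as in the other answers.

        // Collapse to a single partition so the folder holds exactly one
        // part-00000 file; all records pass through a single writer task.
        rdd.coalesce(1).saveAsTextFile("path/to/output-folder")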
