Write each row of a Spark DataFrame as a separate file

Backend · open · 2 answers · 1880 views
时光说笑 2020-12-18 14:12

I have a Spark DataFrame with a single column, where each row is a long string (actually an XML file). I want to go through the DataFrame and save the string from each row as a

2 Answers
  • 2020-12-18 14:56

    When saving a DataFrame, Spark creates one output file per partition. Hence, one way to get a single row per file is to first repartition the data into as many partitions as there are rows, so that each partition holds exactly one row.

    There is a library on GitHub for reading and writing XML files with Spark. However, the DataFrame needs to have a special structure to produce correct XML. In this case, since everything is already a string in a single column, the easiest way to save is probably as CSV.

    The repartition and saving can be done as follows:

    # One partition per row => one output file per row
    rows = df.count()
    df.repartition(rows).write.csv('save-dir')
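
    Note that Spark names the resulting files `part-00000`, `part-00001`, and so on, one per partition. If friendlier names are wanted, a plain-Python post-processing step can rename them; the sketch below assumes a hypothetical helper `rename_part_files` run on the driver against a locally accessible `save-dir`:

    ```python
    # Sketch: rename Spark part-files to row-0.xml, row-1.xml, ...
    # Assumes the output directory is on a local filesystem; for HDFS
    # you would use the Hadoop FileSystem API instead.
    import glob
    import os

    def rename_part_files(save_dir, prefix="row", ext=".xml"):
        """Rename part-* files in save_dir to <prefix>-<i><ext>, in sorted order."""
        part_files = sorted(glob.glob(os.path.join(save_dir, "part-*")))
        renamed = []
        for i, path in enumerate(part_files):
            new_path = os.path.join(save_dir, f"{prefix}-{i}{ext}")
            os.rename(path, new_path)
            renamed.append(new_path)
        return renamed
    ```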
    
  • 2020-12-18 14:56

    I would do it this way using Java and the Hadoop FileSystem API. You could write similar code in Python.

    List<String> strings = Arrays.asList("file1", "file2", "file3");
    JavaRDD<String> stringRdd = new JavaSparkContext().parallelize(strings);
    // Collect to the driver, then write each string to its own file
    stringRdd.collect().forEach(x -> {
        try {
            Path outputPath = new Path(x);
            FileSystem fs = FileSystem.get(new Configuration());
            try (OutputStream os = fs.create(outputPath)) {
                os.write(x.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    });
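
    A Python analogue of the collect-and-write idea above might look like the following. This is a sketch under the assumption that the data fits in driver memory; the helper name `write_rows_as_files` and the `row-<i>.xml` naming are hypothetical, not from the original answer:

    ```python
    # Sketch: collect the single string column to the driver and write
    # each row to its own local file. Only viable for small datasets,
    # since collect() pulls everything into driver memory.
    import os

    def write_rows_as_files(rows, out_dir):
        """Write each string in `rows` to its own numbered .xml file."""
        os.makedirs(out_dir, exist_ok=True)
        paths = []
        for i, content in enumerate(rows):
            path = os.path.join(out_dir, f"row-{i}.xml")
            with open(path, "w", encoding="utf-8") as f:
                f.write(content)
            paths.append(path)
        return paths

    # With Spark, `rows` would come from something like:
    # rows = [r[0] for r in df.collect()]
    ```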
    