How to convert Row in partition


Question


I have a scenario in Spark where I have to partition a data frame, and then each resulting partition should be processed one at a time.

List<String> data = Arrays.asList("con_dist_1", "con_dist_2", 
        "con_dist_3", "con_dist_4", "con_dist_5",
        "con_dist_6");
Dataset<Row> codes = sparkSession.createDataset(data, Encoders.STRING());
Dataset<Row> partitioned_codes = codes.repartition(col("codes"));

// I need to partition it due to a functional requirement
partitioned_codes.foreachPartition(itr -> {
    if (itr.hasNext()) {
        Row inrow = itr.next();
        System.out.println("inrow.length : " + inrow.length());
        System.out.println(inrow.toString());
        List<Object> objs = inrow.getList(0);
    }
});

I get this error:

Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to scala.collection.Seq
    at org.apache.spark.sql.Row$class.getSeq(Row.scala:283)
    at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
    at org.apache.spark.sql.Row$class.getList(Row.scala:291)
    at org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:166)

Question: How do I handle foreachPartition here, where each itr holds a group of Rows? How do I get those rows from itr?

Test 1:

inrow.length: 0
[]
inrow.length: 0
[]
2020-03-02 05:22:14,179 [Executor task launch worker for task 615] ERROR org.apache.spark.executor.Executor - Exception in task 110.0 in stage 21.0 (TID 615)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
    at org.apache.spark.sql.Row$class.getString(Row.scala:255)
    at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:166)

Output 1:

inrow.length: 0
[]
inrow.length: 0
[]
inrow.length: 1
[con_dist_1]
inrow.length: 1
[con_dist_2]
inrow.length: 1
[con_dist_5]
inrow.length: 1
[con_dist_6]
inrow.length: 1
[con_dist_4]
inrow.length: 1
[con_dist_3]

Answer 1:


All the rows of the partition are in itr, so when you call itr.next() you only get the first row. (The ClassCastException is a separate problem: getList(0) expects an array-typed column, but your column holds a plain String, so you should call getString(0) instead.) If you need to print all the rows, you can use a while loop, or you can convert the iterator to a list with something like this (I suspect this is what you wanted to get to):

partitioned_codes.foreachPartition(itr -> {
    // Wrap the Iterator in an Iterable so it can be turned into a Stream
    Iterable<Row> rowIt = () -> itr;
    List<String> objs = StreamSupport.stream(rowIt.spliterator(), false)
            .map(row -> row.getString(0)) // single String column: getString, not getList
            .collect(Collectors.toList());

    System.out.println("inrow.length: " + objs.size());
    System.out.println(objs);
});
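
The while-loop alternative mentioned above would look like this (a minimal sketch; it just prints each row of the partition as it goes):

partitioned_codes.foreachPartition(itr -> {
    while (itr.hasNext()) {
        Row inrow = itr.next();
        // Single String column, so getString(0) rather than getList(0)
        System.out.println(inrow.getString(0));
    }
});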

The example code you posted didn't compile for me, so here's the version I tested with:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

List<String> data = Arrays.asList("con_dist_1", "con_dist_2", 
        "con_dist_3", "con_dist_4", "con_dist_5",
        "con_dist_6");
// Build a proper single-column DataFrame named "codes" instead of a
// Dataset<String>, so repartition(col("codes")) can find the column.
StructType struct = new StructType()
        .add(DataTypes.createStructField("codes", DataTypes.StringType, true));
Dataset<Row> codes = sparkSession.createDataFrame(sc.parallelize(data, 2) // sc is the JavaSparkContext
                        .map(s -> RowFactory.create(s)), struct);
Dataset<Row> partitioned_codes = codes.repartition(org.apache.spark.sql.functions.col("codes"));

partitioned_codes.foreachPartition(itr -> {
    Iterable<Row> rowIt = () -> itr;
    List<String> objs = StreamSupport.stream(rowIt.spliterator(), false)
            .map(row -> row.getString(0))
            .collect(Collectors.toList());

    System.out.println("inrow.length: " + objs.size());
    System.out.println(objs);
});
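
If each partition should produce results rather than only print them, mapPartitions is the usual alternative to foreachPartition: it also receives one partition's iterator at a time, but returns a new Dataset. A minimal sketch under the same setup (the "processed_" prefix is just a placeholder transformation):

import java.util.ArrayList;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Encoders;

Dataset<String> processed = partitioned_codes.mapPartitions(
        (MapPartitionsFunction<Row, String>) it -> {
            List<String> out = new ArrayList<>();
            while (it.hasNext()) {
                out.add("processed_" + it.next().getString(0)); // placeholder per-row work
            }
            return out.iterator();
        },
        Encoders.STRING());
processed.show();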


Source: https://stackoverflow.com/questions/60485240/how-to-convert-row-in-partition
