Java Spark : Stack Overflow Error on GroupBy


Question


I am using Spark 2.3.1 with Java.

I have a Dataset that I want to group in order to compute some aggregations (say, a count() for this example). The grouping must be done according to a given list of columns.

My function is the following:

// Required imports (not shown in the original snippet):
// import java.util.ArrayList;
// import java.util.List;
// import org.apache.spark.sql.Column;
// import org.apache.spark.sql.Dataset;
// import org.apache.spark.sql.Row;
// import scala.collection.JavaConverters;
// import scala.collection.Seq;
// import static org.apache.spark.sql.functions.col;
// import static org.apache.spark.sql.functions.count;

public Dataset<Row> compute(Dataset<Row> data, List<String> columns) {

    // Wrap each column name in a Column object
    final List<Column> columns_col = new ArrayList<Column>();
    for (final String tag : columns) {
        columns_col.add(new Column(tag));
    }

    // Convert the Java List to a Scala Seq so it can be passed to groupBy
    Seq<Column> columns_seq = JavaConverters.asScalaIteratorConverter(columns_col.iterator()).asScala().toSeq();

    System.out.println("My columns : " + columns_seq.mkString(", "));
    System.out.println("Data count : " + data.count());

    final Dataset<Row> dataset_count = data.groupBy(columns_seq).agg(count(col("value")));

    System.out.println("Result count : " + dataset_count.count());

    return dataset_count;
}

And when I call it like this:

Dataset<Row> df = compute(MyDataset, Arrays.asList("field1","field2","field3","field4"));

I get a StackOverflowError on the dataset_count.count() call:

My columns : field1, field2, field3, field4
Data count : 136821
Exception in thread "main" java.lang.StackOverflowError
    at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
    at scala.collection.immutable.Stream.drop(Stream.scala:858)
    at scala.collection.immutable.Stream.drop(Stream.scala:202)
    at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:64)
    at scala.collection.immutable.Stream.apply(Stream.scala:202)
    ...

But if, in my function, I replace the line

final Dataset<Row> dataset_count = data.groupBy(columns_seq).agg(count(col("value")));

with

final Dataset<Row> dataset_count = data.groupBy("field1","field2","field3","field4").agg(count(col("value")));

I get no error, and my program computes correctly:

My columns : field1, field2, field3, field4
Data count : 136821
Result count : 74698

Where might this problem come from, and is there a solution for grouping a dataset according to a list of columns that is not known in advance?


Answer 1:


Try using this instead:

Seq<Column> columns_seq = JavaConversions.asScalaBuffer(columns_col).seq();
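For context, here is what this fix might look like inside the compute method above (a minimal sketch; the JavaConversions class comes from Scala 2.11, which Spark 2.3 is built against, and the import shown in the comment is an assumption):

// Assumed import (Scala 2.11, bundled with Spark 2.3):
// import scala.collection.JavaConversions;

// asScalaBuffer wraps the Java List in a strict Scala Buffer, instead of
// the lazy Stream produced by the iterator-based conversion above.
Seq<Column> columns_seq = JavaConversions.asScalaBuffer(columns_col).seq();

final Dataset<Row> dataset_count = data.groupBy(columns_seq).agg(count(col("value")));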



Answer 2:


Replacing

JavaConverters.asScalaIteratorConverter(columns_col.iterator()).asScala().toSeq();

with:

JavaConversions.asScalaBuffer(columns_col).seq()

did the trick for me (fully tested).
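A likely explanation, inferred from the Stream frames in the stack trace rather than stated by either answer: asScalaIteratorConverter(...).asScala().toSeq() returns a lazy scala.collection.immutable.Stream, and Spark's internal plan transformations recurse over that lazy sequence deeply enough to exhaust the stack, whereas asScalaBuffer returns a strict Buffer. If you would rather avoid the Java-to-Scala conversion altogether, Dataset.groupBy is declared with Scala's @varargs annotation, so it should also be callable from Java with a plain Column array; a hedged sketch:

// Sketch of an alternative that sidesteps the Seq conversion entirely,
// relying on the Java-friendly groupBy(Column...) varargs overload:
final Dataset<Row> dataset_count =
        data.groupBy(columns_col.toArray(new Column[0]))
            .agg(count(col("value")));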



Source: https://stackoverflow.com/questions/51667368/java-spark-stack-overflow-error-on-groupby
