Question
I am using the hbase-spark connector to fetch HBase data into a Spark JavaRDD<Row> (which I believe works, since I am able to print the fetched HBase data). Then I try to convert that JavaRDD<Row> to a Dataset<Row>, but it gives me an error which is described further down in the post. First, here is what my code looks like.
private static JavaRDD<Row> loadHBaseRDD() throws ParseException
{
    // form the list of row keys (ids is a class-level variable)
    List<byte[]> rowKeys = new ArrayList<byte[]>(5);
    ids.forEach(id -> rowKeys.add(Bytes.toBytes(id)));
    JavaRDD<byte[]> rdd = jsc.parallelize(rowKeys);

    // make the hbase-spark connector call
    JavaRDD<Row> resultJRDD = jhbc.bulkGet(TableName.valueOf("table1"), 2, rdd,
            new GetFunction(), new ResultFunction());
    return resultJRDD;
}
Notice that bulkGet() accepts instances of the GetFunction and ResultFunction classes. The GetFunction class has a single method which returns an instance of the Get class (from the HBase client):
public static class GetFunction implements Function<byte[], Get> {
    private static final long serialVersionUID = 1L;

    public Get call(byte[] v) throws Exception {
        return new Get(v);
    }
}
The ResultFunction class has a method which converts an instance of Result (an HBase client class) to a Row:
public static class ResultFunction implements Function<Result, Row>
{
    private static final long serialVersionUID = 1L;

    public Row call(Result result) throws Exception
    {
        List<String> values = new ArrayList<String>(); // notice this is an ArrayList; more on this later
        for (Cell cell : result.rawCells()) {
            values.add(Bytes.toString(CellUtil.cloneValue(cell)));
        }
        return RowFactory.create(values);
    }
}
When I call loadHBaseRDD() and print the returned value, it prints the values correctly:
JavaRDD<Row> hbaseJavaRDD = loadHBaseRDD();
hbaseJavaRDD.foreach(row -> {
    System.out.println(row); // this prints the rows correctly
});
This means the rows have been fetched correctly from HBase into Spark.
Now I want to convert the JavaRDD<Row> to a Dataset<Row> as explained here. So I first create a StructType:
StructType schema = //create schema
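For illustration, a minimal sketch of such a schema of string columns might look like the following (the column names here are hypothetical placeholders, since the actual schema creation is elided):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// hypothetical sketch; column names are placeholders
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("col1", DataTypes.StringType, true),
        DataTypes.createStructField("col2", DataTypes.StringType, true)
        // ...one StringType field per cell value emitted by ResultFunction
});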
Then I try converting the JavaRDD to a DataFrame:
Dataset<Row> hbaseDataFrame = sparksession1.createDataFrame(hbaseJavaRDD, schema);
hbaseDataFrame.show(false);
This throws an exception with a very long stack trace (only part of which is shown below) at the line hbaseDataFrame.show(false). Its first line is as follows:
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.util.ArrayList is not a valid external type for schema of string
It seems that, because values is of type ArrayList inside ResultFunction.call(), Spark cannot encode the resulting Row against string columns, hence the exception java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.util.ArrayList is not a valid external type for schema of string.
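Note that RowFactory.create(Object... values) takes varargs, so a List argument becomes a single column value, while a String[] is spread into one column per element. A minimal illustration of the difference:

import java.util.Arrays;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

Row fromList = RowFactory.create(Arrays.asList("a", "b", "c"));
System.out.println(fromList.size()); // 1 -- the whole List is one (non-string) column

String[] values = {"a", "b", "c"};
Row fromArray = RowFactory.create(values);
System.out.println(fromArray.size()); // 3 -- one string column per element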
There is a [similar question] on Stack Overflow whose answer says that instead of a list, one should return String[][]. Though I don't get the reasoning behind returning String[][], I modified ResultFunction to wrap the values in a String[][]:
public static class ResultFunction implements Function<Result, Row>
{
    private static final long serialVersionUID = 1L;

    public Row call(Result result) throws Exception
    {
        String[] values = new String[result.rawCells().length];
        String[][] valuesWrapped = new String[1][];
        for (int i = 0; i < result.rawCells().length; i++)
        {
            values[i] = Bytes.toString(CellUtil.cloneValue(result.rawCells()[i]));
        }
        valuesWrapped[0] = values;
        return RowFactory.create(valuesWrapped);
    }
}
It gives the exception below at the same line, hbaseDataFrame.show(false):
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: [[Ljava.lang.String; is not a valid external type for schema of string
Finally, I modified the ResultFunction class again so that the values variable is of type String[]:
public static class ResultFunction implements Function<Result, Row>
{
    private static final long serialVersionUID = 1L;

    public Row call(Result result) throws Exception
    {
        String[] values = new String[result.rawCells().length];
        for (int i = 0; i < result.rawCells().length; i++)
        {
            values[i] = Bytes.toString(CellUtil.cloneValue(result.rawCells()[i]));
        }
        // the String[] is spread across the varargs: one column per cell value
        return RowFactory.create(values);
    }
}
And this gives me an exception with a big stack trace starting with:
java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 14
So what might be going wrong here? And how am I supposed to do this?
Answer 1:
The last approach (returning a Row built from String[] values) was correct. The issue was an ill-formed schema: I somehow ended up with one more column in the schema than is present in the data. (The culprit was an extra space character in the schema string, which lists the column names separated by single spaces; the extra space was creating an extra column.)
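A minimal sketch of how that kind of bug can arise, assuming the schema is built by splitting a space-separated column string (the actual schema code is not shown in the question):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

String columns = "col1 col2  col3"; // hypothetical; note the accidental double space

StructType schema = new StructType();
for (String name : columns.split(" ")) {
    if (name.isEmpty()) continue; // skip empty tokens so no phantom column is created
    schema = schema.add(name, DataTypes.StringType, true);
}
// Without the isEmpty() check, split(" ") yields ["col1", "col2", "", "col3"],
// so the schema gets four fields while each Row carries only three values.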
Source: https://stackoverflow.com/questions/50623475/in-apache-spark-converting-javarddrow-to-datasetrow-gives-exception-arrayl