Spark SQL sql("<some aggregate query>").first().getDouble(0) gives me inconsistent results

Submitted by 放肆的年华 on 2019-12-24 17:04:30

Question


I have the query below, which is supposed to compute the average of a column and return a single number:

val avgVal = hiveContext.sql("select round(avg(amount), 4) from users.payment where dt between '2018-05-09' and '2018-05-09'").first().getDouble(0)

I'm facing inconsistent behavior with this statement. It often fails with the error below, even though the same query returns non-NULL results when executed through Hive.

18/05/10 11:01:12 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException: Value at index 0 in null
java.lang.NullPointerException: Value at index 0 in null
    at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:475)
    at org.apache.spark.sql.Row$class.getDouble(Row.scala:243)
    at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:192)

The reason I use HiveContext instead of SQLContext is that the latter doesn't support some of the aggregation functions I use extensively in my code.

Could you please help me understand why this problem occurs and how to solve it?


Answer 1:


You need to split the query and the value extraction into two parts:

var result = hiveContext.sql("select round(avg(amount), 4) from users.payment where dt between '2018-05-09' and '2018-05-09'")
var first = result.first()
if (first != null && !first.isNullAt(0)) {
  var avgVal = first.getDouble(0)
  // use avgVal here, while it is still in scope
}

This avoids the NPE. The same null checks are needed when extracting values from Lists and arrays.

For insert or update queries, you should also surround the call with a try...catch block to catch runtime exceptions.
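The guard above can be sketched without a Spark cluster. The `AggRow` class below is a hypothetical stand-in for Spark's `Row` (only `isNullAt` and `getDouble` are mimicked, with the same exception message as the real class); it is an illustration of the null-check pattern, not the Spark API:

```java
import java.util.OptionalDouble;

public class NullSafeExtract {
    // Hypothetical stand-in for org.apache.spark.sql.Row: holds one Double
    // that may be null, as avg() returns NULL when no rows match the predicate.
    static class AggRow {
        private final Double value;
        AggRow(Double value) { this.value = value; }
        boolean isNullAt(int i) { return value == null; }
        double getDouble(int i) {
            if (isNullAt(i)) throw new NullPointerException("Value at index " + i + " is null");
            return value;
        }
    }

    // Null-safe extraction: test isNullAt before calling getDouble,
    // mirroring the if-guard in the answer above.
    static OptionalDouble extract(AggRow row) {
        if (row != null && !row.isNullAt(0)) {
            return OptionalDouble.of(row.getDouble(0));
        }
        return OptionalDouble.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract(new AggRow(1234.5678))); // OptionalDouble[1234.5678]
        System.out.println(extract(new AggRow(null)));      // OptionalDouble.empty
    }
}
```

Callers then decide what an absent average means (default value, skip, or log) instead of crashing on the NPE.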




Answer 2:


Let's analyze the case and the possible reasons why this exception can be thrown.

Row row = hiveContext.sql("select info, name, desc, id from users.payment where dt between '2018-05-09' and '2018-05-09'").first();

Suppose the row returned above contains:

[null, Kevin, cash, 300]

Then calling getDouble(0) will lead to java.lang.NullPointerException: Value at index 0 is null

You can try the following approach:

Row row = hiveContext.sql("select round(avg(amount), 4) from users.payment where dt between '2018-05-09' and '2018-05-09'").first();

if (!row.isNullAt(0)) {
    double d = row.getDouble(0);
} else {
    logger.error("Value at index zero is null");
}

If you check the sources, you'll see the library class does the opposite: instead of returning null, it throws when the value at that index is null (decompiled from Scala):

private static Object getAnyValAs(Row $this, int i) {
    if($this.isNullAt(i)) {
        throw new NullPointerException((new StringContext(scala.Predef..MODULE$.wrapRefArray((Object[])(new String[]{"Value at index ", " is null"})))).s(scala.Predef..MODULE$.genericWrapArray(new Object[]{BoxesRunTime.boxToInteger(i)})));
    } else {
        return $this.getAs(i);
    }
}
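A minimal reproduction of that check (a simplified sketch of the decompiled method above, not Spark itself) shows why a NULL aggregate surfaces as a NullPointerException rather than a null return:

```java
public class GetAnyValAsSketch {
    // Simplified version of Row.getAnyValAs: a null at index i is converted
    // into a thrown NullPointerException instead of being returned to the caller.
    static Object getAnyValAs(Object[] values, int i) {
        if (values[i] == null) {
            throw new NullPointerException("Value at index " + i + " is null");
        }
        return values[i];
    }

    public static void main(String[] args) {
        // Same shape as the example row [null, Kevin, cash, 300] above.
        Object[] row = {null, "Kevin", "cash", 300};
        System.out.println(getAnyValAs(row, 1)); // a non-null column is returned normally
        try {
            getAnyValAs(row, 0); // the NULL avg() column triggers the NPE
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This is why the caller, not the library, must perform the isNullAt check before extracting a primitive like double, which has no null representation.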


Source: https://stackoverflow.com/questions/50269797/spark-sql-sqlsome-aggregate-query-first-getdouble0-give-me-inconsisten
