Question
I am trying to use rowNumber in Spark DataFrames. My queries work as expected in the Spark shell, but when I write them out in Eclipse and compile a jar, I get an error:
16/03/23 05:52:43 ERROR ApplicationMaster: User class threw exception:org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
My queries:
import org.apache.spark.sql.functions.{rowNumber, max, broadcast}
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id").orderBy($"value".desc)
val dfTop = df.withColumn("rn", rowNumber.over(w)).where($"rn" <= 3).drop("rn")
I am not using a HiveContext when I run the queries in the Spark shell, so I am not sure why it returns an error when I run the same code as a jar file. I am running the scripts on Spark 1.6.0, if that helps. Has anyone faced a similar issue?
Answer 1:
I have already answered a similar question before. The error message says it all: with Spark < 2.x, you need a HiveContext in your application jar; there is no way around it.
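Here is a minimal sketch of what the driver of such an application jar could look like on Spark 1.6 (the object name and sample data are made up for illustration; remember to also add the spark-hive artifact to your build, since HiveContext ships there, not in spark-sql):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

object WindowExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("window-example"))
    // A HiveContext, not a plain SQLContext, is what lets row_number resolve in 1.x
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((1, 10.0), (1, 20.0), (2, 5.0))).toDF("id", "value")
    val w = Window.partitionBy($"id").orderBy($"value".desc)
    val dfTop = df.withColumn("rn", rowNumber.over(w)).where($"rn" <= 3).drop("rn")
    dfTop.show()
  }
}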
You can read further about the difference between SQLContext and HiveContext here.
Spark SQL has a SQLContext and a HiveContext. HiveContext is a superset of SQLContext, and the Spark community suggests using the HiveContext. When you run spark-shell, which is your interactive driver application, it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext allows you to execute SQL queries as well as Hive commands.
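For example, with the sqlContext from the shell you can run the same window function straight from SQL (a sketch; the DataFrame df and the temp table name df_table are made up):
df.registerTempTable("df_table")
sqlContext.sql("SELECT id, value, row_number() OVER (PARTITION BY id ORDER BY value DESC) AS rn FROM df_table").show()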
You can check this inside your spark-shell:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]
res0: Boolean = true
scala> sqlContext.isInstanceOf[org.apache.spark.sql.SQLContext]
res1: Boolean = true
scala> sqlContext.getClass.getName
res2: String = org.apache.spark.sql.hive.HiveContext
By inheritance, a HiveContext is actually an SQLContext, but not the other way around. You can check the source code if you are more interested in knowing how HiveContext inherits from SQLContext.
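If you only want to confirm the subtype relationship without reading the source, reflection from the shell works too (a sketch of what a 1.6 session would print):
scala> classOf[org.apache.spark.sql.hive.HiveContext].getSuperclass.getName
res3: String = org.apache.spark.sql.SQLContext

scala> classOf[org.apache.spark.sql.SQLContext].isAssignableFrom(classOf[org.apache.spark.sql.hive.HiveContext])
res4: Boolean = true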
Since Spark 2.0, you just need to create a SparkSession (as the single entry point), which removes the HiveContext/SQLContext confusion.
Answer 2:
For Spark 2.0, it is recommended to use SparkSession as the single entry point. It eliminates the HiveContext/SQLContext confusion.
import org.apache.spark.sql.SparkSession
val session = SparkSession.builder
  .master("local")
  .appName("application name")
  .getOrCreate()
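With the session in hand, the top-N query from the question runs without any Hive dependency (a sketch; note that the deprecated rowNumber function from 1.x is called row_number in 2.x):
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
import session.implicits._

val df = Seq((1, 10.0), (1, 20.0), (2, 5.0)).toDF("id", "value")
val w = Window.partitionBy($"id").orderBy($"value".desc)
val dfTop = df.withColumn("rn", row_number().over(w)).where($"rn" <= 3).drop("rn")
dfTop.show()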
Check out this Databricks article for how to use it.
Source: https://stackoverflow.com/questions/36171349/using-windowing-functions-in-spark