According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get it to work. I'm running
Spark 2.0+:
SPARK-10605 introduced native collect_list and collect_set implementations. A SparkSession with Hive support or a HiveContext is no longer required.
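For example, a minimal sketch (the toy data and the column names k and v are made up for illustration) works with a plain SparkSession:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, collect_set}

val spark = SparkSession.builder
  .master("local")
  .appName("collect-demo")
  .getOrCreate()  // no enableHiveSupport() needed in 2.0+

import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("a", 2), ("b", 3)).toDF("k", "v")

df.groupBy("k")
  .agg(collect_list("v").as("vs"), collect_set("v").as("distinct_vs"))
  .show()
// collect_list keeps duplicates, collect_set drops them.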
Spark 2.0-SNAPSHOT (before 2016-05-03):
You have to enable Hive support for a given SparkSession:
In Scala:
val spark = SparkSession.builder
.master("local")
.appName("testing")
.enableHiveSupport() // <- enable Hive support.
.getOrCreate()
In Python:
spark = (SparkSession.builder
.enableHiveSupport()
.getOrCreate())
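Once Hive support is enabled, the functions can also be called from SQL. A sketch under those assumptions (the temporary table name t and the columns are invented; registerTempTable was the method available before the createOrReplaceTempView rename):

import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")
df.registerTempTable("t")

spark.sql("SELECT k, collect_list(v) AS vs, collect_set(v) AS distinct_vs FROM t GROUP BY k").show()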
Spark < 2.0:
To be able to use Hive UDFs (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) you have to use Spark built with Hive support (this is already covered when you use pre-built binaries, which seems to be the case here) and initialize your SQLContext as a HiveContext.
In Scala:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sqlContext: SQLContext = new HiveContext(sc)
In Python:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
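A usage sketch in Scala (sc is an existing SparkContext; the table and column names are invented for illustration):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("k", "v")
df.registerTempTable("t")

// With a HiveContext, collect_list / collect_set are resolved as Hive UDAFs in SQL.
sqlContext.sql("SELECT k, collect_list(v) AS vs, collect_set(v) AS distinct_vs FROM t GROUP BY k").show()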