SPARK SQL replacement for mysql GROUP_CONCAT aggregate function

问题

I have a table of two string type columns (username, friend) and for each username, I want to collect all of it\'s friends on one row, concatenated as strings (\'username1\', \'friends1, friends2, friends3\'). I know MySql does this by GROUP_CONCAT, is there any way to do this with SPARK SQL?

Thanks

回答1:

Before you proceed: This operations is yet another another groupByKey. While it has multiple legitimate applications it is relatively expensive so be sure to use it only when required.

Not exactly concise or efficient solution but you can use UserDefinedAggregateFunction introduced in Spark 1.5.0:

object GroupConcat extends UserDefinedAggregateFunction {
    def inputSchema = new StructType().add("x", StringType)
    def bufferSchema = new StructType().add("buff", ArrayType(StringType))
    def dataType = StringType
    def deterministic = true 

    def initialize(buffer: MutableAggregationBuffer) = {
      buffer.update(0, ArrayBuffer.empty[String])
    }

    def update(buffer: MutableAggregationBuffer, input: Row) = {
      if (!input.isNullAt(0)) 
        buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
    }

    def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
      buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
    }

    def evaluate(buffer: Row) = UTF8String.fromString(
      buffer.getSeq[String](0).mkString(","))
}

Example usage:

val df = sc.parallelize(Seq(
  ("username1", "friend1"),
  ("username1", "friend2"),
  ("username2", "friend1"),
  ("username2", "friend3")
)).toDF("username", "friend")

df.groupBy($"username").agg(GroupConcat($"friend")).show

## +---------+---------------+
## | username|        friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+

You can also create a Python wrapper as shown in Spark: How to map Python with Scala or Java User Defined Functions?

In practice it can be faster to extract RDD, groupByKey, mkString and rebuild DataFrame.

You can get a similar effect by combining collect_list function (Spark >= 1.6.0) with concat_ws:

import org.apache.spark.sql.functions.{collect_list, udf, lit}

df.groupBy($"username")
  .agg(concat_ws(",", collect_list($"friend")).alias("friends"))

回答2:

You can try the collect_list function

sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A

Or you can regieter a UDF something like

sqlContext.udf.register("myzip",(a:Long,b:Long)=>(a+","+b))

and you can use this function in the query

sqlConttext.sql("select A,collect_list(myzip(B,C)) from tbl group by A")

回答3:

Here is a function you can use in PySpark:

import pyspark.sql.functions as F

def group_concat(col, distinct=False, sep=','):
    if distinct:
        collect = F.collect_set(col.cast(StringType()))
    else:
        collect = F.collect_list(col.cast(StringType()))
    return F.concat_ws(sep, collect)


table.groupby('username').agg(F.group_concat('friends').alias('friends'))

In SQL:

select username, concat_ws(',', collect_list(friends)) as friends
from table
group by username

回答4:

One way to do it with pyspark < 1.6, which unfortunately doesn't support user-defined aggregate function:

byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)

and if you want to make it a dataframe again:

sqlContext.createDataFrame(byUsername, ["username", "friends"])

As of 1.6, you can use collect_list and then join the created list:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType
join_ = F.udf(lambda x: ", ".join(x), StringType())
df.groupBy("username").agg(join_(F.collect_list("friend").alias("friends"))

回答5:

Language: Scala Spark version: 1.5.2

I had the same issue and also tried to resolve it using udfs but, unfortunately, this has led to more problems later in the code due to type inconsistencies. I was able to work my way around this by first converting the DF to an RDD then grouping by and manipulating the data in the desired way and then converting the RDD back to a DF as follows:

val df = sc
     .parallelize(Seq(
        ("username1", "friend1"),
        ("username1", "friend2"),
        ("username2", "friend1"),
        ("username2", "friend3")))
     .toDF("username", "friend")

+---------+-------+
| username| friend|
+---------+-------+
|username1|friend1|
|username1|friend2|
|username2|friend1|
|username2|friend3|
+---------+-------+

val dfGRPD = df.map(Row => (Row(0), Row(1)))
     .groupByKey()
     .map{ case(username:String, groupOfFriends:Iterable[String]) => (username, groupOfFriends.mkString(","))}
     .toDF("username", "groupOfFriends")

+---------+---------------+
| username| groupOfFriends|
+---------+---------------+
|username1|friend2,friend1|
|username2|friend3,friend1|
+---------+---------------+

回答6:

Below python-based code that achieves group_concat functionality.

Input Data:

Cust_No,Cust_Cars

1, Toyota

2, BMW

1, Audi

2, Hyundai

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import pyspark.sql.functions as F

spark = SparkSession.builder.master('yarn').getOrCreate()

# Udf to join all list elements with "|"
def combine_cars(car_list,sep='|'):
  collect = sep.join(car_list)
  return collect

test_udf = udf(combine_cars,StringType())
car_list_per_customer.groupBy("Cust_No").agg(F.collect_list("Cust_Cars").alias("car_list")).select("Cust_No",test_udf("car_list").alias("Final_List")).show(20,False)

Output Data: Cust_No, Final_List

1, Toyota|Audi

2, BMW|Hyundai

来源：https://stackoverflow.com/questions/31640729/spark-sql-replacement-for-mysql-group-concat-aggregate-function

标签

apache-spark

aggregate-functions

apache-spark-sql