Is there a Spark built-in that flattens nested arrays?

Submitted by 天涯浪子 on 2019-12-25 02:29:01

Question


I have a DataFrame field that is a Seq[Seq[String]]. I built a UDF to transform that column into a column of Seq[String]; basically, a UDF for Scala's flatten function.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def combineSentences(inCol: String, outCol: String): DataFrame => DataFrame = {

    // Treat a NULL column value as an empty result rather than failing.
    def flatfunc(seqOfSeq: Seq[Seq[String]]): Seq[String] = seqOfSeq match {
        case null => Seq.empty[String]
        case _ => seqOfSeq.flatten
    }
    (df: DataFrame) => df.withColumn(outCol, udf(flatfunc _).apply(col(inCol)))
}

My use case is strings, but obviously, this could be generic. You can use this function in a chain of DataFrame transforms like:

df.transform(combineSentences(inCol, outCol))
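A minimal end-to-end run of that pipeline might look like this (hypothetical data and column names, assuming the combineSentences definition above is in scope):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  Seq(Seq("hello", "world"), Seq("foo")),
  Seq(Seq("bar"))
).toDF("sentences")

// Each Seq[Seq[String]] row collapses into a single Seq[String].
df.transform(combineSentences("sentences", "flat")).show(false)
```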

Is there a Spark built-in function that does the same thing? I have not been able to find one.


Answer 1:


There is a similar function (since Spark 2.4) and it is called flatten:

import org.apache.spark.sql.functions.flatten

From the official documentation:

def flatten(e: Column): Column

Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.
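That one-level-only behavior can be sketched as follows (hypothetical column names; requires Spark 2.4+):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.flatten

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A two-level nested array flattens completely.
val df = Seq(Seq(Seq("a", "b"), Seq("c"))).toDF("nested")
df.select(flatten($"nested")).show()  // one row: [a, b, c]

// A three-level array loses only one level of nesting.
val deep = Seq(Seq(Seq(Seq("a"), Seq("b")))).toDF("nested")
deep.select(flatten($"nested")).show()  // one row: [[a], [b]]
```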

Since: 2.4.0

To get an exact equivalent, you will have to use coalesce to replace a NULL input with an empty array, since the UDF returns Seq.empty in that case while flatten returns NULL.
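A sketch of that null-safe equivalent, assuming a string element type (the cast is needed because a bare empty array() would not infer array&lt;string&gt;):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, coalesce, col, flatten}

// Null-safe built-in flatten: coalesce substitutes an empty array
// when the input column (and hence flatten's result) is NULL,
// matching the UDF's behavior.
def combineSentences(inCol: String, outCol: String): DataFrame => DataFrame =
  df => df.withColumn(
    outCol,
    coalesce(flatten(col(inCol)), array().cast("array<string>"))
  )
```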



Source: https://stackoverflow.com/questions/54271283/is-there-a-spark-built-in-that-flattens-nested-arrays
