Is there a Spark built-in that flattens nested arrays?

Submitted by 天涯浪子 on 2019-12-25 02:29:01

Question


I have a DataFrame field that is a Seq[Seq[String]]. I built a UDF to transform that column into a column of Seq[String]; basically, a UDF for Scala's flatten function.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def combineSentences(inCol: String, outCol: String): DataFrame => DataFrame = {

    // Treat a NULL column value as an empty result rather than failing.
    def flatfunc(seqOfSeq: Seq[Seq[String]]): Seq[String] = seqOfSeq match {
        case null => Seq.empty[String]
        case _ => seqOfSeq.flatten
    }
    (df: DataFrame) => df.withColumn(outCol, udf(flatfunc _).apply(col(inCol)))
}

My use case is strings, but obviously, this could be generic. You can use this function in a chain of DataFrame transforms like:

df.transform(combineSentences(inCol, outCol))
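A minimal end-to-end run of that pipeline might look like this (hypothetical data and column names, assuming the combineSentences definition above is in scope):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  Seq(Seq("hello", "world"), Seq("foo")),
  Seq(Seq("bar"))
).toDF("sentences")

// Each Seq[Seq[String]] row collapses into a single Seq[String].
df.transform(combineSentences("sentences", "flat")).show(false)
```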

Is there a Spark built-in function that does the same thing? I have not been able to find one.


Answer 1:


There is a similar function (since Spark 2.4) and it is called flatten:

import org.apache.spark.sql.functions.flatten

From the official documentation:

def flatten(e: Column): Column

Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.
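That one-level-only behavior can be sketched as follows (hypothetical column names; requires Spark 2.4+):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.flatten

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A two-level nested array flattens completely.
val df = Seq(Seq(Seq("a", "b"), Seq("c"))).toDF("nested")
df.select(flatten($"nested")).show()  // one row: [a, b, c]

// A three-level array loses only one level of nesting.
val deep = Seq(Seq(Seq(Seq("a"), Seq("b")))).toDF("nested")
deep.select(flatten($"nested")).show()  // one row: [[a], [b]]
```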

Since: 2.4.0

To get an exact equivalent, you will have to use coalesce to replace a NULL input with an empty array, since the UDF returns Seq.empty in that case while flatten returns NULL.
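A sketch of that null-safe equivalent, assuming a string element type (the cast is needed because a bare empty array() would not infer array&lt;string&gt;):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, coalesce, col, flatten}

// Null-safe built-in flatten: coalesce substitutes an empty array
// when the input column (and hence flatten's result) is NULL,
// matching the UDF's behavior.
def combineSentences(inCol: String, outCol: String): DataFrame => DataFrame =
  df => df.withColumn(
    outCol,
    coalesce(flatten(col(inCol)), array().cast("array<string>"))
  )
```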



Source: https://stackoverflow.com/questions/54271283/is-there-a-spark-built-in-that-flattens-nested-arrays
