How to merge two columns of a `DataFrame` in Spark into one 2-tuple?

Backend · Unresolved · 4 answers · 1967 views
野趣味 2020-12-14 22:14

I have a Spark DataFrame `df` with five columns. I want to add another column whose values are 2-tuples built from the first and second columns.

4 Answers
  •  旧巷少年郎
    2020-12-14 23:04

    You can use a user-defined function (UDF) to achieve what you want.

    UDF definition

    object TupleUDFs {
      import org.apache.spark.sql.functions.udf      
      // type tag is required, as we have a generic udf
      import scala.reflect.runtime.universe.{TypeTag, typeTag}
    
      def toTuple2[S: TypeTag, T: TypeTag] = 
        udf[(S, T), S, T]((x: S, y: T) => (x, y))
    }
    

    Usage

    df.withColumn(
      "tuple_col", TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
    )
    

    assuming "a" and "b" are the columns of type Int you want to put in a tuple.
