How to merge two columns of a `DataFrame` in Spark into one 2-tuple?

Backend · Unresolved · 4 answers · 1967 views
野趣味 2020-12-14 22:14

I have a Spark DataFrame `df` with five columns. I want to add another column whose values are 2-tuples built from the first and second columns.

4 Answers
  •  旧巷少年郎
    2020-12-14 23:04

    You can use a user-defined function (UDF) to achieve what you want.

    UDF definition

    object TupleUDFs {
      import org.apache.spark.sql.functions.udf      
      // type tag is required, as we have a generic udf
      import scala.reflect.runtime.universe.{TypeTag, typeTag}
    
      def toTuple2[S: TypeTag, T: TypeTag] = 
        udf[(S, T), S, T]((x: S, y: T) => (x, y))
    }
    

    Usage

    df.withColumn(
      "tuple_col", TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
    )
    

    assuming "a" and "b" are the columns of type Int you want to put in a tuple.
