How to merge two columns of a `Dataframe` in Spark into one 2-Tuple?


I have a Spark DataFrame df with five columns. I want to add another column with its values being the tuple of the first and second columns.

4 Answers
  • 2020-12-14 22:46

    If you want to merge two DataFrame columns into one column, just use array:

    import org.apache.spark.sql.functions.array
    df.withColumn("NewColumn", array("columnA", "columnB"))
    
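    Note that array yields an ArrayType column (the inputs are coerced to a single element type), not a Scala tuple. A minimal sketch of the result, assuming two Int columns named "columnA" and "columnB":

    import org.apache.spark.sql.functions.array
    import spark.implicits._  // spark is the SparkSession; provides toDF on local Seqs

    // assumed sample data, for illustration only
    val df = Seq((1, 2), (3, 4)).toDF("columnA", "columnB")
    df.withColumn("NewColumn", array("columnA", "columnB")).printSchema()
    // NewColumn: array (element: integer)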
  • 2020-12-14 23:00

    You can merge multiple DataFrame columns into one using array:

    // $"*" will capture all existing columns
    df.select($"*", array($"col1", $"col2").as("newCol")) 
    
  • 2020-12-14 23:04

    You can use a user-defined function (udf) to achieve what you want.

    UDF definition

    object TupleUDFs {
      import org.apache.spark.sql.functions.udf      
      // type tag is required, as we have a generic udf
      import scala.reflect.runtime.universe.{TypeTag, typeTag}
    
      def toTuple2[S: TypeTag, T: TypeTag] = 
        udf[(S, T), S, T]((x: S, y: T) => (x, y))
    }
    

    Usage

    df.withColumn(
      "tuple_col", TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
    )
    

    assuming "a" and "b" are the columns of type Int you want to put in a tuple.

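    Since the tuple comes back as a struct column with fields _1 and _2, its members can be read out again with nested-field selection, e.g. (a sketch under the same assumptions):

    df.withColumn("tuple_col", TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b")))
      .select("tuple_col._1", "tuple_col._2")  // the Tuple2 is stored as a struct
      .show()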
  • 2020-12-14 23:10

    You can use the struct function, which creates a tuple-like struct of the provided columns:

    import org.apache.spark.sql.functions.struct
    import spark.implicits._  // spark is the SparkSession; needed for toDF on a local Seq

    val df = Seq((1,2), (3,4), (5,3)).toDF("a", "b")
    df.withColumn("NewColumn", struct(df("a"), df("b"))).show(false)
    
    +---+---+---------+
    |a  |b  |NewColumn|
    +---+---+---------+
    |1  |2  |[1,2]    |
    |3  |4  |[3,4]    |
    |5  |3  |[5,3]    |
    +---+---+---------+
    
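    Because struct keeps the source column names as its field names, the members can be selected back out of the new column, for example:

    // access the struct's fields by their original names
    df.withColumn("NewColumn", struct(df("a"), df("b")))
      .select("NewColumn.a", "NewColumn.b")
      .show()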