How to zip two array columns in Spark SQL

Asked by 南方客, 2020-11-30 14:30

I have a Pandas dataframe. I tried to join two columns containing string values into a list first, and then, using zip, I joined each element of the list with '_'. My d
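The pandas approach being described presumably looks something like the following (a sketch; the column names and the ", " separator are assumptions, not taken from the question):

    import pandas as pd

    # Hypothetical reconstruction of the pandas logic: split both string
    # columns into lists, zip them element-wise, and join each pair with '_'
    pdf = pd.DataFrame({"column_1": ["abc, def, ghi"], "column_2": ["1.0, 2.0, 3.0"]})
    pdf["zipped"] = [
        ["_".join(pair) for pair in zip(a.split(", "), b.split(", "))]
        for a, b in zip(pdf["column_1"], pdf["column_2"])
    ]
    print(pdf["zipped"][0])  # ['abc_1.0', 'def_2.0', 'ghi_3.0']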

3 Answers
  •  离开以前
    2020-11-30 15:09

    A Spark SQL equivalent of Python's zip would be pyspark.sql.functions.arrays_zip:

    pyspark.sql.functions.arrays_zip(*cols)

    Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
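    For intuition, this mirrors Python's built-in zip, except that each pair comes back as a struct rather than a tuple:

    # Python's zip for comparison: arrays_zip is the column-wise analogue
    list(zip(["abc", "def", "ghi"], ["1.0", "2.0", "3.0"]))
    # [('abc', '1.0'), ('def', '2.0'), ('ghi', '3.0')]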

    So if you already have two arrays:

    from pyspark.sql.functions import split

    # Split the comma-separated strings into arrays
    # (raw strings avoid invalid-escape warnings in the regex)
    df = (spark
        .createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
        .toDF("column_1", "column_2")
        .withColumn("column_1", split("column_1", r"\s*,\s*"))
        .withColumn("column_2", split("column_2", r"\s*,\s*")))
    

    You can just apply it to the result:

    from pyspark.sql.functions import arrays_zip
    
    df_zipped = df.withColumn(
      "zipped", arrays_zip("column_1", "column_2")
    )
    
    df_zipped.select("zipped").show(truncate=False)
    
    +------------------------------------+
    |zipped                              |
    +------------------------------------+
    |[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
    +------------------------------------+
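    Since each element of zipped is a struct, you can also index into the array and access fields by name (a usage sketch):

    from pyspark.sql.functions import col

    # Grab the first pair and pull out its fields
    df_zipped.select(
        col("zipped")[0]["column_1"].alias("first_left"),
        col("zipped")[0]["column_2"].alias("first_right")
    ).show()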
    

    Now, to combine the results, you can use transform (see "How to use transform higher-order function?" and "TypeError: Column is not iterable - How to iterate over ArrayType()?"):

    from pyspark.sql.functions import expr

    df_zipped_concat = df_zipped.withColumn(
        "zipped_concat",
        expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
    )
    
    df_zipped_concat.select("zipped_concat").show(truncate=False)
    
    +---------------------------+
    |zipped_concat              |
    +---------------------------+
    |[abc_1.0, def_2.0, ghi_3.0]|
    +---------------------------+
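    If the intermediate zipped column is not needed, both steps also compose into a single expression:

    # Same result in one pass: zip and transform inside one SQL expression
    df.selectExpr(
        "transform(arrays_zip(column_1, column_2), "
        "x -> concat_ws('_', x.column_1, x.column_2)) AS zipped_concat"
    ).show(truncate=False)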
    

    Note:

    The higher-order function transform and arrays_zip were introduced in Apache Spark 2.4.
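    On earlier versions, one possible fallback is a Python UDF (a sketch, with the usual serialization overhead of UDFs):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    # Fallback for Spark < 2.4: zip and join the two arrays in Python
    zip_concat = udf(
        lambda xs, ys: ["_".join(p) for p in zip(xs, ys)],
        ArrayType(StringType())
    )

    df.withColumn("zipped_concat", zip_concat("column_1", "column_2"))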
