How to split a list into multiple columns in PySpark?

余生分开走 2020-11-27 19:21

I have:

key   value
a    [1,2,3]
b    [2,3,4]

I want:

key value1 value2 value3
a     1      2      3
b     2      3      4
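
For a fixed, known number of elements, one common way to get this result (a minimal sketch, assuming the value column is stored as an array type; the data and column names are just the example above) is to index the array column directly, which is also the idea the answers below build on:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the example input with an array-typed 'value' column.
df = spark.createDataFrame([('a', [1, 2, 3]), ('b', [2, 3, 4])], ['key', 'value'])

# Pull each element of the array out into its own column.
df.select('key',
          F.col('value')[0].alias('value1'),
          F.col('value')[1].alias('value2'),
          F.col('value')[2].alias('value3')).show()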
         


        
3 Answers
  •  粉色の甜心
    2020-11-27 20:01

    I'd like to add the case of sized lists (arrays) to pault's answer.

    If the column contains medium-sized (or even large) arrays, it is still possible to split them into columns.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.getOrCreate()

    # Define a schema to create a DataFrame with an array-typed column.
    mySchema = StructType([StructField("V1", StringType(), True),
                           StructField("V2", ArrayType(IntegerType(), True))])

    df = spark.createDataFrame([['A', [1, 2, 3, 4, 5, 6, 7]],
                                ['B', [8, 7, 6, 5, 4, 3, 2]]], schema=mySchema)

    # Split the array into columns using expr() in a list comprehension.
    arr_size = 7
    df = df.select(['V1', 'V2'] + [expr('V2[' + str(x) + ']') for x in range(arr_size)])

    # It is possible to define new column names.
    new_colnames = ['V1', 'V2'] + ['val_' + str(i) for i in range(arr_size)]
    df = df.toDF(*new_colnames)
    

    The result is:

    df.show(truncate=False)
    
    +---+---------------------+-----+-----+-----+-----+-----+-----+-----+
    |V1 |V2                   |val_0|val_1|val_2|val_3|val_4|val_5|val_6|
    +---+---------------------+-----+-----+-----+-----+-----+-----+-----+
    |A  |[1, 2, 3, 4, 5, 6, 7]|1    |2    |3    |4    |5    |6    |7    |
    |B  |[8, 7, 6, 5, 4, 3, 2]|8    |7    |6    |5    |4    |3    |2    |
    +---+---------------------+-----+-----+-----+-----+-----+-----+-----+
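
    The hard-coded arr_size = 7 assumes every array has exactly that many elements. If the length is not known in advance, one option (a sketch on top of the code above, not part of the original answer) is to compute the maximum array size from the data first and then build the same select:

    from pyspark.sql import functions as F

    # Derive the number of output columns from the data instead of hard-coding it.
    # Note: the aggregation triggers a Spark job; 'df' is assumed to still hold the
    # array column 'V2', which the select above keeps.
    arr_size = df.agg(F.max(F.size('V2'))).collect()[0][0]

    df = df.select(['V1', 'V2'] + [F.col('V2')[i] for i in range(arr_size)])
    df = df.toDF('V1', 'V2', *['val_' + str(i) for i in range(arr_size)])

    Rows whose arrays are shorter than the maximum simply get null in the extra columns.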
    
