Handle string to array conversion in a PySpark DataFrame

难免孤独 · 2020-12-21 13:42

I have a CSV file which, when read into a Spark DataFrame, shows the following in `printSchema`:

-- list_values: string (nullable = true)

The values in the column look like `[[[167, 109, 80]]]`, and I want to convert this string column into an array column.

1 Answer
  • 2020-12-21 14:14

    Suppose your DataFrame was the following:

    df.show()
    #+----+------------------+
    #|col1|              col2|
    #+----+------------------+
    #|   a|[[[167, 109, 80]]]|
    #+----+------------------+
    
    df.printSchema()
    #root
    # |-- col1: string (nullable = true)
    # |-- col2: string (nullable = true)
    

    You could use pyspark.sql.functions.regexp_replace to remove the leading and trailing square brackets. Once that's done, you can split the resulting string on ", ":

    from pyspark.sql.functions import split, regexp_replace
    
    df2 = df.withColumn(
        "col3",
        split(regexp_replace("col2", r"(^\[\[\[)|(\]\]\]$)", ""), ", ")
    )
    df2.show()
    
    #+----+------------------+--------------+
    #|col1|              col2|          col3|
    #+----+------------------+--------------+
    #|   a|[[[167, 109, 80]]]|[167, 109, 80]|
    #+----+------------------+--------------+
    
    df2.printSchema()
    #root
    # |-- col1: string (nullable = true)
    # |-- col2: string (nullable = true)
    # |-- col3: array (nullable = true)
    # |    |-- element: string (containsNull = true)
    
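    To see what the Spark transform above is doing, the same cleanup can be sketched in plain Python (no Spark session needed); `re.sub` plays the role of `regexp_replace` and `str.split` the role of `split`:

    ```python
    import re

    # Sample value from the question's column
    raw = "[[[167, 109, 80]]]"

    # Strip the leading "[[[" and trailing "]]]" with the same regex
    cleaned = re.sub(r"(^\[\[\[)|(\]\]\]$)", "", raw)
    print(cleaned)  # 167, 109, 80

    # Split on ", " -- the elements are still strings, just like Spark's split
    parts = cleaned.split(", ")
    print(parts)  # ['167', '109', '80']
    ```

    This mirrors why the resulting Spark column is an array of *strings* at this point: the split itself never parses the numbers.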

    If you wanted the column as an array of integers, you could use cast:

    from pyspark.sql.functions import col
    df2 = df2.withColumn("col3", col("col3").cast("array<int>"))
    df2.printSchema()
    #root
    # |-- col1: string (nullable = true)
    # |-- col2: string (nullable = true)
    # |-- col3: array (nullable = true)
    # |    |-- element: integer (containsNull = true)
    
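    As an aside not covered in the answer: strings like `[[[167, 109, 80]]]` happen to be valid JSON, so another option would be to parse them directly — in Spark that would be `pyspark.sql.functions.from_json` with an `array<array<array<int>>>` schema, which preserves the full nesting and skips the cast. A plain-Python sketch of the idea:

    ```python
    import json

    raw = "[[[167, 109, 80]]]"

    # The string is valid JSON, so it parses straight into nested lists
    nested = json.loads(raw)
    print(nested)  # [[[167, 109, 80]]]

    # Unwrap the two outer levels to get the inner list of ints
    inner = nested[0][0]
    print(inner)  # [167, 109, 80]
    ```

    Whether this is preferable depends on the data: the regex approach tolerates strings that are not strictly valid JSON, while JSON parsing keeps the values as integers from the start.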