Get the first N elements from a DataFrame ArrayType column in PySpark

逝去的感伤 2020-12-09 20:26

I have a Spark DataFrame with the following rows:

1   |   [a, b, c]
2   |   [d, e, f]
3   |   [g, h, i]

Now I want to keep only the first 2 elements of each array.

2 answers
  •  执笔经年
    2020-12-09 20:40

    Here's how to do it with the API functions.

    Suppose your DataFrame were the following:

    df.show()
    #+---+---------+
    #| id|  letters|
    #+---+---------+
    #|  1|[a, b, c]|
    #|  2|[d, e, f]|
    #|  3|[g, h, i]|
    #+---+---------+
    
    df.printSchema()
    #root
    # |-- id: long (nullable = true)
    # |-- letters: array (nullable = true)
    # |    |-- element: string (containsNull = true)
    

    You can use square brackets to access elements in the letters column by index, and wrap that in a call to pyspark.sql.functions.array() to create a new ArrayType column.

    import pyspark.sql.functions as f
    
    df.withColumn("first_two", f.array([f.col("letters")[0], f.col("letters")[1]])).show()
    #+---+---------+---------+
    #| id|  letters|first_two|
    #+---+---------+---------+
    #|  1|[a, b, c]|   [a, b]|
    #|  2|[d, e, f]|   [d, e]|
    #|  3|[g, h, i]|   [g, h]|
    #+---+---------+---------+
    

    Or, if there are too many indices to list by hand, you can use a list comprehension:

    df.withColumn("first_two", f.array([f.col("letters")[i] for i in range(2)])).show()
    #+---+---------+---------+
    #| id|  letters|first_two|
    #+---+---------+---------+
    #|  1|[a, b, c]|   [a, b]|
    #|  2|[d, e, f]|   [d, e]|
    #|  3|[g, h, i]|   [g, h]|
    #+---+---------+---------+
    

    In PySpark 2.4+, you can also use pyspark.sql.functions.slice():

    df.withColumn("first_two", f.slice("letters", start=1, length=2)).show()
    #+---+---------+---------+
    #| id|  letters|first_two|
    #+---+---------+---------+
    #|  1|[a, b, c]|   [a, b]|
    #|  2|[d, e, f]|   [d, e]|
    #|  3|[g, h, i]|   [g, h]|
    #+---+---------+---------+
    

    slice may have better performance for large arrays. Note that its start index is 1-based, not 0-based.
