get first N elements from dataframe ArrayType column in pyspark

逝去的感伤 2020-12-09 20:26

I have a Spark dataframe with rows like:

1   |   [a, b, c]
2   |   [d, e, f]
3   |   [g, h, i]

Now I want to keep only the first 2 elements of the array column.

2 Answers
  •  夕颜 (OP)
     2020-12-09 20:44

    Either my pyspark skills have gone rusty (I confess I don't hone them much these days), or this is a tough nut indeed... The only way I managed to do it is with SQL statements:

    spark.version
    #  u'2.3.1'
    
    # dummy data:
    
    from pyspark.sql import Row
    x = [Row(col1="xx", col2="yy", col3="zz", col4=[123,234, 456])]
    rdd = sc.parallelize(x)
    df = spark.createDataFrame(rdd)
    df.show()
    # result:
    +----+----+----+---------------+
    |col1|col2|col3|           col4|
    +----+----+----+---------------+
    |  xx|  yy|  zz|[123, 234, 456]|
    +----+----+----+---------------+
    
    df.createOrReplaceTempView("df")
    df2 = spark.sql("SELECT col1, col2, col3, (col4[0], col4[1]) as col5 FROM df")
    df2.show()
    # result:
    +----+----+----+----------+
    |col1|col2|col3|      col5|
    +----+----+----+----------+
    |  xx|  yy|  zz|[123, 234]|
    +----+----+----+----------+
    

    For future questions, it would be good to follow the suggested guidelines on How to make good reproducible Apache Spark Dataframe examples.
