I have a Spark dataframe with rows like:
1 | [a, b, c]
2 | [d, e, f]
3 | [g, h, i]
Now I want to keep only the first 2 elements of the array column in each row.
Either my PySpark skills have gone rusty (I confess I don't hone them much anymore), or this is a tough nut indeed... The only way I managed to do it is with SQL statements:
spark.version
# u'2.3.1'
# dummy data:
from pyspark.sql import Row
x = [Row(col1="xx", col2="yy", col3="zz", col4=[123,234, 456])]
rdd = sc.parallelize(x)
df = spark.createDataFrame(rdd)
df.show()
# result:
+----+----+----+---------------+
|col1|col2|col3| col4|
+----+----+----+---------------+
| xx| yy| zz|[123, 234, 456]|
+----+----+----+---------------+
df.createOrReplaceTempView("df")
df2 = spark.sql("SELECT col1, col2, col3, array(col4[0], col4[1]) as col5 FROM df")
# note: array(...) keeps col5 an array; a bare (col4[0], col4[1]) would
# build a struct instead, which merely *displays* the same in show()
df2.show()
# result:
+----+----+----+----------+
|col1|col2|col3| col5|
+----+----+----+----------+
| xx| yy| zz|[123, 234]|
+----+----+----+----------+
For future questions, it would be good to follow the suggested guidelines on How to make good reproducible Apache Spark Dataframe examples.