I have a Spark dataframe with rows like:
1 | [a, b, c]
2 | [d, e, f]
3 | [g, h, i]
Now I want to keep only the first 2 elements of the array column in each row.
Either my PySpark skills have gone rusty (I confess I don't hone them much anymore), or this is a tough nut indeed... The only way I managed to do it is with SQL statements:
spark.version
# u'2.3.1'
# dummy data:
from pyspark.sql import Row
x = [Row(col1="xx", col2="yy", col3="zz", col4=[123,234, 456])]
rdd = sc.parallelize(x)
df = spark.createDataFrame(rdd)
df.show()
# result:
+----+----+----+---------------+
|col1|col2|col3| col4|
+----+----+----+---------------+
| xx| yy| zz|[123, 234, 456]|
+----+----+----+---------------+
df.createOrReplaceTempView("df")
df2 = spark.sql("SELECT col1, col2, col3, array(col4[0], col4[1]) as col5 FROM df")
# note: array(...) keeps col5 an array; a bare (col4[0], col4[1]) would
# build a struct instead, which merely *displays* the same in show()
df2.show()
# result:
+----+----+----+----------+
|col1|col2|col3| col5|
+----+----+----+----------+
| xx| yy| zz|[123, 234]|
+----+----+----+----------+
For future questions, it would be good to follow the suggested guidelines on How to make good reproducible Apache Spark Dataframe examples.