I have a PySpark DataFrame:

id | column
---+------------------------
 1 | [0.2, 2, 3, 4, 3, 0.5]
 2 | [7, 0.3, 0.3, ...]
Here's a way you can try:
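First, a minimal sketch that recreates the sample DataFrame so the rest is reproducible. Note the second array is truncated in the question; the values used here are an assumption, reconstructed from the sums in the example output at the end:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the question; the second array is an assumption
# inferred from the sums (0.6, 15.0, 2.0) in the output below
df = spark.createDataFrame(
    [(1, [0.2, 2.0, 3.0, 4.0, 3.0, 0.5]),
     (2, [7.0, 0.3, 0.3, 8.0, 2.0])],
    ['id', 'column'])

With that in place: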
import pyspark.sql.functions as F
from pyspark.sql import Row

# with an RDD map: split each list by condition, then sum each part (rounded)
s = (df
     .select('column')
     .rdd
     .map(lambda x: [[i for i in x.column if i < 2],
                     [i for i in x.column if i > 2],
                     [i for i in x.column if i == 2]])
     .map(lambda x: [Row(round(sum(i), 2)) for i in x])
     .toDF(['col<2', 'col>2', 'col=2']))
# create a dummy id so we can join both DataFrames
df = df.withColumn('mid', F.monotonically_increasing_id())
s = s.withColumn('mid', F.monotonically_increasing_id())
# join on the dummy id (the default join is inner), then drop it
df = df.join(s, on='mid').drop('mid')
df.show()
+---+--------------------+-----+------+-----+
| id|              column|col<2| col>2|col=2|
+---+--------------------+-----+------+-----+
|  1|[0.2, 2.0, 3.0, 4...|[0.7]|[10.0]|[2.0]|
|  2|[7.0, 0.3, 0.3, 8...|[0.6]|[15.0]|[2.0]|
+---+--------------------+-----+------+-----+
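One caveat with the approach above: monotonically_increasing_id guarantees unique, increasing ids, but not that two separately built DataFrames get matching ids, so the dummy-id join can pair rows incorrectly on multi-partition data. If you're on Spark 2.4+, here's a sketch that stays in the DataFrame API and avoids both the RDD round-trip and the join, using the filter and aggregate higher-order functions (sum_where is a hypothetical helper name, not a built-in):

# sum the array elements that satisfy a condition (Spark 2.4+)
# sum_where is a hypothetical helper, not part of the PySpark API
def sum_where(col, cond):
    # filter keeps the matching elements; aggregate folds them into a sum
    # (0D is a double-typed zero as the starting accumulator)
    return F.round(
        F.expr(f"aggregate(filter(`{col}`, x -> x {cond}), 0D, (acc, x) -> acc + x)"),
        2)

# applied to the original df
df = (df
      .withColumn('col<2', sum_where('column', '< 2'))
      .withColumn('col>2', sum_where('column', '> 2'))
      .withColumn('col=2', sum_where('column', '= 2')))
df.show()

Unlike the RDD version, this yields plain double columns (0.7, 10.0, ...) rather than single-field rows.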