Question
Given a dataframe
value
-----
0.3
0.2
0.7
0.5
is there a way to build a column that contains, for each row, the count of values in the column that are less than or equal to that row's value? Specifically,
value  count_less_equal
-----  ----------------
0.3    2
0.2    1
0.7    4
0.5    3
I could groupBy the value column, but I don't know how to filter all the values in the column that are less than that value.
I was thinking that maybe it's possible to duplicate the first column, then create a filter so that, for each value in col1, one finds the count of values in col2 that are less than or equal to the col1 value (a sketch of this idea follows the table below):
col1  col2
----  ----
0.3   0.3
0.2   0.2
0.7   0.7
0.5   0.5
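One way to express that idea (a minimal sketch, not from the original post; the names spark, df, col1 and col2 are illustrative) is a cross join of the value column with itself, keeping pairs where col2 <= col1 and counting per col1:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.createDataFrame([(0.3,), (0.2,), (0.7,), (0.5,)], ['value'])

# pair every value (col1) with every value (col2), keep pairs where col2 <= col1,
# then count the matches for each col1 value
result = (df.selectExpr('value as col1')
            .crossJoin(df.selectExpr('value as col2'))
            .where(F.col('col2') <= F.col('col1'))
            .groupBy('col1')
            .agg(F.count('*').alias('count_less_equal')))
result.show()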
Answer 1:
You can use a self join, performing the join on t1.value >= t2.value, to get the desired result.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

sc = spark.sparkContext

df = sc.parallelize([
    (0.3, ), (0.2, ), (0.7, ), (0.5, )
]).toDF(["value"])
df.show()
# +-----+
# |value|
# +-----+
# |  0.3|
# |  0.2|
# |  0.7|
# |  0.5|
# +-----+
df.createTempView("table")

spark.sql("""
    select t1.value, count(*) as count
    from table t1
    join table t2 on t1.value >= t2.value
    group by t1.value
    order by t1.value
""").show()
# +-----+-----+
# |value|count|
# +-----+-----+
# | 0.2| 1|
# | 0.3| 2|
# | 0.5| 3|
# | 0.7| 4|
# +-----+-----+
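The same self join can also be written with the DataFrame API instead of SQL (a sketch, assuming the df defined above; the aliases t1 and t2 mirror the SQL version):

from pyspark.sql import functions as F

result = (df.alias('t1')
            .join(df.alias('t2'), F.col('t1.value') >= F.col('t2.value'))
            .groupBy('t1.value')
            .agg(F.count('*').alias('count'))
            .orderBy('value'))
result.show()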
Answer 2:
You can sort the values and count the unique values at or below each one using a window function:
import pyspark.sql.functions as F
from pyspark.sql import Window

# assumes an active SparkSession named spark
tst = spark.createDataFrame(
    [
        (1, 0.3),
        (2, 0.2),
        (3, 0.7),
        (4, 0.5),
        (5, 0.5),
        (3, 0.7),
        (6, 1.0),
        (9, 0.4)
    ], schema=['id', 'val'])
w = Window.orderBy('val')  # consider adding a partition column here; if there is none, consider salting
tst1 = tst.withColumn("result", F.size(F.collect_set('val').over(w)))
tst1.show()
+---+---+------+
| id|val|result|
+---+---+------+
| 2|0.2| 1|
| 1|0.3| 2|
| 9|0.4| 3|
| 5|0.5| 4|
| 4|0.5| 4|
| 3|0.7| 5|
| 3|0.7| 5|
| 6|1.0| 6|
+---+---+------+
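Note that collect_set keeps only distinct values, which is why both 0.5 rows get 4 even though five rows are less than or equal to 0.5. If you want a per-row count instead (a sketch, assuming the tst and w defined above), the default RANGE frame of an ordered window includes ties, so a plain count over the same window does it:

# counts all rows whose val is <= the current row's val, duplicates included
tst2 = tst.withColumn("result", F.count('val').over(w))
tst2.show()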
Source: https://stackoverflow.com/questions/63114467/count-of-all-element-less-than-the-value-in-a-row