Question
Given a dataframe
value
-----
0.3
0.2
0.7
0.5
is there a way to build a column that contains, for each row, the count of values in the column that are less than or equal to that row's value? Specifically,
value  count_less_equal
-----  ----------------
0.3    2
0.2    1
0.7    4
0.5    3
I could groupBy the value column, but I don't know how to filter all the values in the column that are less than that value.
I was thinking that maybe it's possible to duplicate the first column, then create a filter so that, for each value in col1, one finds the count of values in col2 that are less than or equal to the col1 value (a sketch of this idea follows the table below):
col1  col2
----  ----
0.3   0.3
0.2   0.2
0.7   0.7
0.5   0.5
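One way to express that idea (a minimal sketch, not from the original post; the names spark, df, col1 and col2 are illustrative) is a cross join of the value column with itself, keeping pairs where col2 <= col1 and counting per col1:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.createDataFrame([(0.3,), (0.2,), (0.7,), (0.5,)], ['value'])

# pair every value (col1) with every value (col2), keep pairs where col2 <= col1,
# then count the matches for each col1 value
result = (df.selectExpr('value as col1')
            .crossJoin(df.selectExpr('value as col2'))
            .where(F.col('col2') <= F.col('col1'))
            .groupBy('col1')
            .agg(F.count('*').alias('count_less_equal')))
result.show()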
Answer 1:
You can use a self join, performing the join on t1.value >= t2.value, to get the desired result.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

sc = spark.sparkContext

df = sc.parallelize([
    (0.3, ), (0.2, ), (0.7, ), (0.5, )
]).toDF(["value"])
df.show()
# +-----+
# |value|
# +-----+
# |  0.3|
# |  0.2|
# |  0.7|
# |  0.5|
# +-----+
df.createTempView("table")

spark.sql("""
    select t1.value, count(*) as count
    from table t1
    join table t2 on t1.value >= t2.value
    group by t1.value
    order by t1.value
""").show()
# +-----+-----+
# |value|count|
# +-----+-----+
# | 0.2| 1|
# | 0.3| 2|
# | 0.5| 3|
# | 0.7| 4|
# +-----+-----+
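The same self join can also be written with the DataFrame API instead of SQL (a sketch, assuming the df defined above; the aliases t1 and t2 mirror the SQL version):

from pyspark.sql import functions as F

result = (df.alias('t1')
            .join(df.alias('t2'), F.col('t1.value') >= F.col('t2.value'))
            .groupBy('t1.value')
            .agg(F.count('*').alias('count'))
            .orderBy('value'))
result.show()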
Answer 2:
You can sort the values and count the unique values at or below each one using a window function:
import pyspark.sql.functions as F
from pyspark.sql import Window

# assumes an active SparkSession named spark
tst = spark.createDataFrame(
    [
        (1, 0.3),
        (2, 0.2),
        (3, 0.7),
        (4, 0.5),
        (5, 0.5),
        (3, 0.7),
        (6, 1.0),
        (9, 0.4)
    ], schema=['id', 'val'])
w = Window.orderBy('val')  # consider adding a partition column here; if there is none, consider salting
tst1 = tst.withColumn("result", F.size(F.collect_set('val').over(w)))
tst1.show()
+---+---+------+
| id|val|result|
+---+---+------+
| 2|0.2| 1|
| 1|0.3| 2|
| 9|0.4| 3|
| 5|0.5| 4|
| 4|0.5| 4|
| 3|0.7| 5|
| 3|0.7| 5|
| 6|1.0| 6|
+---+---+------+
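Note that collect_set keeps only distinct values, which is why both 0.5 rows get 4 even though five rows are less than or equal to 0.5. If you want a per-row count instead (a sketch, assuming the tst and w defined above), the default RANGE frame of an ordered window includes ties, so a plain count over the same window does it:

# counts all rows whose val is <= the current row's val, duplicates included
tst2 = tst.withColumn("result", F.count('val').over(w))
tst2.show()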
Source: https://stackoverflow.com/questions/63114467/count-of-all-element-less-than-the-value-in-a-row