Difference between dense_rank and row_number in Spark

遇见更好的自我 2020-12-13 14:57

I am trying to understand the difference between dense rank and row number. In each new window partition, both start from 1. Does the rank of a row not always start from 1? Any

1 answer
  • 2020-12-13 15:24

    The difference is when there are "ties" in the ordering column. Check the example below:

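    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    // Needed for toDF outside spark-shell; assumes a SparkSession named `spark`
    import spark.implicits._
    
    // Two rows tie on col2 = 10 within the same partition (col1 = "a")
    val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")
    
    // Partition by col1 and order rows by col2 within each partition
    val windowSpec = Window.partitionBy("col1").orderBy("col2")
    
    // Add all three ranking columns side by side for comparison
    df
      .withColumn("rank", rank().over(windowSpec))
      .withColumn("dense_rank", dense_rank().over(windowSpec))
      .withColumn("row_number", row_number().over(windowSpec))
      .show()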
    
    +----+----+----+----------+----------+
    |col1|col2|rank|dense_rank|row_number|
    +----+----+----+----------+----------+
    |   a|  10|   1|         1|         1|
    |   a|  10|   1|         1|         2|
    |   a|  20|   3|         2|         3|
    +----+----+----+----------+----------+
    

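    Note that the value 10 appears twice in col2 within the same window (col1 = "a"). That is where the three functions differ: rank gives tied rows the same number and then skips ahead (1, 1, 3), dense_rank gives tied rows the same number without leaving a gap (1, 1, 2), and row_number always assigns consecutive numbers (1, 2, 3), breaking ties arbitrarily. All three restart at 1 in every new partition.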
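
    To make the tie handling concrete, here is a minimal plain-Scala sketch (no Spark required) of how the three numbers could be assigned within a single partition that is already sorted by the ordering column. The object `RankingSketch` and the helper `number` are hypothetical names used only for illustration; this is not how Spark implements these functions internally.

```scala
// Plain-Scala illustration of rank, dense_rank and row_number semantics
// for ONE partition whose values are already sorted by the ordering column.
// `RankingSketch` and `number` are hypothetical names, not Spark APIs.
object RankingSketch {
  // Returns (value, rank, denseRank, rowNumber) for each row.
  def number(sortedValues: Seq[Int]): Seq[(Int, Int, Int, Int)] = {
    var rank = 0
    var denseRank = 0
    var previous: Option[Int] = None
    sortedValues.zipWithIndex.map { case (value, index) =>
      val rowNumber = index + 1        // always increases by one per row
      if (!previous.contains(value)) { // a new ordering value starts here
        rank = rowNumber               // rank jumps to the row position, leaving gaps after ties
        denseRank += 1                 // dense rank advances by exactly one
      }
      previous = Some(value)
      (value, rank, denseRank, rowNumber)
    }
  }

  def main(args: Array[String]): Unit = {
    // Same col2 values as the example above: 10, 10, 20
    number(Seq(10, 10, 20)).foreach(println)
  }
}
```

    Running it on the values 10, 10, 20 yields the same rank, dense_rank and row_number columns as the table above. Calling `number` separately for each partition's values mirrors how the numbering restarts at 1 in every partition.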
