From the PySpark docs rangeBetween:
rangeBetween(start, end)Defines the frame boundaries, from start (inclusive) to end (inc
rowsBetween: - With rowsBetween you define a boundary frame of rows to calculate, which frame is calculated independently.
Frame in rowsBetween does not depend on orderBy clause.
df = spark.read.csv(r'C:\Users\akashSaini\Desktop\TT.csv',inferSchema =True, header=True).na.drop()
w =Window.partitionBy('DEPARTMENT').orderBy('SALARY').rowsBetween(Window.unboundedPreceding,Window.currentRow)
df.withColumn('RowsBetween', F.sum(df.SALARY).over(w)).show()
first_name|Department|Salary|RowsBetween|
Sofia| Sales| 20000| 20000|
Gordon| Sales| 25000| 45000|
Gracie| Sales| 25000| 70000|
Cellie| Sales| 25000| 95000|
Jervis| Sales| 30000|125000|
Akash| Analysis| 30000| 30000|
Richard| Account| 12000| 12000|
Joelly| Account| 15000| 27000|
Carmiae| Account| 15000| 42000|
Bob| Account| 20000| 62000|
Gally| Account| 28000| 90000
rangeBetween: - With rangeBetween, you define a boundary frame of rows to calculate, which may change.
Frame in rowsBetween depends on orderBy clause. rangeBetween will include all the rows which has same value in orderBy clause like Gordon, Gracie and Cellie have same salary so included with the current frame.
For more understanding see below example: -
df = spark.read.csv(r'C:\Users\asaini28.EAD\Desktop\TT.csv',inferSchema =True, header=True).na.drop()
w =Window.partitionBy('DEPARTMENT').orderBy('SALARY').rangeBetween(Window.unboundedPreceding,Window.currentRow)
df.withColumn('RangeBetween', F.sum(df.SALARY).over(w)).select('first_name','Department','Salary','Test').show()
first_name|Department|Salary|RangeBetween|
Sofia| Sales| 20000| 20000|
Gordon| Sales| 25000| 95000|
Gracie| Sales| 25000| 95000|
Cellie| Sales| 25000| 95000|
Jervis| Sales| 30000|125000|
Akash| Analysis| 30000| 30000|
Richard| Account| 12000| 12000|
Joelly| Account| 15000| 42000|
Carmiae| Account| 15000| 42000|
Bob| Account| 20000| 62000|
Gally| Account| 28000| 90000|
The Java spark docs add clarity: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/expressions/WindowSpec.html#rowsBetween-long-long-
rangeBetween
A range-based boundary is based on the actual value of the ORDER BY expression(s). An offset is used to alter the value of the ORDER BY expression, for instance if the current order by expression has a value of 10 and the lower bound offset is -3, the resulting lower bound for the current row will be 10 - 3 = 7. This however puts a number of constraints on the ORDER BY expressions: there can be only one expression and this expression must have a numerical data type. An exception can be made when the offset is unbounded, because no value modification is needed, in this case multiple and non-numeric ORDER BY expression are allowed.
rowBetween
A row based boundary is based on the position of the row within the partition. An offset indicates the number of rows above or below the current row, the frame for the current row starts or ends. For instance, given a row based sliding frame with a lower bound offset of -1 and a upper bound offset of +2. The frame for row with index 5 would range from index 4 to index 6.
RANGEbetween looks at the ORDER BY clause to determine if a row is included in a window.
ROWSbetween looks at the order of the rows.
RANGE between checks if the ORDER BY is within some specified range, and will include them in a window.
ROWSbetween will form your window based on what rows are positionally around your current row (irregardless of the value of the ORDER BY column for those rows)
It is simple:
ROWS BETWEEN doesn't care about the exact values. It cares only about the order of rows, and takes fixed number of preceding and following rows when computing frame.RANGE BETWEEN considers values when computing frame.Let's use an example using two window definitions:
ORDER BY x ROWS BETWEEN 2 PRECEDING AND CURRENT ROWORDER BY x RANGE BETWEEN 2 PRECEDING AND CURRENT ROWand data as
+---+
| x|
+---+
| 10|
| 20|
| 30|
| 31|
+---+
Assuming the current row is the one with value 31 for the first window following rows will be included (current one, and two preceding):
+---+----------------------------------------------------+
| x|ORDER BY x ROWS BETWEEN 2 PRECEDING AND CURRENT ROW|
+---+----------------------------------------------------+
| 10| false|
| 20| true|
| 30| true|
| 31| true|
+---+----------------------------------------------------+
and for the second one following (current one, and all preceding where x >= 31 - 2):
+---+-----------------------------------------------------+
| x|ORDER BY x RANGE BETWEEN 2 PRECEDING AND CURRENT ROW|
+---+-----------------------------------------------------+
| 10| false|
| 20| false|
| 30| true|
| 31| true|
+---+-----------------------------------------------------+