From the PySpark docs for rangeBetween:

rangeBetween(start, end): Defines the frame boundaries, from start (inclusive) to end (inclusive).
rowsBetween: With rowsBetween you define the frame as a fixed number of physical rows around the current row, and that frame is calculated for each row independently.
The frame in rowsBetween does not depend on the values of the orderBy column, only on row position.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.read.csv(r'C:\Users\akashSaini\Desktop\TT.csv', inferSchema=True, header=True).na.drop()
# Frame: all preceding rows in the partition plus the current row
w = Window.partitionBy('DEPARTMENT').orderBy('SALARY').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('RowsBetween', F.sum(df.SALARY).over(w)).show()
+----------+----------+------+-----------+
|first_name|Department|Salary|RowsBetween|
+----------+----------+------+-----------+
|     Sofia|     Sales| 20000|      20000|
|    Gordon|     Sales| 25000|      45000|
|    Gracie|     Sales| 25000|      70000|
|    Cellie|     Sales| 25000|      95000|
|    Jervis|     Sales| 30000|     125000|
|     Akash|  Analysis| 30000|      30000|
|   Richard|   Account| 12000|      12000|
|    Joelly|   Account| 15000|      27000|
|   Carmiae|   Account| 15000|      42000|
|       Bob|   Account| 20000|      62000|
|     Gally|   Account| 28000|      90000|
+----------+----------+------+-----------+
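If you want to try this without the CSV file, here is a minimal self-contained sketch that builds the same data inline (the SparkSession setup and the spark.createDataFrame call are my own scaffolding, not part of the original example) and reproduces the rowsBetween result:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Same rows as in the output above, built inline instead of read from TT.csv
data = [('Sofia', 'Sales', 20000), ('Gordon', 'Sales', 25000),
        ('Gracie', 'Sales', 25000), ('Cellie', 'Sales', 25000),
        ('Jervis', 'Sales', 30000), ('Akash', 'Analysis', 30000),
        ('Richard', 'Account', 12000), ('Joelly', 'Account', 15000),
        ('Carmiae', 'Account', 15000), ('Bob', 'Account', 20000),
        ('Gally', 'Account', 28000)]
df = spark.createDataFrame(data, ['first_name', 'Department', 'Salary'])

# Physical-row frame: every preceding row in the partition plus the current row
w = Window.partitionBy('Department').orderBy('Salary').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('RowsBetween', F.sum('Salary').over(w)).show()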
rangeBetween: With rangeBetween, you define the frame by the values of the orderBy column rather than by row position, so the frame can grow or shrink from row to row.
The frame in rangeBetween depends on the orderBy clause: it includes every row whose orderBy value falls within the boundaries, so rows that share the same value all land in the same frame. For example, Gordon, Gracie and Cellie have the same salary, so all three are included in the current frame together.
See the example below:
df = spark.read.csv(r'C:\Users\asaini28.EAD\Desktop\TT.csv', inferSchema=True, header=True).na.drop()
# Same running total, but the frame is defined by salary values instead of row positions
w = Window.partitionBy('DEPARTMENT').orderBy('SALARY').rangeBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('RangeBetween', F.sum(df.SALARY).over(w)).select('first_name', 'Department', 'Salary', 'RangeBetween').show()
+----------+----------+------+------------+
|first_name|Department|Salary|RangeBetween|
+----------+----------+------+------------+
|     Sofia|     Sales| 20000|       20000|
|    Gordon|     Sales| 25000|       95000|
|    Gracie|     Sales| 25000|       95000|
|    Cellie|     Sales| 25000|       95000|
|    Jervis|     Sales| 30000|      125000|
|     Akash|  Analysis| 30000|       30000|
|   Richard|   Account| 12000|       12000|
|    Joelly|   Account| 15000|       42000|
|   Carmiae|   Account| 15000|       42000|
|       Bob|   Account| 20000|       62000|
|     Gally|   Account| 28000|       90000|
+----------+----------+------+------------+
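To see the difference in one place, you can compute both frames over the same data (continuing with the inline df from the sketch above) and compare them where the orderBy values are tied:

# rowsBetween counts physical rows; rangeBetween includes every row with the same Salary value
w_rows = Window.partitionBy('Department').orderBy('Salary').rowsBetween(Window.unboundedPreceding, Window.currentRow)
w_range = Window.partitionBy('Department').orderBy('Salary').rangeBetween(Window.unboundedPreceding, Window.currentRow)

(df.withColumn('RowsBetween', F.sum('Salary').over(w_rows))
   .withColumn('RangeBetween', F.sum('Salary').over(w_range))
   .filter(F.col('Department') == 'Sales')
   .show())

For the three Sales rows with a salary of 25000, RowsBetween grows row by row (45000, 70000, 95000) while RangeBetween is 95000 for all three, because rangeBetween treats the tied rows as one group.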