How to calculate cumulative sum using sqlContext

予麋鹿 2020-12-15 02:11

I know we can use a Window function in PySpark to calculate a cumulative sum, but Window is only supported in HiveContext and not in SQLContext. I need to use SQLContext as Hive…
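
For reference, a cumulative sum can also be written without window functions at all, for example as a self-join in plain Spark SQL. The sketch below is only an illustration, not taken from the thread; it assumes an id column that fixes the order in which rows accumulate, and a Spark version whose SQL supports non-equi joins:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="cumsum_no_window")
    sqlcontext = SQLContext(sc)

    # Hypothetical data with an explicit id column to order the running total by.
    df = sqlcontext.createDataFrame([(1, "Bob", 1), (2, "Cam", 1), (3, "Lin", 1)],
                                    ["id", "name", "donation"])
    df.registerTempTable("donations")

    # For each row, sum the donations of every row whose id is <= its own id.
    sqlcontext.sql("""
        SELECT a.id, a.name, SUM(b.donation) AS cumsum
        FROM donations a
        JOIN donations b ON b.id <= a.id
        GROUP BY a.id, a.name
        ORDER BY a.id
    """).show()

Note that the self-join compares every pair of rows, so it is only practical for small tables; the window-based approach shown in the answer below is the usual one.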

4 Answers
  •  失恋的感觉
    2020-12-15 02:33

    Here is a simple example:

    import pyspark
    from pyspark.sql import window
    import pyspark.sql.functions as sf
    
    
    sc = pyspark.SparkContext(appName="test")
    sqlcontext = pyspark.SQLContext(sc)
    
    data = sqlcontext.createDataFrame([("Bob", "M", "Boston", 1, 20),
                                       ("Cam", "F", "Cambridge", 1, 25),
                                       ("Lin", "F", "Cambridge", 1, 25),
                                       ("Cat", "M", "Boston", 1, 20),
                                       ("Sara", "F", "Cambridge", 1, 15),
                                       ("Jeff", "M", "Cambridge", 1, 25),
                                       ("Bean", "M", "Cambridge", 1, 26),
                                       ("Dave", "M", "Cambridge", 1, 21)],
                                      ["name", "gender", "city", "donation", "age"])
    
    
    data.show()
    

    gives the following output:

    +----+------+---------+--------+---+
    |name|gender|     city|donation|age|
    +----+------+---------+--------+---+
    | Bob|     M|   Boston|       1| 20|
    | Cam|     F|Cambridge|       1| 25|
    | Lin|     F|Cambridge|       1| 25|
    | Cat|     M|   Boston|       1| 20|
    |Sara|     F|Cambridge|       1| 15|
    |Jeff|     M|Cambridge|       1| 25|
    |Bean|     M|Cambridge|       1| 26|
    |Dave|     M|Cambridge|       1| 21|
    +----+------+---------+--------+---+
    

    Define a window

    win_spec = (window.Window
                      .partitionBy(['gender', 'city'])
                      .rowsBetween(window.Window.unboundedPreceding, 0))
    

    # window.Window.unboundedPreceding -- start the frame at the first row of the group
    # .rowsBetween(..., 0) -- 0 means the frame ends at the current row; if -2 were
    #   given instead, the frame would cover up to 2 rows before the current row
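
    As an aside (not part of the original answer), the rolling variant mentioned in the comment above could look like the sketch below; ordering by name is only an assumption so that "the two rows before the current row" are well defined:

    # Sketch only: sum over the current row and up to the 2 preceding rows per group.
    # orderBy('name') is an assumed ordering; use whatever column defines row order.
    rolling_spec = (window.Window
                          .partitionBy(['gender', 'city'])
                          .orderBy('name')
                          .rowsBetween(-2, 0))
    data.withColumn('rolling_sum', sf.sum(data.donation).over(rolling_spec)).show()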

    Now, here is a trap:

    temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))
    

    which fails with the error:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-...> in <module>()
    ----> 1 temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))
    
    /Users/mupadhye/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.pyc in __iter__(self)
        238 
        239     def __iter__(self):
    --> 240         raise TypeError("Column is not iterable")
        241 
        242     # string methods
    
    TypeError: Column is not iterable
    

    This happens because Python's built-in sum function was used instead of PySpark's. The fix is to use the sum function from pyspark.sql.functions (imported above as sf):

    temp = data.withColumn('CumSumDonation', sf.sum(data.donation).over(win_spec))
    temp.show()
    

    will give:

    +----+------+---------+--------+---+--------------+
    |name|gender|     city|donation|age|CumSumDonation|
    +----+------+---------+--------+---+--------------+
    |Sara|     F|Cambridge|       1| 15|             1|
    | Cam|     F|Cambridge|       1| 25|             2|
    | Lin|     F|Cambridge|       1| 25|             3|
    | Bob|     M|   Boston|       1| 20|             1|
    | Cat|     M|   Boston|       1| 20|             2|
    |Dave|     M|Cambridge|       1| 21|             1|
    |Jeff|     M|Cambridge|       1| 25|             2|
    |Bean|     M|Cambridge|       1| 26|             3|
    +----+------+---------+--------+---+--------------+
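
    One caveat not raised in the answer: the window above has no orderBy, so the order in which donations accumulate inside each (gender, city) group is not guaranteed. If a deterministic running total is needed, an ordering column can be added to the window spec, for example (ordering by name here is only an illustrative assumption):

    # Same cumulative sum, but with an explicit (assumed) ordering inside each group.
    win_spec_ordered = (window.Window
                              .partitionBy(['gender', 'city'])
                              .orderBy('name')
                              .rowsBetween(window.Window.unboundedPreceding, 0))
    data.withColumn('CumSumDonation',
                    sf.sum(data.donation).over(win_spec_ordered)).show()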
    
