How to use DataFrame Window expressions and withColumn without changing the number of partitions?

故里飘歌
2021-01-25 07:08

For some reason I have to convert an RDD to a DataFrame, then do something with the DataFrame.

My interface is RDD, so I have...
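
Since the question is cut off here, a minimal sketch of the round trip it describes, assuming an RDD of tuples; the column names (colA, colB, timeCol) are borrowed from the answer below and are purely illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-df-roundtrip").getOrCreate()

    # Hypothetical RDD of (colA, colB, timeCol) records.
    rdd = spark.sparkContext.parallelize(
        [("a", "x", 1), ("a", "x", 2), ("b", "y", 3)]
    )

    # Convert to a DataFrame, work on it, then expose an RDD again.
    df = rdd.toDF(["colA", "colB", "timeCol"])
    result_rdd = df.rdd  # RDD of Row objects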

2 Answers
  •  青春惊慌失措
    2021-01-25 07:31

    I was just reading about controlling the number of partitions when using groupBy aggregation (https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-performance-tuning-groupBy-aggregation.html), and it seems the same trick works with Window expressions. In my code I define a window like

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Frame of exactly the next row, matching the offset in lead(..., 1) below.
    windowSpec = Window \
        .partitionBy('colA', 'colB') \
        .orderBy('timeCol') \
        .rowsBetween(1, 1)
    

    and then doing

    next_event = F.lead('timeCol', 1).over(windowSpec)
    

    and creating a dataframe via

    df2 = df.withColumn('next_event', next_event)
    

    and indeed, the result has 200 partitions (the default value of spark.sql.shuffle.partitions). But if I do

    df2 = df.repartition(10, 'colA', 'colB').withColumn('next_event', next_event)
    

    it has 10! Repartitioning by the same columns the window partitions by already satisfies the distribution the window function requires, so Spark adds no extra shuffle and the partition count is preserved.
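
    A self-contained sketch of the whole experiment, assuming a toy DataFrame (the data and app name are illustrative):

        from pyspark.sql import SparkSession, Window
        from pyspark.sql import functions as F

        spark = SparkSession.builder.appName("window-partitions").getOrCreate()

        df = spark.createDataFrame(
            [("a", "x", 1), ("a", "x", 2), ("b", "y", 3), ("b", "y", 4)],
            ["colA", "colB", "timeCol"],
        )

        windowSpec = (
            Window.partitionBy("colA", "colB")
            .orderBy("timeCol")
            .rowsBetween(1, 1)
        )
        next_event = F.lead("timeCol", 1).over(windowSpec)

        # The window's shuffle defaults to spark.sql.shuffle.partitions (200).
        print(df.withColumn("next_event", next_event).rdd.getNumPartitions())

        # Pre-partitioning by the window's columns satisfies its required
        # distribution, so no extra shuffle is added and 10 is preserved.
        df10 = df.repartition(10, "colA", "colB")
        print(df10.withColumn("next_event", next_event).rdd.getNumPartitions())

    Running this should print 200 (or whatever spark.sql.shuffle.partitions is set to) and then 10; note that adaptive query execution, if enabled, may coalesce the shuffle partitions in the first case.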
