For some reason I have to convert an RDD to a DataFrame and then do something with the DataFrame; my interface is an RDD.
I was just reading about controlling the number of partitions used by a groupBy aggregation, from https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-performance-tuning-groupBy-aggregation.html, and it seems the same trick works with Window. In my code I define a window like
from pyspark.sql import Window
import pyspark.sql.functions as F

windowSpec = Window \
    .partitionBy('colA', 'colB') \
    .orderBy('timeCol') \
    .rowsBetween(1, 1)  # frame of exactly the next row, matching lead(..., 1)
and then doing
next_event = F.lead('timeCol', 1).over(windowSpec)
and creating a DataFrame via
df2 = df.withColumn('next_event', next_event)
and indeed, the result has 200 partitions (the default spark.sql.shuffle.partitions). But if I do
df2 = df.repartition(10, 'colA', 'colB').withColumn('next_event', next_event)
it has 10!
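For context, here is a minimal, self-contained sketch of the same pattern, assuming a toy DataFrame with columns colA, colB and timeCol; the sample data and the target of 10 partitions are placeholders, not the original job. The redundant rowsBetween(1, 1) is dropped, since lead(..., 1) already implies that frame:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real data (assumed schema: colA, colB, timeCol).
df = spark.createDataFrame(
    [('a', 'x', 1), ('a', 'x', 2), ('b', 'y', 3), ('b', 'y', 4)],
    ['colA', 'colB', 'timeCol'])

windowSpec = Window.partitionBy('colA', 'colB').orderBy('timeCol')
next_event = F.lead('timeCol', 1).over(windowSpec)

# Without an explicit repartition, the window's shuffle uses
# spark.sql.shuffle.partitions (200 by default; adaptive query execution
# in newer Spark versions may coalesce small shuffles to fewer partitions).
df2 = df.withColumn('next_event', next_event)
print(df2.rdd.getNumPartitions())

# Repartitioning on the same keys first already satisfies the window's
# required distribution, so Spark adds no extra exchange and the result
# keeps the 10 partitions.
df2 = df.repartition(10, 'colA', 'colB').withColumn('next_event', next_event)
print(df2.rdd.getNumPartitions())

Comparing df2.explain() for the two variants should show the difference: with the upfront repartition the only Exchange is hashpartitioning(colA, colB, 10), whereas without it Spark inserts an Exchange with spark.sql.shuffle.partitions (200) partitions for the window.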