Should the DataFrame function groupBy be avoided?


Question


This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?

I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).

Essentially, what I am trying to do is:

# Non-working pseudocode
df.groupBy("A").reduce(lambda x,y: if (x.TotalValue > y.TotalValue) x else y)

However, the DataFrame API does not offer a "reduce" option. I'm probably misunderstanding what exactly the DataFrame API is trying to achieve.
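
For context, this is roughly the RDD-style approach I have in mind (a rough sketch only, using the same df and column names as above); reduceByKey does combine locally on each partition before the shuffle, but it drops out of the DataFrame API:

# Keep, per key "A", the row with the largest TotalValue. reduceByKey
# pre-combines on each partition, so only one row per key per partition
# is shuffled across the network.
best_rows = (
    df.rdd
      .map(lambda row: (row["A"], row))
      .reduceByKey(lambda x, y: x if x["TotalValue"] > y["TotalValue"] else y)
      .values()
)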


Answer 1:


A DataFrame groupBy followed by an agg will not move the data around unnecessarily (see here for a good example), so there is no need to avoid it.

When using the RDD API, the opposite is true. Here it is preferable to avoid groupByKey and to use reduceByKey or combineByKey where possible. Some situations, however, do require the use of groupByKey.
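
To illustrate the RDD-side recommendation with a toy example (assuming an existing SparkSession named spark; the data is made up):

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Preferred: reduceByKey pre-aggregates on each partition, so only the
# partial sums are shuffled.
sums = rdd.reduceByKey(lambda x, y: x + y)

# Avoid when a combine function exists: groupByKey ships every single
# value across the network before aggregating.
sums_slow = rdd.groupByKey().mapValues(sum)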


The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example you want to find the maximum value in a single column for each group, which can be achieved with the max function:

from pyspark.sql import functions as F

joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))

In addition to max, there is a multitude of functions that can be used in combination with agg; see here for all operations.
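
For example, several aggregates can be computed in a single agg call (the extra columns below are purely illustrative):

from pyspark.sql import functions as F

# One pass over each group, several aggregates at once.
joined_df.groupBy("A").agg(
    F.max("TotalValue").alias("MaxValue"),
    F.min("TotalValue").alias("MinValue"),
    F.count("*").alias("RowCount"),
)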




Answer 2:


The documentation is pretty all over the place.

There has been a lot of optimization work for DataFrames. DataFrames carry additional information about the structure of your data, which helps with this. Many people recommend DataFrames over RDDs because of this "increased optimization."

There is a lot of heavy wizardry behind the scenes.
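
If you are curious, explain() shows some of that wizardry. For a groupBy followed by agg you should see a partial aggregation running before the shuffle (the exact output varies by Spark version; column names taken from your example):

from pyspark.sql import functions as F

# Prints the physical plan; look for a HashAggregate (partial) step
# before the Exchange (shuffle) and a final HashAggregate after it.
df.groupBy("A").agg(F.max("TotalValue")).explain()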

I recommend trying groupBy on both RDDs and DataFrames with a large dataset and comparing the results. Sometimes you just have to try it and see.

Also, for performance improvements, I suggest experimenting (through trial and error) with:

  1. the Spark configuration settings (see the configuration docs)
  2. spark.sql.shuffle.partitions (see the SQL tuning docs), as in the example after this list
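
For example, the shuffle partition count used by DataFrame operations such as groupBy can be changed per session. The default is 200; the right value depends on your data and cluster, so the number below is only a placeholder:

# Number of partitions used when DataFrame operations such as
# groupBy/agg shuffle data (default: 200).
spark.conf.set("spark.sql.shuffle.partitions", 400)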


Source: https://stackoverflow.com/questions/48870399/should-the-dataframe-function-groupby-be-avoided
