Question
I have a couple of data transformations that seem to run quite slowly while I'm iterating on them.
What general strategies can I use to increase performance?
Input Data:
+-----------+-------+
| key | val |
+-----------+-------+
| a | 1 |
| a | 2 |
| b | 1 |
| b | 2 |
| b | 3 |
+-----------+-------+
My code I'm iterating on is the following:
from pyspark.sql import functions as F
# Output = /my/function/output
# input_df = /my/function/input
def my_compute_function(input_df):
    """Compute difference from maximum of a value column by key

    Keyword arguments:
        input_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
        pyspark.sql.DataFrame
    """
    max_df = input_df \
        .groupBy("key") \
        .agg(F.max(F.col("val")).alias("max_val"))
    joined_df = input_df \
        .join(max_df, "key")
    diff_df = joined_df \
        .withColumn("diff", F.col("max_val") - F.col("val"))
    return diff_df
It took me 4 build iterations to get max_df right, 4 to get joined_df right, and 4 to get diff_df right.
This represents total work of:
pipeline_1:
transform_A:
work_1 : input -> max_df
(takes 4 iterations to get right): 4 * max_df
work_2: max_df -> joined_df
(takes 4 iterations to get right): 4 * joined_df + 4 * max_df
= 4 * joined_df + 4 * max_df
work_3: joined_df -> diff_df
(takes 4 iterations to get right): 4 * diff_df + 4 * joined_df + 4 * max_df
total work:
transform_A
= work_1 + work_2 + work_3
= 4 * max_df + (4 * joined_df + 4 * max_df) + (4 * diff_df + 4 * joined_df + 4 * max_df)
= 12 * max_df + 8 * joined_df + 4 * diff_df
Output data:
+-----------+-------+--------+
| key | val | diff |
+-----------+-------+--------+
| a | 1 | 1 |
| a | 2 | 0 |
| b | 1 | 2 |
| b | 2 | 1 |
| b | 3 | 0 |
+-----------+-------+--------+
Answer 1:
Refactoring
For experimentation / fast iteration, it's often a good idea to refactor your code into several smaller steps instead of a single large step.
This way, you compute the upstream steps once, write the data back to Foundry, and reuse this pre-computed data in later steps. If you kept re-computing these early steps even though their logic hadn't changed, you would be doing nothing but extra work again and again.
Concretely:
from pyspark.sql import functions as F
# output = /my/function/output_max
# input_df = /my/function/input
def my_compute_function(input_df):
    """Compute the max by key

    Keyword arguments:
        input_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
        pyspark.sql.DataFrame
    """
    max_df = input_df \
        .groupBy("key") \
        .agg(F.max(F.col("val")).alias("max_val"))
    return max_df
# output = /my/function/output_joined
# input_df = /my/function/input
# max_df = /my/function/output_max
def my_compute_function(max_df, input_df):
    """Compute the joined output of max and input

    Keyword arguments:
        max_df (pyspark.sql.DataFrame) : input DataFrame
        input_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
        pyspark.sql.DataFrame
    """
    joined_df = input_df \
        .join(max_df, "key")
    return joined_df
# Output = /my/function/output_diff
# joined_df = /my/function/output_joined
def my_compute_function(joined_df):
    """Compute difference from maximum of a value column by key

    Keyword arguments:
        joined_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
        pyspark.sql.DataFrame
    """
    diff_df = joined_df \
        .withColumn("diff", F.col("max_val") - F.col("val"))
    return diff_df
The work you perform would instead look like:
pipeline_2:
transform_A:
work_1: input -> max_df
(takes 4 iterations to get right): 4 * max_df
transform_B:
work_2: max_df -> joined_df
(takes 4 iterations to get right): 4 * joined_df
transform_C:
work_3: joined_df -> diff_df
(takes 4 iterations to get right): 4 * diff_df
total_work:
transform_A + transform_B + transform_C
= work_1 + work_2 + work_3
= 4 * max_df + 4 * joined_df + 4 * diff_df
If you assume max_df, joined_df, and diff_df all cost the same amount to compute, then pipeline_1.total_work = 24 * max_df, whereas pipeline_2.total_work = 12 * max_df, so you can expect roughly a 2x speed improvement in iteration.
Caching
For any 'small' datasets, you should cache them. This keeps the rows in memory for your pipeline instead of requiring a fetch from the written-back dataset. 'Small' is somewhat arbitrary given the many factors that must be considered, but Spark will try to cache the dataset regardless and warn you if it is too big.
In this case, you could cache the intermediate max_df and joined_df datasets, depending on which step you are developing.
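As a minimal sketch of what that looks like in plain PySpark (the session setup and the tiny example data below are my own additions for illustration, not part of the original pipeline):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the real input dataset
input_df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3)],
    ["key", "val"],
)

# Cache the small intermediate aggregate so repeated experiments reuse it
max_df = (
    input_df
    .groupBy("key")
    .agg(F.max(F.col("val")).alias("max_val"))
    .cache()
)
max_df.count()  # materialize the cache once

# Later iterations read max_df from memory instead of recomputing the groupBy
joined_df = input_df.join(max_df, "key")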
Function Calls
You should stick to native PySpark methods as much as possible and never use Python methods directly on the data (i.e. looping over individual rows or executing a UDF). PySpark methods call the underlying Spark methods, which are written in Scala and run directly against the data rather than through the Python runtime. If you use Python only as the layer for interacting with this system, instead of as the system that interacts with the data, you get all the performance benefits of Spark itself.
In the above example, only native PySpark methods are called, so this computation will be quite fast.
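To make the contrast concrete, here is a hedged sketch, building on the joined_df from the steps above; the UDF is purely illustrative of what to avoid, not something the answer recommends. Both expressions produce the same diff column, but only the native version stays inside the Spark engine:

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Slower: every row is serialized out to the Python runtime for evaluation
subtract_udf = F.udf(lambda max_val, val: max_val - val, LongType())
diff_udf_df = joined_df.withColumn(
    "diff", subtract_udf(F.col("max_val"), F.col("val"))
)

# Faster: native column arithmetic is executed by Spark itself
diff_native_df = joined_df.withColumn("diff", F.col("max_val") - F.col("val"))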
Downsampling
If you can derive your own accurate sample of a large input dataset, it can be used as the mock input for your transformations until you have perfected your logic and want to test it against the full set.
In the above case, we could downsample input_df to a single key before executing any of the steps.
I personally down-sample and cache any dataset above 1M rows before ever writing a line of PySpark code; that way my turnaround times are very fast and I'm never waiting on a large dataset just to catch a syntax bug.
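A hedged sketch of that kind of downsampling (the specific key, fraction, and seed below are arbitrary choices of mine, not recommendations from the answer):

from pyspark.sql import functions as F

# Option 1: restrict the input to a single key so every downstream step is tiny
dev_input_df = input_df.filter(F.col("key") == "a")

# Option 2: take a small random sample of a large dataset and cache it for reuse
dev_input_df = input_df.sample(fraction=0.01, seed=42).cache()
dev_input_df.count()  # materialize once so later iterations stay fast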
All Together
A good development pipeline looks like:
- Discrete chunks of code that do particular materializations you expect to re-use later but don't need to be recomputed over and over again
- Downsampled to 'small' sizes
- Cached 'small' datasets for very fast fetching
- PySpark native code only that exploits the fast underlying Spark libraries
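Pulling those points together, a minimal end-to-end sketch of such a development loop might look like the following (the dev_ names are hypothetical placeholders, not Foundry conventions):

from pyspark.sql import functions as F

# 1. Downsample the raw input once and cache it for fast, repeated iteration
dev_input_df = input_df.filter(F.col("key") == "a").cache()
dev_input_df.count()

# 2. Develop each materialization as its own small, reusable step
dev_max_df = dev_input_df.groupBy("key").agg(F.max(F.col("val")).alias("max_val"))
dev_joined_df = dev_input_df.join(dev_max_df, "key")

# 3. Keep everything in native PySpark expressions
dev_diff_df = dev_joined_df.withColumn("diff", F.col("max_val") - F.col("val"))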
Source: https://stackoverflow.com/questions/59162016/how-do-i-decrease-iteration-time-when-making-data-transformations