Question
I have a couple of data transformations that seem to run quite slowly while I'm iterating on them.
What general strategies can I use to increase performance?
Input Data:
+-----------+-------+
| key | val |
+-----------+-------+
| a | 1 |
| a | 2 |
| b | 1 |
| b | 2 |
| b | 3 |
+-----------+-------+
My code I'm iterating on is the following:
from pyspark.sql import functions as F
# Output = /my/function/output
# input_df = /my/function/input
def my_compute_function(input_df):
    """Compute difference from maximum of a value column by key

    Keyword arguments:
        input_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
        pyspark.sql.DataFrame
    """
    max_df = input_df \
        .groupBy("key") \
        .agg(F.max(F.col("val")).alias("max_val"))
    joined_df = input_df \
        .join(max_df, "key")
    diff_df = joined_df \
        .withColumn("diff", F.col("max_val") - F.col("val"))
    return diff_df
It took me 4 build iterations to get max_df right, 4 to get joined_df right, and 4 to get diff_df right.
This represents total work of:
pipeline_1:
transform_A:
work_1 : input -> max_df
(takes 4 iterations to get right): 4 * max_df
work_2: max_df -> joined_df
(takes 4 iterations to get right): 4 * joined_df + 4 * max_df
= 4 * joined_df + 4 * max_df
work_3: joined_df -> diff_df
(takes 4 iterations to get right): 4 * diff_df + 4 * joined_df + 4 * max_df
total work:
transform_A
= work_1 + work_2 + work_3
= 4 * max_df + (4 * joined_df + 4 * max_df) + (4 * diff_df + 4 * joined_df + 4 * max_df)
= 12 * max_df + 8 * joined_df + 4 * diff_df
Output data:
+-----------+-------+--------+
| key | val | diff |
+-----------+-------+--------+
| a | 1 | 1 |
| a | 2 | 0 |
| b | 1 | 2 |
| b | 2 | 1 |
| b | 3 | 0 |
+-----------+-------+--------+
Answer 1:
Refactoring
For experimentation / fast iteration, it's often a good idea to refactor your code into several smaller steps instead of a single large step.
This way, you compute the upstream steps once, write the data back to Foundry, and reuse this pre-computed data in later steps. If you kept re-computing these early steps even though their logic hadn't changed, you would be doing nothing but extra work again and again.
Concretely:
from pyspark.sql import functions as F
# output = /my/function/output_max
# input_df = /my/function/input
def my_compute_function(input_df):
    """Compute the max by key

    Keyword arguments:
        input_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
        pyspark.sql.DataFrame
    """
    max_df = input_df \
        .groupBy("key") \
        .agg(F.max(F.col("val")).alias("max_val"))
    return max_df
# output = /my/function/output_joined
# input_df = /my/function/input
# max_df = /my/function/output_max
def my_compute_function(max_df, input_df):
    """Compute the joined output of max and input

    Keyword arguments:
        max_df (pyspark.sql.DataFrame) : input DataFrame
        input_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
        pyspark.sql.DataFrame
    """
    joined_df = input_df \
        .join(max_df, "key")
    return joined_df
# Output = /my/function/output_diff
# joined_df = /my/function/output_joined
def my_compute_function(joined_df):
    """Compute difference from maximum of a value column by key

    Keyword arguments:
        joined_df (pyspark.sql.DataFrame) : input DataFrame

    Returns:
        pyspark.sql.DataFrame
    """
    diff_df = joined_df \
        .withColumn("diff", F.col("max_val") - F.col("val"))
    return diff_df
The work you perform would instead look like:
pipeline_2:
transform_A:
work_1: input -> max_df
(takes 4 iterations to get right): 4 * max_df
transform_B:
work_2: max_df -> joined_df
(takes 4 iterations to get right): 4 * joined_df
transform_C:
work_3: joined_df -> diff_df
(takes 4 iterations to get right): 4 * diff_df
total_work:
transform_A + transform_B + transform_C
= work_1 + work_2 + work_3
= 4 * max_df + 4 * joined_df + 4 * diff_df
If you assume max_df, joined_df, and diff_df all cost the same amount to compute, then pipeline_1.total_work = 24 * max_df, whereas pipeline_2.total_work = 12 * max_df, so you can expect roughly a 2x speed improvement in iteration.
Caching
For any 'small' datasets, you should cache them. This keeps the rows in memory for your pipeline instead of requiring a fetch from the written-back dataset. 'Small' is somewhat arbitrary given the many factors that must be considered, but Spark will try to cache the dataset regardless and warn you if it is too big.
In this case, you could cache the intermediate max_df and joined_df datasets, depending on which step you are developing.
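As a minimal sketch of what that looks like in plain PySpark (the session setup and the tiny example data below are my own additions for illustration, not part of the original pipeline):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the real input dataset
input_df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3)],
    ["key", "val"],
)

# Cache the small intermediate aggregate so repeated experiments reuse it
max_df = (
    input_df
    .groupBy("key")
    .agg(F.max(F.col("val")).alias("max_val"))
    .cache()
)
max_df.count()  # materialize the cache once

# Later iterations read max_df from memory instead of recomputing the groupBy
joined_df = input_df.join(max_df, "key")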
Function Calls
You should stick to native PySpark methods as much as possible and never use Python methods directly on the data (i.e. looping over individual rows or executing a UDF). PySpark methods call the underlying Spark methods, which are written in Scala and run directly against the data rather than through the Python runtime. If you use Python only as the layer for interacting with this system, instead of as the system that interacts with the data, you get all the performance benefits of Spark itself.
In the above example, only native PySpark methods are called, so this computation will be quite fast.
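To make the contrast concrete, here is a hedged sketch, building on the joined_df from the steps above; the UDF is purely illustrative of what to avoid, not something the answer recommends. Both expressions produce the same diff column, but only the native version stays inside the Spark engine:

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Slower: every row is serialized out to the Python runtime for evaluation
subtract_udf = F.udf(lambda max_val, val: max_val - val, LongType())
diff_udf_df = joined_df.withColumn(
    "diff", subtract_udf(F.col("max_val"), F.col("val"))
)

# Faster: native column arithmetic is executed by Spark itself
diff_native_df = joined_df.withColumn("diff", F.col("max_val") - F.col("val"))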
Downsampling
If you can derive your own accurate sample of a large input dataset, it can be used as the mock input for your transformations until you have perfected your logic and want to test it against the full set.
In the above case, we could downsample input_df to a single key before executing any of the steps.
I personally down-sample and cache any dataset above 1M rows before ever writing a line of PySpark code; that way my turnaround times are very fast and I'm never waiting on a large dataset just to catch a syntax bug.
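A hedged sketch of that kind of downsampling (the specific key, fraction, and seed below are arbitrary choices of mine, not recommendations from the answer):

from pyspark.sql import functions as F

# Option 1: restrict the input to a single key so every downstream step is tiny
dev_input_df = input_df.filter(F.col("key") == "a")

# Option 2: take a small random sample of a large dataset and cache it for reuse
dev_input_df = input_df.sample(fraction=0.01, seed=42).cache()
dev_input_df.count()  # materialize once so later iterations stay fast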
All Together
A good development pipeline looks like:
- Discrete chunks of code that do particular materializations you expect to re-use later but don't need to be recomputed over and over again
- Downsampled to 'small' sizes
- Cached 'small' datasets for very fast fetching
- PySpark native code only that exploits the fast underlying Spark libraries
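Pulling those points together, a minimal end-to-end sketch of such a development loop might look like the following (the dev_ names are hypothetical placeholders, not Foundry conventions):

from pyspark.sql import functions as F

# 1. Downsample the raw input once and cache it for fast, repeated iteration
dev_input_df = input_df.filter(F.col("key") == "a").cache()
dev_input_df.count()

# 2. Develop each materialization as its own small, reusable step
dev_max_df = dev_input_df.groupBy("key").agg(F.max(F.col("val")).alias("max_val"))
dev_joined_df = dev_input_df.join(dev_max_df, "key")

# 3. Keep everything in native PySpark expressions
dev_diff_df = dev_joined_df.withColumn("diff", F.col("max_val") - F.col("val"))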
Source: https://stackoverflow.com/questions/59162016/how-do-i-decrease-iteration-time-when-making-data-transformations