PySpark Dataframe Performance Tuning

Submitted by 北战南征 on 2020-05-17 06:25:07

Question


I am trying to consolidate some scripts so that we read the database once, rather than every script reading the same data from Hive; in other words, moving to a read-once, process-many model.

I've persisted the dataframes and repartitioned the output after each aggregation, but I need it to be faster; if anything, those changes have slowed it down. We have 20 TB+ of data per day, so I had assumed that persisting data that is going to be read many times would make things faster, but it hasn't.

Also, I have lots of jobs that run off the same data, like below. Can we run them in parallel? Can the definition and output of df2 happen at the same time as the definition of df3, to help speed it up?

df = definedf....persist()
df2 = df.groupby....
df3 = df.groupby....
....

Is it possible to define a globally cached dataframe that other scripts can call on?

Thanks a lot!


Answer 1:


In Scala we can do it as shown below. Maybe this code will help you convert it, or apply the same logic, in Python.


scala> :paste
// Entering paste mode (ctrl-D to finish)

// Define all your parallel logics inside some classes like below

import org.apache.spark.sql.DataFrame      // for the DataFrame type
import org.apache.spark.sql.functions.col  // for col()

trait Common extends Product with Serializable {
    def process: DataFrame
}
case class A(df: DataFrame) extends Common{
  def process = {
      Thread.sleep(4000) // To show you, I have added sleep method
      println("Inside A case class")
      df.filter(col("id") <= 2)
  }
}

case class B(df: DataFrame) extends Common {
  def process = {
      Thread.sleep(1000) // To show you, I have added sleep method
      println("Inside B case class")
      df.filter(col("id") > 5 && col("id") <= 7)
  }
}

case class C(df: DataFrame) extends Common {
  def process = {
      Thread.sleep(3000) // To show you, I have added sleep method
      println("Inside C case class")
      df.filter(col("id") > 9 && col("id") <= 12)
  }
}

// Exiting paste mode, now interpreting.

defined trait Common
defined class A
defined class B
defined class C

scala> val df = (0 to 100).toDF("id").cache // Create & cache your DF.
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int]

scala> Seq(A(df),B(df),C(df)).par.map(_.process).reduce(_ union _).show(false) // Put all the objects you want to invoke in parallel into a list

Inside B case class
Inside C case class
Inside A case class
+---+
|id |
+---+
|0  |
|1  |
|2  |
|6  |
|7  |
|10 |
|11 |
|12 |
+---+


scala>
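
For reference, a rough PySpark sketch of the same logic could look like the following. The function names, the thread pool, and the spark.range input data are illustrative only; also note that filter() is a lazy transformation, so the real benefit of submitting from multiple threads shows up when each thread triggers its own action (a count, a write, and so on).

# Rough PySpark sketch; process_a/b/c and the input range are illustrative only.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 101).toDF("id").cache()   # create & cache the shared DataFrame

def process_a(d: DataFrame) -> DataFrame:
    return d.filter(col("id") <= 2)

def process_b(d: DataFrame) -> DataFrame:
    return d.filter((col("id") > 5) & (col("id") <= 7))

def process_c(d: DataFrame) -> DataFrame:
    return d.filter((col("id") > 9) & (col("id") <= 12))

# Submit the three branches from separate threads; each thread could also run its
# own action (count/write) so that the Spark scheduler overlaps the jobs.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(p, df) for p in (process_a, process_b, process_c)]
    results = [f.result() for f in futures]

reduce(DataFrame.union, results).show()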




Answer 2:


I've persisted the dataframes and repartitioned the output after each aggregation, but I need it to be faster; if anything, those changes have slowed it down.

  • Repartition results in a data shuffle between the nodes in the cluster with corresponding performance cost.

  • Persisting the dataframe means it can be reused across Spark actions without being recomputed, so it will usually be beneficial if your script contains multiple Spark actions. (Note that the groupBy statements in your example are transformations, not actions.)

  • The default storage level for persist on a dataframe is MEMORY_AND_DISK (see the sketch after this list).
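
A minimal sketch of that read-once, reuse-many pattern (the table name, column names and output paths below are hypothetical):

# Hypothetical names throughout; shown only to illustrate persist + reuse.
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("my_hive_table")                      # hypothetical Hive table

df = df.persist(StorageLevel.MEMORY_AND_DISK)               # explicit; same as the DataFrame default
df.count()                                                  # a first action materialises the cache

df.groupBy("key").count().write.parquet("/out/agg_a")       # each write is an action
df.groupBy("key").sum("value").write.parquet("/out/agg_b")  # and reuses the cached data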

Also, I have lots of jobs that run off the same data, like below. Can we run them in parallel?

  • The purpose of Spark is to make use of a cluster of machines to run jobs in a distributed, parallel fashion. Jobs run in series, and if Spark is correctly configured, and particularly if you program with dataframes, it will make optimal use of the cluster resources to compute each job as efficiently as possible. You won't typically get a benefit from trying to layer your own parallelism on top; after all, two jobs running in parallel will be competing for the same resources.

  • Transformations such as groupBy will not be run in the order they are declared, but in the order in which the actions that depend on them are executed (see the sketch after this list).
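
To make that ordering concrete, a tiny sketch (df is the persisted dataframe from above; the column names are hypothetical):

# Both groupBy calls only build a query plan; Spark jobs run when the actions run.
agg_late = df.groupBy("key").avg("value")    # declared first, nothing executes yet
agg_early = df.groupBy("key").count()        # declared second, still nothing executes
agg_early.show()                             # first Spark job runs here
agg_late.show()                              # second Spark job runs here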

Is it possible to define a globally cached dataframe that other scripts can call on?

  • persist will cache the dataframe so that it can be shared across jobs within the same Spark session. Each Spark application has its own Spark session, so data is not shared between applications/scripts. Applications that need to share data do so via files (see the sketch below).
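
A minimal illustration of that file-based handover (the path is hypothetical):

# Writer application: materialise the shared data to storage once.
df.write.mode("overwrite").parquet("/shared/daily_snapshot")   # hypothetical path

# Reader application (a separate script with its own SparkSession): pick it up again.
shared_df = spark.read.parquet("/shared/daily_snapshot")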



Answer 3:


Persisting your DF doesn't guarantee that it actually gets persisted; that depends on the storage memory fraction available on your worker nodes. If you just call .persist() on an RDD, Spark defaults to the MEMORY_ONLY storage level, which caches only as much as fits in your storage memory fraction, and the rest is recomputed every time you use it (perform any action on it). For a dataframe the default is MEMORY_AND_DISK, so partitions that don't fit in memory are spilled to disk rather than recomputed.

I would suggest increasing the memory on your worker nodes; and if you aren't performing any intensive computation, you can reduce the execution memory instead. The JVM also spends a lot of time serialising and deserialising, so with this much data you can use off-heap memory (disabled by default) by setting the spark.memory.offHeap.enabled property; off-heap storage uses Spark's Tungsten format to store data efficiently.
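
For example, off-heap memory can be enabled when the session is built; the size below is purely illustrative and must be tuned to your cluster:

# Illustrative configuration only.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.memory.offHeap.enabled", "true")
         .config("spark.memory.offHeap.size", "16g")
         .getOrCreate())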



Source: https://stackoverflow.com/questions/61388397/pyspark-dataframe-performance-tuning
