Question
I want Spark to directly control the speed of reading from and writing to an RDB, but the parameters in the title, fetchsize and batchsize, do not seem to be working.
Can I conclude that fetchsize and batchsize have no effect under my testing method? Or do they actually affect reading and writing, and the measured results are simply reasonable for data of this scale?
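For reference, the sketch below shows the usual way fetchsize and batchsize are passed to Spark's JDBC data source as reader/writer options. It is not the code under test; the URL and table names are placeholders.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("fetch-batch-sketch").getOrCreate()

// Reading: fetchsize hints how many rows the JDBC driver pulls per round trip.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host/database")        // placeholder URL
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("dbtable", "source_table")                   // placeholder table
  .option("user", "username")
  .option("password", "password")
  .option("fetchsize", "10000")
  .load()

// Writing: batchsize sets how many rows are grouped into each JDBC batch insert.
df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://host/database")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("dbtable", "target_table")                   // placeholder table
  .option("user", "username")
  .option("password", "password")
  .option("batchsize", "10000")
  .option("truncate", "true")
  .mode(SaveMode.Overwrite)
  .save()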
Stats for batchsize, fetchsize, and the data set
/*Dataset*/
+--------------+-----------+
| Observations | Dataframe |
+--------------+-----------+
| 109,077 | Initial |
| 345,732 | Ultimate |
+--------------+-----------+
/*fetchsize*/
+-----------+-----------+------------------+------------------+
| fetchsize | batchsize | Reading Time(ms) | Writing Time(ms) |
+-----------+-----------+------------------+------------------+
| 10 | 10 | 2,103 | 38,428 |
| 100 | 10 | 2,123 | 38,021 |
| 1,000 | 10 | 2,032 | 38,345 |
| 10,000 | 10 | 2,016 | 37,892 |
| 50,000 | 10 | 2,017 | 37,795 |
| 100,000 | 10 | 2,055 | 38,720 |
+-----------+-----------+------------------+------------------+
/*batchsize*/
+-----------+-----------+------------------+------------------+
| fetchsize | batchsize | Reading Time(ms) | Writing Time(ms) |
+-----------+-----------+------------------+------------------+
| 10 | 10 | 2,072 | 37,977 |
| 10 | 100 | 2,077 | 36,990 |
| 10 | 1,000 | 2,034 | 36,703 |
| 10 | 10,000 | 1,979 | 36,980 |
| 10 | 50,000 | 2,043 | 36,749 |
| 10 | 100,000 | 2,005 | 36,624 |
+-----------+-----------+------------------+------------------+
Metrics observed in Datadog
Details that may be helpful
I created two m4.xlarge Linux instances on AWS: one runs Spark and the other hosts the RDB. Datadog is used to monitor the performance of the Spark application, especially reading from and writing to the RDB. Spark runs in standalone mode, and the test application simply pulls some data from a MySQL RDB, does some computation, and pushes the result back to MySQL.
Some details follow:
JDBC properties are put in a file, application.conf, like the following:
spark {
  Reading {
    url: "jdbc:mysql://address/designated database"
    driver: "com.mysql.cj.jdbc.Driver"
    user: "username"
    password: "password"
    fetchsize: "10000"
  }
  Writing {
    url: "jdbc:mysql://address/designated database"
    driver: "com.mysql.cj.jdbc.Driver"
    dbtable: "designated table"
    user: "username"
    password: "password"
    batchsize: "10000"
    truncate: "true"
  }
}
Logging during execution is enabled via Log4j 2, and the writing time is measured inside the application:
. . .
startTime = System.nanoTime()

val connection = new Properties()
configureProperties(connection, conf, "spark.Writing")

val ultimateObservations = ultimateResult.count()

ultimateResult.write
  .mode(SaveMode.Overwrite)
  .jdbc(conf.getString("spark.Writing.url"),
        conf.getString("spark.Writing.dbtable"),
        connection)

finishedTime = System.nanoTime()

logger.info("Finished writing from Spark to MySQL, taking {} milliseconds; approximately {} rows/s",
  TimeUnit.MILLISECONDS.convert((finishedTime - startTime), TimeUnit.NANOSECONDS),
  ultimateObservations / TimeUnit.SECONDS.convert((finishedTime - startTime), TimeUnit.NANOSECONDS))
. . .

/*
 * configureProperties is a customized function
 */
def configureProperties(connectionEntity: Properties, conf: Config, designatedString: String): Unit = {
  val propertiesCarrier = conf.getConfig(designatedString)
  for (entry <- propertiesCarrier.entrySet) {
    if (entry.getKey().trim() != "url" && entry.getKey().trim() != "dbtable") {
      connectionEntity.put(entry.getKey(), entry.getValue().unwrapped().toString())
      logger.info("Database configuration: ({}, {}).",
        entry.getKey(), entry.getValue().unwrapped().toString: Any)
    }
  }
}
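The reading side was timed the same way. The snippet below is only a hedged reconstruction of it, assuming the reading path mirrors the writing path above: application.conf loaded through Typesafe Config's ConfigFactory.load(), the same configureProperties helper so that fetchsize ends up in the connection Properties, and DataFrameReader.jdbc. The source table name is a placeholder.

import java.util.Properties
import java.util.concurrent.TimeUnit
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val conf = ConfigFactory.load()                        // picks up application.conf from the classpath

// Reuses the configureProperties helper and logger defined in the snippet above.
val readConnection = new Properties()
configureProperties(readConnection, conf, "spark.Reading")   // copies driver, user, password, fetchsize

val startTime = System.nanoTime()
val initialResult = spark.read.jdbc(
  conf.getString("spark.Reading.url"),
  "designated table",                                  // placeholder for the source table
  readConnection)
val initialObservations = initialResult.count()        // forces the read to actually execute
val finishedTime = System.nanoTime()

logger.info("Finished reading from MySQL into Spark, taking {} milliseconds for {} rows",
  TimeUnit.MILLISECONDS.convert(finishedTime - startTime, TimeUnit.NANOSECONDS),
  initialObservations)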
Source: https://stackoverflow.com/questions/45589632/effect-of-fetchsize-and-batchsize-on-spark