Effect of fetchsize and batchsize on Spark

Posted by China☆狼群 on 2019-12-07 07:46:01

Question


I want to directly control the speed at which Spark reads from and writes to an RDB, yet the parameters named in the title (fetchsize and batchsize) seemingly had no effect in my tests.

Can I conclude that fetchsize and batchsize don't work, or that my testing method is flawed? Or do they in fact affect reading and writing, and the measured times are simply reasonable for data at this scale?

Stats of batchsize, fetchsize, and the data set

/*Dataset*/
+--------------+-----------+
| Observations | Dataframe |
+--------------+-----------+
|      109,077 | Initial   |
|      345,732 | Ultimate  |
+--------------+-----------+
/*fetchsize*/
+-----------+-----------+------------------+------------------+
| fetchsize | batchsize | Reading Time(ms) | Writing Time(ms) |
+-----------+-----------+------------------+------------------+
|        10 |        10 |            2,103 |           38,428 |
|       100 |        10 |            2,123 |           38,021 |
|     1,000 |        10 |            2,032 |           38,345 |
|    10,000 |        10 |            2,016 |           37,892 |
|    50,000 |        10 |            2,017 |           37,795 |
|   100,000 |        10 |            2,055 |           38,720 |
+-----------+-----------+------------------+------------------+
/*batchsize*/
+-----------+-----------+------------------+------------------+
| fetchsize | batchsize | Reading Time(ms) | Writing Time(ms) |
+-----------+-----------+------------------+------------------+
|        10 |        10 |            2,072 |           37,977 |
|        10 |       100 |            2,077 |           36,990 |
|        10 |     1,000 |            2,034 |           36,703 |
|        10 |    10,000 |            1,979 |           36,980 |
|        10 |    50,000 |            2,043 |           36,749 |
|        10 |   100,000 |            2,005 |           36,624 |
+-----------+-----------+------------------+------------------+
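
For scale: reading the initial 109,077 rows in roughly 2 s is about 53,000 rows/s, and writing the ultimate 345,732 rows in roughly 37 s is about 9,300 rows/s; both rates stay essentially flat across four orders of magnitude of fetchsize and batchsize.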

Metrics observed from Datadog (screenshots not reproduced here)

Details that may be helpful

I created two m4.xlarge Linux instances on AWS: one runs Spark, the other hosts the RDB for data storage. I used Datadog to monitor the performance of the Spark application, particularly its reading from and writing to the RDB. Spark ran in standalone mode, and the test application simply pulls some data from a MySQL database, does some computation, and pushes the result back to MySQL.

Some details follow:

  1. JDBC properties are kept in a file, application.conf, like the following (a sketch of how the reading side might consume this block appears after this list):

    spark {
      Reading {
        url: "jdbc:mysql://address/designated database"
        driver: "com.mysql.cj.jdbc.Driver"
        user: "username"
        password: "password"
        fetchsize: "10000"
      }
      Writing {
        url: "jdbc:mysql://address/designated database"
        driver: "com.mysql.cj.jdbc.Driver"
        dbtable: "designated table"
        user: "username"
        password: "password"
        batchsize: "10000"
        truncate: "true"
      }
    }
    
  2. Logging during execution is enabled via Log4j 2; within the application, the time taken for writing is measured:

                    .
                    .
                    .
    // startTime and finishedTime are declared in the elided code above
    startTime = System.nanoTime()
    val connection = new Properties()
    configureProperties(connection, conf, "spark.Writing")
    // count() is an action, so its job also falls inside the timed window
    val ultimateObservations = ultimateResult.count()
    ultimateResult.write
        .mode(SaveMode.Overwrite)
        .jdbc(conf.getString("spark.Writing.url"),
              conf.getString("spark.Writing.dbtable"),
              connection)
    finishedTime = System.nanoTime()
    // the writes measured here take well over one second, so the seconds divisor is non-zero
    logger.info("Finished writing from Spark to MySQL, taking {} milliseconds; approximately {} rows/s",
        TimeUnit.MILLISECONDS.convert(finishedTime - startTime, TimeUnit.NANOSECONDS),
        ultimateObservations / TimeUnit.SECONDS.convert(finishedTime - startTime, TimeUnit.NANOSECONDS))
                    .
                    .
                    .
    
    /*
     * configureProperties is a customized function: it copies every key in the
     * designated config section except url and dbtable (both of which are passed
     * to .jdbc() directly) into the Properties handed to the JDBC reader/writer.
     */
    import scala.collection.JavaConverters._  // needed to iterate the Java Set returned by entrySet()

    def configureProperties(connectionEntity: Properties, conf: Config, designatedString: String): Unit = {
      val propertiesCarrier = conf.getConfig(designatedString)
      for (entry <- propertiesCarrier.entrySet().asScala) {
        if (entry.getKey().trim() != "url" && entry.getKey().trim() != "dbtable") {
          connectionEntity.put(entry.getKey(), entry.getValue().unwrapped().toString())
          // the ": Any" ascription disambiguates the overloaded SLF4J info() from Scala
          logger.info("Database configuration: ({}, {}).",
            entry.getKey(), entry.getValue().unwrapped().toString: Any)
        }
      }
    }
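
The reading side is not shown in the question. As referenced under item 1, here is a minimal sketch of how the spark.Reading block might be consumed, reusing configureProperties; the SparkSession value `spark` and the source table name are assumptions, since the Reading section defines no dbtable:

    // Sketch only: `spark` (a SparkSession) and the table name are assumed,
    // not taken from the question.
    val readConnection = new Properties()
    configureProperties(readConnection, conf, "spark.Reading")  // copies driver, user, password, fetchsize

    val initialResult = spark.read
        .jdbc(conf.getString("spark.Reading.url"),
              "designated table",  // hypothetical: spark.Reading carries no dbtable key
              readConnection)

    // Equivalent inline form, without a Properties object:
    // spark.read.format("jdbc")
    //   .option("url", conf.getString("spark.Reading.url"))
    //   .option("dbtable", "designated table")
    //   .option("fetchsize", "10000")
    //   .load()

In either form, Spark passes fetchsize through to the JDBC driver as its statement fetch-size hint.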
    

Source: https://stackoverflow.com/questions/45589632/effect-of-fetchsize-and-batchsize-on-spark
