Converting a MySQL table to a Spark Dataset is very slow compared to the same conversion from a CSV file

谎友^ 2020-12-01 15:20

I have a CSV file in Amazon S3 which is 62 MB in size (114,000 rows). I am converting it into a Spark Dataset and taking the first 500 rows from it. The code is as follows:
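
(What follows is a minimal sketch of the two reads being compared, not the original snippet: it assumes a SparkSession named spark, and the S3 path, JDBC URL, and table name are placeholders.)

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("csv-vs-jdbc").getOrCreate();

    // CSV from S3: Spark reads the file directly and in parallel. Placeholder path.
    Dataset<Row> fromCsv = spark
        .read()
        .option("header", "true")
        .csv("s3a://my-bucket/data.csv")
        .limit(500);

    // Same data over JDBC: slow by default, because the read runs on a single
    // thread of a single executor (see the answer below). Placeholder URL and table.
    Dataset<Row> fromMysql = spark
        .read()
        .format("jdbc")
        .option("url", "jdbc:mysql://host:3306/mydb")
        .option("dbtable", "my_table")
        .option("driver", "com.mysql.jdbc.Driver")
        .load()
        .limit(500);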



        
2 Answers
  •  暖寄归人
    2020-12-01 15:45

    This problem has been covered multiple times on StackOverflow:

    • How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
    • spark jdbc df limit... what is it doing?
    • How to use JDBC source to write and read data in (Py)Spark?

    and in external sources:

    • https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads

    so just to reiterate: by default DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread on a single executor.

    To distribute reads:

    • use ranges with lowerBound / upperBound:

      Dataset<Row> set = sc
          .read()
          // split the read on column "foo" into 3 non-overlapping ranges
          .option("partitionColumn", "foo")
          .option("numPartitions", "3")
          .option("lowerBound", 0)
          .option("upperBound", 30)
          .option("url", url)
          .option("dbtable", this.tableName)
          .option("driver", "com.mysql.jdbc.Driver")
          .format("jdbc")
          .load();
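
      With numPartitions = 3, lowerBound = 0 and upperBound = 30, Spark splits the foo range into three parallel queries, roughly WHERE foo < 10 OR foo IS NULL, WHERE foo >= 10 AND foo < 20, and WHERE foo >= 20. The bounds only control how the range is split: rows with foo outside 0..30 are still read, they just all land in the first or last partition.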
      
    • use predicates:

      Properties properties = new Properties();  // user, password, etc. for the JDBC connection
      Dataset<Row> set = sc
          .read()
          .jdbc(
              url, this.tableName,
              // one partition (and one query) per predicate
              new String[] {"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
              properties
          );
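
      Each predicate becomes one partition, so the predicates should be mutually exclusive and together cover every row; overlapping predicates return duplicate rows, and gaps silently drop rows.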
      
