Converting a MySQL table to a Spark Dataset is very slow compared to the same conversion from a CSV file

谎友^ 2020-12-01 15:20

I have a CSV file in Amazon S3 which is 62 MB in size (114,000 rows). I am converting it into a Spark Dataset and taking the first 500 rows from it. The code is as follows:
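
(What follows is a minimal sketch of the two reads being compared, not the original snippet: it assumes a SparkSession named spark, and the S3 path, JDBC URL, and table name are placeholders.)

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("csv-vs-jdbc").getOrCreate();

    // CSV from S3: Spark reads the file directly and in parallel. Placeholder path.
    Dataset<Row> fromCsv = spark
        .read()
        .option("header", "true")
        .csv("s3a://my-bucket/data.csv")
        .limit(500);

    // Same data over JDBC: slow by default, because the read runs on a single
    // thread of a single executor (see the answer below). Placeholder URL and table.
    Dataset<Row> fromMysql = spark
        .read()
        .format("jdbc")
        .option("url", "jdbc:mysql://host:3306/mydb")
        .option("dbtable", "my_table")
        .option("driver", "com.mysql.jdbc.Driver")
        .load()
        .limit(500);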



        
2 Answers
  •  暖寄归人
    2020-12-01 15:45

    This problem has been covered multiple times on StackOverflow:

    • How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
    • spark jdbc df limit... what is it doing?
    • How to use JDBC source to write and read data in (Py)Spark?

    and in external sources:

    • https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads

    so just to reiterate: by default DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread on a single executor.

    To distribute reads:

    • use ranges with lowerBound / upperBound:

      Dataset<Row> set = sc
          .read()
          // split the read on column "foo" into 3 non-overlapping ranges
          .option("partitionColumn", "foo")
          .option("numPartitions", "3")
          .option("lowerBound", 0)
          .option("upperBound", 30)
          .option("url", url)
          .option("dbtable", this.tableName)
          .option("driver", "com.mysql.jdbc.Driver")
          .format("jdbc")
          .load();
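
      With numPartitions = 3, lowerBound = 0 and upperBound = 30, Spark splits the foo range into three parallel queries, roughly WHERE foo < 10 OR foo IS NULL, WHERE foo >= 10 AND foo < 20, and WHERE foo >= 20. The bounds only control how the range is split: rows with foo outside 0..30 are still read, they just all land in the first or last partition.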
      
    • use predicates:

      Properties properties = new Properties();  // user, password, etc. for the JDBC connection
      Dataset<Row> set = sc
          .read()
          .jdbc(
              url, this.tableName,
              // one partition (and one query) per predicate
              new String[] {"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
              properties
          );
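
      Each predicate becomes one partition, so the predicates should be mutually exclusive and together cover every row; overlapping predicates return duplicate rows, and gaps silently drop rows.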
      
