I have a CSV file in Amazon S3 which is 62 MB in size (114 000 rows). I am converting it into a Spark Dataset and taking the first 500 rows from it. The code is as follows:
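Roughly, the read looks like this (a simplified sketch, not the exact snippet; spark is the SparkSession and the s3a path is a placeholder for the real bucket/key):

Dataset<Row> df = spark
    .read()
    .option("header", "true")              // assuming the file has a header row
    .csv("s3a://my-bucket/my-file.csv");   // placeholder path

Dataset<Row> first500 = df.limit(500);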
This problem has been covered multiple times on StackOverflow:
and in external sources:
so just to reiterate - by default DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread and a single executor.
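For comparison, a plain jdbc read without any partitioning options (a sketch; url, tableName and properties are assumed to be defined as in the snippets below) produces a single partition, i.e. one query executed by one task:

Dataset<Row> set = sc
    .read()
    .jdbc(url, this.tableName, properties);   // no partitioning options

System.out.println(set.rdd().getNumPartitions());   // 1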
To distribute reads:
use ranges with lowerBound / upperBound:
Dataset<Row> set = sc
    .read()
    // split the read over the numeric column "foo" into 3 parallel queries
    .option("partitionColumn", "foo")
    .option("numPartitions", "3")
    .option("lowerBound", 0)
    .option("upperBound", 30)
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver", "com.mysql.jdbc.Driver")
    .format("jdbc")
    .load();
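lowerBound and upperBound don't filter any rows; they only describe how the range of partitionColumn is split into numPartitions slices. With the values above Spark issues roughly these three queries, one per partition:

    ... WHERE foo < 10 OR foo IS NULL
    ... WHERE foo >= 10 AND foo < 20
    ... WHERE foo >= 20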
use predicates:
Properties properties = new Properties();
Dataset<Row> set = sc
    .read()
    .jdbc(
        url, this.tableName,
        new String[]{"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
        properties
    );
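Each string in the array becomes the WHERE clause of one partition, so this read runs as three parallel queries. Make sure the predicates together cover all rows and don't overlap, otherwise rows will be skipped or read more than once.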