Question
I want to use Apache Spark and connect to Vertica via JDBC.
The Vertica database holds 100 million records, and the Spark code runs on another server.
When I run the query from Spark and monitor network usage, the traffic between the two servers is very high.
It seems Spark loads all the data from the target server.
This is my code:
test_df = (spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "pass")
    .load())
test_df.createOrReplaceTempView('tb')
data = spark.sql("select * from tb")
data.show()
When I run this, the result comes back after about two minutes of very high network usage.
Does Spark load the entire table from the target database?
Answer 1:
JDBC-based databases allow pushdown queries, so that only the relevant rows are read from disk. For example, df.filter("user_id == 2").count will first select only the matching records in the database and then ship them to Spark for the count. So when using JDBC: 1. plan for filters, and 2. partition your DB according to your query patterns, then optimize further from the Spark side, e.g.:
// Spark JDBC partitioning options: the read is split into numPartitions
// parallel queries over ranges of partitionColumn between the two bounds.
val prop = new java.util.Properties
prop.setProperty("driver", "org.postgresql.Driver")
prop.setProperty("partitionColumn", "user_id")
prop.setProperty("lowerBound", "1")
prop.setProperty("upperBound", "272")
prop.setProperty("numPartitions", "30")
However, most relational DBs are partitioned by specific fields in a tree-like structure, which is not ideal for complex big-data queries. I strongly suggest copying the table from JDBC to a NoSQL store such as Cassandra, MongoDB, or Elasticsearch, or to a file system such as Alluxio or HDFS, in order to enable scalable, parallel, complex, fast queries. Lastly, you can replace JDBC with AWS Redshift, which should not be that hard to implement on the backend/frontend side; from the Spark side it is a pain to deal with dependency conflicts, but it will let you run complex queries much faster, since Redshift partitions by column and you can push down aggregates on the columns themselves using multiple workers.
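A one-time copy into Parquet on HDFS, for instance, could be sketched like this (the output path is an assumption):

# Sketch: export the JDBC table to columnar Parquet files on HDFS once,
# then run later queries against the files instead of the database.
test_df.write.mode("overwrite").parquet("hdfs:///warehouse/my_table")
parquet_df = spark.read.parquet("hdfs:///warehouse/my_table")
parquet_df.filter("user_id = 2").count()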
Answer 2:
After your Spark job finishes, log on to the Vertica database using the same credentials the Spark job used and run:
SELECT * FROM v_monitor.query_requests ORDER BY start_timestamp DESC LIMIT 10000;
This will show you the queries the Spark job sent to the database, letting you see whether it pushed work down (e.g., a count(*)) to the database or indeed pulled the entire table across the network.
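As a complementary check from the Spark side, explain() prints the physical plan, where pushed predicates appear under PushedFilters for the JDBC scan (a sketch; "user_id" is an assumed column):

# Sketch: if the filter was pushed into the database, it shows up in the
# plan's PushedFilters list instead of being applied in Spark.
test_df.filter("user_id = 2").explain()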
Source: https://stackoverflow.com/questions/42267390/does-apache-spark-load-entire-data-from-target-database