Question
I am trying to load data from Oracle into Spark in a Jupyter notebook, but each time I try to plot a graph the time taken is huge. How do I make it faster?
query = "(select * from db.schema where lqtime between trunc(sysdate)-30 and trunc(sysdate) )"
%time df = sqlContext.read.format('jdbc').options(url="jdbc:oracle:thin:useradmin/pass12@//localhost:1521/aldb",dbtable=query,driver="oracle.jdbc.OracleDriver").load()
Now I group by node:
%time fo_node = df.select('NODE').groupBy('NODE').count().sort('count',ascending=False)
%time fo_node.show(10)
The load time is 4 minutes or more each time I run this.
Answer 1:
Using Hadoop or Apache Spark against a relational database is an anti-pattern: the database receives too many connections at once and tries to respond to too many requests simultaneously. The disk is almost certainly overwhelmed, because even with indexing and partitioning it reads data for every partition at once. I bet you have an HDD there, and I'd say that's the reason it's really slow.
To speed up loading you can try the following (a sketch combining them appears after this list):
to actually reduce the number of partitions used for loading; later on you can reshuffle the data and increase parallelism.
to select specific fields instead of *. Even if you need all the columns but one, it can make a difference.
if you run several actions on the DataFrame, it makes sense to cache it. If you don't have enough memory on the cluster, use a storage level that spills to disk, too.
to export everything from Oracle to disk once and read it back as a plain CSV file, processing it on the cluster the same way you do now with the query.
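A minimal sketch pulling those suggestions together, assuming a Spark 2.x sqlContext like the one in the question. The partition column ID, its bounds, the partition counts, and the CSV path are all assumptions for illustration; substitute a real indexed numeric column and your own values:

from pyspark import StorageLevel

# Connection details copied from the question; adjust to your environment.
url = "jdbc:oracle:thin:useradmin/pass12@//localhost:1521/aldb"

# 1. Select only the columns you need instead of *, keeping the date filter
#    so Oracle prunes rows before anything crosses the wire.
query = "(select NODE, LQTIME from db.schema where lqtime between trunc(sysdate)-30 and trunc(sysdate))"

# 2. Load with a deliberately small number of partitions so Oracle is not
#    hammered by many concurrent connections. partitionColumn must be numeric
#    (dates are only supported on newer Spark versions); ID and its bounds
#    here are hypothetical.
df = (sqlContext.read.format("jdbc")
      .options(url=url,
               dbtable=query,
               driver="oracle.jdbc.OracleDriver",
               partitionColumn="ID",   # hypothetical numeric column
               lowerBound="1",
               upperBound="1000000",
               numPartitions="4",      # few connections against the DB
               fetchsize="10000")      # larger JDBC fetch batches
      .load())

# 3. Cache the DataFrame, spilling to disk if memory is short, so the load
#    cost is paid once; then reshuffle to regain parallelism for processing.
df = df.persist(StorageLevel.MEMORY_AND_DISK).repartition(32)

fo_node = df.groupBy("NODE").count().sort("count", ascending=False)
fo_node.show(10)

# 4. Alternative: export the rows from Oracle once (e.g. a SQL*Plus spool)
#    and read the flat file instead of hitting the database on every run.
# df = sqlContext.read.csv("/data/export/lqtime_last30.csv", header=True, inferSchema=True)

With a setup like this the expensive JDBC read happens once per session, and repeated aggregations or plots run against the cached, repartitioned DataFrame instead of going back to Oracle.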
Source: https://stackoverflow.com/questions/38933456/data-load-time-when-using-spark-with-oracle