Data load time when using Spark with Oracle


Question


I am trying to load data from Oracle into Spark in a Jupyter notebook, but each time I try to plot a graph the data load takes a huge amount of time. How do I make it faster?

query = "(select * from db.schema where lqtime between trunc(sysdate)-30 and trunc(sysdate) )"
%time df = sqlContext.read.format('jdbc').options(url="jdbc:oracle:thin:useradmin/pass12@//localhost:1521/aldb",dbtable=query,driver="oracle.jdbc.OracleDriver").load()

Now I group by node:

%time fo_node = df.select('NODE').groupBy('NODE').count().sort('count',ascending=False)
%time fo_node.show(10)

The load takes four minutes or more each time I run this.


Answer 1:


Using Hadoop or Apache Spark against a relational database is an anti-pattern: the database receives too many connections at once and tries to serve too many requests at once. The disk is almost certainly the bottleneck; even with an index and partitioning, it has to read data for every partition at the same time. I bet you have an HDD there, and I'd say that is why it is so slow.

To speed up the load, you can try:

  1. actually reducing the number of partitions used for loading; later on you can repartition the data and raise parallelism again (see the sketch after this list).

  2. selecting specific fields instead of *. Even if you need every column but one, it can make a difference.

  3. caching the DataFrame if you run several actions on it. If the cluster does not have enough memory, use a storage level that spills to disk as well.

  4. exporting everything from Oracle to disk, reading it back as a plain CSV file, and running the same query on the cluster.
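
Here is a minimal sketch of points 1–3, reusing the connection details from the question and assuming Spark 2.x (where sqlContext is still available). The numeric "id" partition column, its bounds, numPartitions=4, and fetchsize=10000 are illustrative assumptions, not something the question guarantees exists; Spark versions of this era require the JDBC partition column to be numeric.

from pyspark import StorageLevel

# Point 2: prune columns in the pushed-down subquery instead of select *
query = """(select id, node, lqtime
            from db.schema
            where lqtime between trunc(sysdate)-30 and trunc(sysdate))"""

df = (sqlContext.read.format('jdbc')
      .options(url="jdbc:oracle:thin:useradmin/pass12@//localhost:1521/aldb",
               dbtable=query,
               driver="oracle.jdbc.OracleDriver",
               # Point 1: a small, explicit number of range partitions keeps
               # the count of concurrent Oracle connections low; "id" is a
               # hypothetical numeric key and the bounds are placeholders
               partitionColumn="id",
               lowerBound="1",
               upperBound="1000000",
               numPartitions="4",
               # Larger JDBC fetch size means fewer round trips per partition
               fetchsize="10000")
      .load())

# Point 3: persist once, spilling to disk when memory runs out, so the
# groupBy/show below read from the cache instead of Oracle every time
df.persist(StorageLevel.MEMORY_AND_DISK)

fo_node = df.select('NODE').groupBy('NODE').count().sort('count', ascending=False)
fo_node.show(10)

With the read split into a few bounded range scans, Oracle answers a handful of modest queries rather than one giant unpartitioned one, and the persisted result keeps later actions off the database. For point 4, the one-off export could be read back with something like sqlContext.read.csv('/path/to/dump.csv') in Spark 2.x (hypothetical path).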



Source: https://stackoverflow.com/questions/38933456/data-load-time-when-using-spark-with-oracle
