How to measure the execution time of a query on Spark

Submitted by 别来无恙 on 2019-12-17 09:51:52

Question


I need to measure the execution time of a query on Apache Spark (Bluemix). Here is what I tried:

import time

startTimeQuery = time.clock()
df = sqlContext.sql(query)
df.show()
endTimeQuery = time.clock()
runTimeQuery = endTimeQuery - startTimeQuery

Is this a good way to do it? The time I get seems too small compared to how long it actually takes for the table to appear.
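A likely reason the number looks too small: time.clock() reports CPU time of the driver process, which mostly sits idle while the executors do the actual work. A minimal wall-clock variant of the same snippet, assuming the same sqlContext and query as above, uses time.time() instead and keeps the show() action inside the timed region:

import time

startTimeQuery = time.time()   # wall-clock time instead of driver CPU time
df = sqlContext.sql(query)     # lazy: this only builds the query plan
df.show()                      # the action that actually runs the job
endTimeQuery = time.time()
print("Query took %.3f seconds" % (endTimeQuery - startTimeQuery))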


Answer 1:


Update: No, using the time package is not the best way to measure the execution time of Spark jobs. The most convenient and exact way I know of is to use the Spark History Server.

On Bluemix, go to the "Palette" on the right side of your notebook. Choose the "Environment" panel and you will see a link to the Spark History Server, where you can inspect the completed Spark jobs, including their computation times.
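If you would rather read the timings programmatically, the History Server also exposes a REST monitoring API. A minimal sketch, assuming the server is reachable at history-host:18080 (host and port are placeholders for whatever the Environment panel links to):

import requests

BASE = "http://history-host:18080/api/v1"   # placeholder History Server address

apps = requests.get(BASE + "/applications").json()
app_id = apps[0]["id"]                      # most recent application

for job in requests.get(BASE + "/applications/" + app_id + "/jobs").json():
    # the monitoring API reports submission and completion times per job
    print(job["jobId"], job["name"], job.get("submissionTime"), job.get("completionTime"))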




Answer 2:


To do it from the command line (the Scala spark-shell), you can use spark.time().

See another response by me: https://stackoverflow.com/a/50289329/3397114

val df = sqlContext.sql(query)
spark.time(df.show())

The output would be:

+----+----+
|col1|col2|
+----+----+
|val1|val2|
+----+----+
Time taken: xxx ms
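Note that spark.time() is only in the Scala API. In a PySpark notebook, a rough stand-in is a small wall-clock context manager (a sketch, not part of any Spark API):

import time
from contextlib import contextmanager

@contextmanager
def spark_time(label="Time taken"):
    # rough PySpark stand-in for spark.time(): wall-clock time of the wrapped block
    start = time.time()
    yield
    print("%s: %d ms" % (label, (time.time() - start) * 1000))

with spark_time():
    sqlContext.sql(query).show()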

Related: On Measuring Apache Spark Workload Metrics for Performance Troubleshooting.




Answer 3:


I use a helper function wrapped around System.nanoTime, like this:

def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f                     // force evaluation of the by-name block
  println("time: " + (System.nanoTime - s) / 1e6 + " ms")
  ret
}

time {
  val df = sqlContext.sql(query)  // lazy: only builds the query plan
  df.show()                       // the action that actually triggers the job
}



Answer 4:


Spark itself provides fine-grained information about each stage of your Spark job.

You can view a running job at http://IP-MasterNode:4040, or you can enable the History Server to analyze jobs at a later time.

Refer to the Spark monitoring documentation for more information on the History Server.
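On Bluemix the History Server link mentioned in answer 1 is already set up; on a self-managed cluster you first have to turn on event logging. A minimal PySpark sketch, with a hypothetical HDFS path for the log directory:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("timed-query")
        .set("spark.eventLog.enabled", "true")                   # write event logs for the History Server
        .set("spark.eventLog.dir", "hdfs:///spark-event-logs"))  # hypothetical log directory
sc = SparkContext(conf=conf)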



Source: https://stackoverflow.com/questions/34629313/how-to-measure-the-execution-time-of-a-query-on-spark
