How to measure the execution time of a query on Spark

Asked by 误落风尘, 2020-12-01 19:16

I need to measure the execution time of a query on Apache Spark (Bluemix). What I tried:

import time

startTimeQuery = time.clock()
df = sqlContext.sql(query)
         


        
5 answers
  • 2020-12-01 19:37

    I wrap System.nanoTime in a helper function, like this:

    def time[A](f: => A): A = {
      val s = System.nanoTime
      val ret = f  // evaluating the by-name parameter runs the timed block
      println("time: " + (System.nanoTime - s) / 1e6 + "ms")
      ret
    }

    time {
      val df = sqlContext.sql(query)
      df.show()  // an action is needed; without it only the (lazy) query planning is timed
    }
    
  • 2020-12-01 19:37

    For those looking for or needing a Python version
    (since a pyspark Google search leads to this post):

    from time import time
    from datetime import timedelta

    class T:
        """Context manager that prints the wall-clock time spent in its block."""
        def __enter__(self):
            self.start = time()
            return self

        def __exit__(self, type, value, traceback):
            self.end = time()
            elapsed = self.end - self.start
            print(str(timedelta(seconds=elapsed)))

    Usage:

    with T():
        pass  # spark code goes here
    

    Inspired by: https://blog.usejournal.com/how-to-create-your-own-timing-context-manager-in-python-a0e944b48cf8

    This proved useful in the console or with notebooks (the Jupyter magics %%time and %timeit are limited to cell scope, which is inconvenient when you have objects shared across the notebook context).
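
    As a concrete sketch, using the query and sqlContext from the question: make sure an action such as show() or count() runs inside the block, otherwise only the lazy query planning is timed, not the actual execution.

    with T():
        df = sqlContext.sql(query)
        df.show()  # the action forces execution, so the timing covers the real work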

  • 2020-12-01 19:39

    Update: No, using the time package is not the best way to measure the execution time of Spark jobs. The most convenient and accurate way I know of is to use the Spark History Server.

    On Bluemix, go to the "Palette" on the right side of your notebook. Choose the "Environment" panel and you will see a link to the Spark History Server, where you can investigate the Spark jobs that were run, including their computation times.

  • 2020-12-01 19:41

    To do it in a spark-shell (Scala), you can use spark.time().

    See another answer of mine: https://stackoverflow.com/a/50289329/3397114

    val df = sqlContext.sql(query)
    spark.time(df.show())
    

    The output would be:

    +----+----+
    |col1|col2|
    +----+----+
    |val1|val2|
    +----+----+
    Time taken: xxx ms
    

    Related: On Measuring Apache Spark Workload Metrics for Performance Troubleshooting.

  • 2020-12-01 19:53

    Spark itself provides fine-grained information about each stage of your Spark job.

    You can view your running job at http://IP-MasterNode:4040, or you can enable the History Server to analyze the jobs at a later time.

    Refer to the Spark monitoring documentation for more info on the History Server.
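
    As a rough sketch of what enabling event logging looks like from PySpark, so the History Server has something to replay: the app name and log directory below are assumptions, not part of this answer, and spark.history.fs.logDirectory on the History Server must point at the same location.

    from pyspark.sql import SparkSession

    # Write event logs that the History Server can load later.
    # The directory must already exist and be readable by the History Server.
    spark = SparkSession.builder \
        .appName("timed-query") \
        .config("spark.eventLog.enabled", "true") \
        .config("spark.eventLog.dir", "file:///tmp/spark-events") \
        .getOrCreate()

    df = spark.sql(query)  # `query` as in the question
    df.show()              # run an action so the job actually executes and is logged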
