How to get faster speed when using multi-threading in Python

天命终不由人 2020-12-01 15:00

Now I am studying how to fetch data from a website as fast as possible. To get faster speed, I'm considering using multi-threading. Here is the code I used to test the difference:

4 Answers
  • 2020-12-01 15:28

    In many cases, Python's threading doesn't improve execution speed very much... sometimes it makes it worse. For more information, see David Beazley's PyCon 2010 presentation on the Global Interpreter Lock / PyCon 2010 GIL slides. The presentation is very informative; I highly recommend it to anyone considering threading...

    Even though David Beazley's talk explains that network traffic improves the scheduling of the Python threading module, you should use the multiprocessing module. I included this as an option in your code (see the bottom of my answer).

    Running this on one of my older machines (Python 2.6.6):

    current_post.mode == "Process"  (multiprocessing)  --> 0.2609 seconds
    current_post.mode == "Multiple" (threading)        --> 0.3947 seconds
    current_post.mode == "Simple"   (serial execution) --> 1.650 seconds
    

    I agree with TokenMacGuy's comment, and the numbers above already include moving the .join() to a different loop. As you can see, Python's multiprocessing is significantly faster than threading.


    from multiprocessing import Process
    import threading
    import time
    import urllib2
    
    
    class Post:
    
        def __init__(self, website, data, mode):
            self.website = website
            self.data = data
    
            #mode is either:
            #   "Simple"      (Simple POST)
            #   "Multiple"    (Multi-thread POST)
            #   "Process"     (Multiprocessing)
            self.mode = mode
            self.run_job()
    
        def post(self):
    
            #post data
            req = urllib2.Request(self.website)
            open_url = urllib2.urlopen(req, self.data)
    
            if self.mode == "Multiple":
                time.sleep(0.001)
    
            #read HTMLData
            HTMLData = open_url.read()
    
            #print "OK"
    
        def run_job(self):
            """This was refactored from the OP's code"""
            origin_time = time.time()
            if(self.mode == "Multiple"):
    
                #multithreading POST
                threads = list()
                for i in range(0, 10):
                   thread = threading.Thread(target = self.post)
                   thread.start()
                   threads.append(thread)
                for thread in threads:
                   thread.join()
                #calculate the time interval
                time_interval = time.time() - origin_time
                print "mode - {0}: {1}".format(method, time_interval)
    
            if(self.mode == "Process"):
    
                #multiprocessing POST
                processes = list()
                for i in range(0, 10):
                   process = Process(target=self.post)
                   process.start()
                   processes.append(process)
                for process in processes:
                   process.join()
                #calculate the time interval
                time_interval = time.time() - origin_time
                print "mode - {0}: {1}".format(method, time_interval)
    
            if(self.mode == "Simple"):
    
                #simple POST
                for i in range(0, 10):
                    self.post()
                #calculate the time interval
                time_interval = time.time() - origin_time
                print "mode - {0}: {1}".format(method, time_interval)
            return time_interval
    
    if __name__ == "__main__":
    
        for method in ["Process", "Multiple", "Simple"]:
            Post("http://forum.xda-developers.com/login.php", 
                "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
                method
                )
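
    For reference, the same fan-out can be written more compactly with a process pool. This is only a sketch, assuming Python 3 (where urllib2 has become urllib.request); post_once and the module-level URL/DATA constants are my own names, filled in with the URL and form data from the question, since Pool workers need a top-level function to call:

    from multiprocessing import Pool
    import time
    import urllib.request

    URL = "http://forum.xda-developers.com/login.php"
    DATA = b"vb_login_username=test&vb_login_password&securitytoken=guest&do=login"

    def post_once(_):
        # one blocking POST; each worker process waits on the network independently
        with urllib.request.urlopen(URL, DATA) as response:
            return len(response.read())

    if __name__ == "__main__":
        start = time.time()
        with Pool(processes=10) as pool:    # ten worker processes
            pool.map(post_once, range(10))  # ten POSTs run in parallel
        print("Process pool: {:.4f} seconds".format(time.time() - start))
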
    
  • 2020-12-01 15:34

    A DNS lookup takes time. There's nothing you can do about it. Large latencies are one reason to use multiple threads in the first place - multiple lookups and site GET/POSTs can then happen in parallel.

    Dump the sleep() - it's not helping.
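
    To see the effect on the lookups alone, you can resolve a few hostnames from separate threads so the network waits overlap. A rough sketch, assuming Python 3 and using arbitrary example hostnames:

    import socket
    import threading
    import time

    HOSTS = ["example.com", "python.org", "wikipedia.org", "github.com"]

    def resolve(host):
        # each lookup blocks on the network, so the waits overlap across threads
        print(host, socket.gethostbyname(host))

    start = time.time()
    threads = [threading.Thread(target=resolve, args=(h,)) for h in HOSTS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("resolved {} hosts in {:.3f} seconds".format(len(HOSTS), time.time() - start))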

  • 2020-12-01 15:43

    The biggest thing you are doing wrong, and the thing hurting your throughput the most, is the way you are calling thread.start() and thread.join():

    for i in range(0, 10):
        thread = threading.Thread(target=current_post.post)
        thread.start()
        thread.join()
    

    Each time through the loop, you create a thread, start it, and then wait for it to finish before moving on to the next one. You aren't doing anything concurrently at all!

    What you should probably be doing instead is:

    threads = []
    
    # start all of the threads
    for i in range(0, 10):
        thread = threading.Thread(target=current_post.post)
        thread.start()
        threads.append(thread)

    # now wait for them all to finish
    for thread in threads:
        thread.join()
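
    On Python 3, the same start-everything-first, join-afterwards pattern comes ready-made in concurrent.futures. A minimal sketch, assuming current_post is the object from the question:

    from concurrent.futures import ThreadPoolExecutor

    # all ten jobs are submitted up front and run concurrently;
    # leaving the with-block waits for every one of them to finish
    with ThreadPoolExecutor(max_workers=10) as executor:
        for i in range(10):
            executor.submit(current_post.post)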
    
  • 2020-12-01 15:45

    Keep in mind that the only case where multi-threading can "increase speed" in Python is when you have operations like this one that are heavily I/O bound. Otherwise multi-threading does not increase "speed", since it cannot run on more than one CPU (no, not even if you have multiple cores; Python's GIL doesn't allow it). Use multi-threading when you want two things to happen concurrently, not when you want them to run in parallel (i.e. as two separate processes on separate CPUs).
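
    You can verify this yourself by timing a CPU-bound function serially and with two threads. In a rough sketch like the one below (an arbitrary busy-loop, Python 3), both versions take about the same wall-clock time, because the GIL only lets one thread execute Python bytecode at a time:

    import threading
    import time

    def burn():
        # pure-Python CPU work; it rarely releases the GIL
        total = 0
        for i in range(5000000):
            total += i
        return total

    start = time.time()
    burn()
    burn()
    print("serial:   {:.2f} seconds".format(time.time() - start))

    start = time.time()
    threads = [threading.Thread(target=burn) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("threaded: {:.2f} seconds".format(time.time() - start))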

    Now, what you're doing will not make any single DNS lookup faster, but it will allow multiple requests to be fired off while you are still waiting for the results of others. Be careful about how many you issue at once, though, or you will just make the response times even worse than they already are.

    Also, please stop using urllib2 and use Requests instead: http://docs.python-requests.org
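
    For completeness, the POST from the question looks like this with Requests (a third-party package, installed with pip install requests; the form fields below are just the ones from the question's query string):

    import requests

    payload = {
        "vb_login_username": "test",
        "vb_login_password": "",
        "securitytoken": "guest",
        "do": "login",
    }
    # Requests encodes the form data and manages the connection for you
    response = requests.post("http://forum.xda-developers.com/login.php", data=payload)
    print(response.status_code, len(response.text))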
