I am currently studying how to fetch data from a website as fast as possible. To get better speed, I am considering using multiple threads. Here is the code I used to test the difference between multi-threaded and simple POSTs:
In many cases, Python's threading doesn't improve execution speed very well... and sometimes it makes things worse. For more information, see David Beazley's PyCon 2010 presentation on the Global Interpreter Lock / the PyCon 2010 GIL slides. This presentation is very informative; I highly recommend it to anyone considering threading.
Even though David Beazley's talk explains that network traffic improves the scheduling of the Python threading module, you should use the multiprocessing module. I included this as an option in your code (see the bottom of my answer).
Running this on one of my older machines (Python 2.6.6):
current_post.mode == "Process" (multiprocessing) --> 0.2609 seconds
current_post.mode == "Multiple" (threading) --> 0.3947 seconds
current_post.mode == "Simple" (serial execution) --> 1.650 seconds
I agree with TokenMacGuy's comment, and the numbers above include moving the .join() to a separate loop. As you can see, Python's multiprocessing is significantly faster than threading here.
from multiprocessing import Process
import threading
import time
import urllib2

class Post:
    def __init__(self, website, data, mode):
        self.website = website
        self.data = data
        # mode is either:
        #   "Simple"   (simple POST)
        #   "Multiple" (multi-threaded POST)
        #   "Process"  (multiprocessing POST)
        self.mode = mode
        self.run_job()

    def post(self):
        # post the data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)
        if self.mode == "Multiple":
            time.sleep(0.001)
        # read the response
        HTMLData = open_url.read()
        #print "OK"

    def run_job(self):
        """This was refactored from the OP's code"""
        origin_time = time.time()
        if self.mode == "Multiple":
            # multi-threaded POST: start all threads, then join them all
            threads = list()
            for i in range(0, 10):
                thread = threading.Thread(target=self.post)
                thread.start()
                threads.append(thread)
            for thread in threads:
                thread.join()
        if self.mode == "Process":
            # multiprocessing POST: start all processes, then join them all
            processes = list()
            for i in range(0, 10):
                process = Process(target=self.post)
                process.start()
                processes.append(process)
            for process in processes:
                process.join()
        if self.mode == "Simple":
            # simple serial POST
            for i in range(0, 10):
                self.post()
        # calculate the time interval
        time_interval = time.time() - origin_time
        print "mode - {0}: {1}".format(self.mode, time_interval)
        return time_interval

if __name__ == "__main__":
    for method in ["Process", "Multiple", "Simple"]:
        Post("http://forum.xda-developers.com/login.php",
             "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
             method)
A DNS lookup takes time. There's nothing you can do about it. Large latencies are one reason to use multiple threads in the first place - multiple lookups and site GET/POST requests can then happen in parallel.
Dump the sleep() - it's not helping.
The biggest thing you are doing wrong, the one hurting your throughput the most, is the way you are calling thread.start() and thread.join():
for i in range(0, 10):
    thread = threading.Thread(target=current_post.post)
    thread.start()
    thread.join()
Each time through the loop, you create a thread, start it, and then wait for it to finish before moving on to the next thread. You aren't doing anything concurrently at all!
What you should probably be doing instead is:
threads = []
# start all of the threads
for i in range(0, 10):
    thread = threading.Thread(target=current_post.post)
    thread.start()
    threads.append(thread)
# now wait for them all to finish
for thread in threads:
    thread.join()
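For what it's worth, on newer Python versions (3.2+, or the futures backport on 2.x) the same start-all-then-join-all pattern can be written with concurrent.futures, which handles the bookkeeping for you. The work() function here is a hypothetical stand-in for current_post.post():

```python
from concurrent.futures import ThreadPoolExecutor

def work(i):
    # Hypothetical stand-in for current_post.post(); each call would
    # normally perform one blocking HTTP request.
    return i + 1

# submit() starts each job immediately; the with-block joins all the
# workers on exit, mirroring the start-all-then-join-all loops above.
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(work, i) for i in range(10)]
    results = [f.result() for f in futures]
print(results)
```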
Keep in mind that the only case where multi-threading can "increase speed" in Python is when you have operations like this one that are heavily I/O bound. Otherwise multi-threading does not increase "speed", since it cannot run on more than one CPU (no, not even if you have multiple cores; because of the GIL, Python doesn't work that way). You should use multi-threading when you want two things to happen at the same time, not when you want two things to run in parallel (i.e., as two processes executing simultaneously on separate cores).
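A rough way to see this for yourself is to time a purely CPU-bound task run serially and then split across two threads. Assuming CPython, the threaded run is typically no faster, because the GIL lets only one thread execute bytecode at a time:

```python
import threading
import time

def count(n):
    # Pure CPU work: this loop rarely releases the GIL, so two threads
    # running it cannot execute Python bytecode simultaneously on CPython.
    while n > 0:
        n -= 1

N = 2000000

# serial: two calls back to back
start = time.time()
count(N)
count(N)
serial = time.time() - start

# threaded: the same two calls on two threads
start = time.time()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start()
t2.start()
t1.join()
t2.join()
threaded = time.time() - start

# On CPython the threaded time is typically no better (often worse)
# than the serial time for CPU-bound work like this.
print(serial, threaded)
```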
Now, what you're doing here won't actually speed up any single DNS lookup, but it will allow multiple requests to be fired off while others are still waiting on results. Be careful how many you launch at once, though, or you will just make the response times even worse than they already are.
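One way to cap the number of in-flight requests is a semaphore shared by the worker threads. This sketch simulates each request with a short sleep rather than a real fetch, and tracks the peak concurrency to show the cap holds:

```python
import threading
import time

MAX_CONCURRENT = 3
gate = threading.Semaphore(MAX_CONCURRENT)  # at most 3 requests at once
lock = threading.Lock()
in_flight = 0
peak = 0

def fetch(i):
    # Hypothetical stand-in for one blocking HTTP request.
    global in_flight, peak
    with gate:  # blocks if MAX_CONCURRENT workers are already inside
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)  # simulate network latency
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=fetch, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)  # never exceeds MAX_CONCURRENT
```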
Also, please stop using urllib2 and use Requests instead: http://docs.python-requests.org