I am currently studying how to fetch data from a website as fast as possible. To get better speed, I am considering using multiple threads. Here is the code I used to test the difference between multi-threaded and simple POSTs:
In many cases, Python's threading doesn't improve execution speed very well... and sometimes it makes things worse. For more information, see David Beazley's PyCon 2010 presentation on the Global Interpreter Lock / the PyCon 2010 GIL slides. This presentation is very informative; I highly recommend it to anyone considering threading.
Even though David Beazley's talk explains that network traffic improves the scheduling of the Python threading module, you should use the multiprocessing module. I included this as an option in your code (see the bottom of my answer).
Running this on one of my older machines (Python 2.6.6):
current_post.mode == "Process" (multiprocessing) --> 0.2609 seconds
current_post.mode == "Multiple" (threading) --> 0.3947 seconds
current_post.mode == "Simple" (serial execution) --> 1.650 seconds
I agree with TokenMacGuy's comment, and the numbers above include moving the .join() to a separate loop. As you can see, Python's multiprocessing is significantly faster than threading here.
from multiprocessing import Process
import threading
import time
import urllib2

class Post:
    def __init__(self, website, data, mode):
        self.website = website
        self.data = data
        # mode is either:
        #   "Simple"   (simple POST)
        #   "Multiple" (multi-threaded POST)
        #   "Process"  (multiprocessing POST)
        self.mode = mode
        self.run_job()

    def post(self):
        # post the data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)
        if self.mode == "Multiple":
            time.sleep(0.001)
        # read the response
        HTMLData = open_url.read()
        #print "OK"

    def run_job(self):
        """This was refactored from the OP's code"""
        origin_time = time.time()
        if self.mode == "Multiple":
            # multi-threaded POST: start all threads, then join them all
            threads = list()
            for i in range(0, 10):
                thread = threading.Thread(target=self.post)
                thread.start()
                threads.append(thread)
            for thread in threads:
                thread.join()
        if self.mode == "Process":
            # multiprocessing POST: start all processes, then join them all
            processes = list()
            for i in range(0, 10):
                process = Process(target=self.post)
                process.start()
                processes.append(process)
            for process in processes:
                process.join()
        if self.mode == "Simple":
            # simple serial POST
            for i in range(0, 10):
                self.post()
        # calculate the time interval
        time_interval = time.time() - origin_time
        print "mode - {0}: {1}".format(self.mode, time_interval)
        return time_interval

if __name__ == "__main__":
    for method in ["Process", "Multiple", "Simple"]:
        Post("http://forum.xda-developers.com/login.php",
             "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
             method)
A DNS lookup takes time. There's nothing you can do about it. Large latencies are one reason to use multiple threads in the first place - multiple lookups and site GET/POST requests can then happen in parallel.
Dump the sleep() - it's not helping.
The biggest thing you are doing wrong, the one hurting your throughput the most, is the way you are calling thread.start() and thread.join():
for i in range(0, 10):
    thread = threading.Thread(target=current_post.post)
    thread.start()
    thread.join()
Each time through the loop, you create a thread, start it, and then wait for it to finish before moving on to the next thread. You aren't doing anything concurrently at all!
What you should probably be doing instead is:
threads = []
# start all of the threads
for i in range(0, 10):
    thread = threading.Thread(target=current_post.post)
    thread.start()
    threads.append(thread)
# now wait for them all to finish
for thread in threads:
    thread.join()
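For what it's worth, on newer Python versions (3.2+, or the futures backport on 2.x) the same start-all-then-join-all pattern can be written with concurrent.futures, which handles the bookkeeping for you. The work() function here is a hypothetical stand-in for current_post.post():

```python
from concurrent.futures import ThreadPoolExecutor

def work(i):
    # Hypothetical stand-in for current_post.post(); each call would
    # normally perform one blocking HTTP request.
    return i + 1

# submit() starts each job immediately; the with-block joins all the
# workers on exit, mirroring the start-all-then-join-all loops above.
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(work, i) for i in range(10)]
    results = [f.result() for f in futures]
print(results)
```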
Keep in mind that the only case where multi-threading can "increase speed" in Python is when you have operations like this one that are heavily I/O bound. Otherwise multi-threading does not increase "speed", since it cannot run on more than one CPU (no, not even if you have multiple cores; because of the GIL, Python doesn't work that way). You should use multi-threading when you want two things to happen at the same time, not when you want two things to run in parallel (i.e., as two processes executing simultaneously on separate cores).
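A rough way to see this for yourself is to time a purely CPU-bound task run serially and then split across two threads. Assuming CPython, the threaded run is typically no faster, because the GIL lets only one thread execute bytecode at a time:

```python
import threading
import time

def count(n):
    # Pure CPU work: this loop rarely releases the GIL, so two threads
    # running it cannot execute Python bytecode simultaneously on CPython.
    while n > 0:
        n -= 1

N = 2000000

# serial: two calls back to back
start = time.time()
count(N)
count(N)
serial = time.time() - start

# threaded: the same two calls on two threads
start = time.time()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start()
t2.start()
t1.join()
t2.join()
threaded = time.time() - start

# On CPython the threaded time is typically no better (often worse)
# than the serial time for CPU-bound work like this.
print(serial, threaded)
```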
Now, what you're doing here won't actually speed up any single DNS lookup, but it will allow multiple requests to be fired off while others are still waiting on results. Be careful how many you launch at once, though, or you will just make the response times even worse than they already are.
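One way to cap the number of in-flight requests is a semaphore shared by the worker threads. This sketch simulates each request with a short sleep rather than a real fetch, and tracks the peak concurrency to show the cap holds:

```python
import threading
import time

MAX_CONCURRENT = 3
gate = threading.Semaphore(MAX_CONCURRENT)  # at most 3 requests at once
lock = threading.Lock()
in_flight = 0
peak = 0

def fetch(i):
    # Hypothetical stand-in for one blocking HTTP request.
    global in_flight, peak
    with gate:  # blocks if MAX_CONCURRENT workers are already inside
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)  # simulate network latency
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=fetch, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)  # never exceeds MAX_CONCURRENT
```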
Also, please stop using urllib2 and use Requests instead: http://docs.python-requests.org