Problem: Check a list of over 1000 URLs and get each URL's return code (status_code).
The script I have works, but it is very slow.
I am thinking there has to be a better way.
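In outline, the slow approach is a plain sequential loop, something like the sketch below (the file name url10.txt and the other details are assumptions, not the exact script):

import requests

with open("url10.txt") as f:
    urls = f.read().splitlines()

# Each request blocks until the previous one finishes, so 1000+ URLs
# take a long time even when most of them respond quickly.
for url in urls:
    try:
        resp = requests.get('http://' + url, timeout=1)
        print(url, '->', resp.status_code)
    except Exception:
        print("Error", url)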
In this case your task is I/O bound, not processor bound: it takes far longer for a website to reply than it takes your CPU to run one pass of your script (excluding the time spent waiting on the network). This means you won't get a speedup from spreading the work across processes (which is what multiprocessing does). What you want is multi-threading, so many requests can be waiting on the network at once. The way to get it is the little-documented, and perhaps poorly named, multiprocessing.dummy module, which provides the same Pool API backed by threads:
import requests
from multiprocessing.dummy import Pool as ThreadPool

urls = ['https://www.python.org',
        'https://www.python.org/about/']

def get_status(url):
    r = requests.get(url)
    return r.status_code

if __name__ == "__main__":
    pool = ThreadPool(4)                   # make a pool of 4 worker threads
    results = pool.map(get_status, urls)   # fetch the URLs, each in its own thread
    pool.close()                           # close the pool and wait for the work to finish
    pool.join()
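One thing to watch with the snippet above: requests.get with no timeout can hang on an unresponsive host, and a single connection error will raise out of pool.map and abort the whole run. A more defensive get_status might look like this (the 5-second timeout is my own choice, not part of the original answer):

def get_status(url):
    try:
        r = requests.get(url, timeout=5)   # don't wait forever on an unresponsive host
        return r.status_code
    except requests.RequestException:
        return None                        # None marks URLs that could not be reached

pool.map still returns the status codes in the same order as the input list, with None for any URL that failed.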
See here for examples of multiprocessing vs multithreading in Python.
In the checkurlconnection function, the parameter should be a single url, not the whole list, and the inner for loop should be removed. Pool.map already calls the function once per URL from the list; as written, the loop iterates over the global urls variable, so every worker checks the entire list again, which is not what you want.
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(url):
    # Pool.map hands this function one URL at a time, so no inner loop is needed
    url = 'http://' + url
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception:
        print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
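As written, checkurlconnection prints its results and returns None, so result ends up as a list of Nones. If you also want the status codes back in the main process, one option (a small sketch of my own, not part of the original answer) is to return them instead of printing:

def checkurlconnection(url):
    url = 'http://' + url
    try:
        resp = requests.get(url, timeout=1)
        return url, resp.status_code
    except Exception:
        return url, None                     # None marks URLs that failed or timed out

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
    print(result)                            # (url, status) pairs, in the same order as url10.txt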