How to use multiprocessing to loop through a big list of URLs?

我寻月下人不归 2020-12-16 08:36

Problem: check a list of over 1000 URLs and get each URL's return code (status_code).

The script I have works, but it is very slow.

I am thinking there has to be a better, faster way to do this.
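
For reference, the slow version is presumably a plain sequential loop along these lines (a sketch only; the original script is not shown, and the url10.txt file name is borrowed from the answer below):

    import requests

    with open("url10.txt") as f:          # one URL per line, without the scheme
        urls = f.read().splitlines()

    for url in urls:
        try:
            resp = requests.get('http://' + url, timeout=1)
            print(url, '->', resp.status_code)
        except requests.RequestException:
            print("Error", url)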

2 Answers
  • 2020-12-16 09:04

    In this case your task is I/O-bound, not processor-bound: it takes far longer for a website to reply than it does for your CPU to loop once through your script (not counting the TCP request). This means you won't get any speedup from doing this task in parallel processes (which is what multiprocessing gives you). What you want is multi-threading. The way this is achieved is by using the little-documented, perhaps poorly named, multiprocessing.dummy module, which exposes the multiprocessing API backed by threads:

    import requests
    from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool with the multiprocessing API

    urls = ['https://www.python.org',
            'https://www.python.org/about/']

    def get_status(url):
        r = requests.get(url)
        return r.status_code

    if __name__ == "__main__":
        pool = ThreadPool(4)                   # make the pool of worker threads
        results = pool.map(get_status, urls)   # fetch the urls in their own threads
        pool.close()                           # close the pool and wait for the work to finish
        pool.join()

    See here for examples of multiprocessing vs multithreading in Python.
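
    For reference, the same thread-based approach can also be written with the standard library's concurrent.futures.ThreadPoolExecutor; the sketch below is an equivalent of the code above, not part of the original answer, and the 4-worker count is arbitrary:

    import requests
    from concurrent.futures import ThreadPoolExecutor

    urls = ['https://www.python.org',
            'https://www.python.org/about/']

    def get_status(url):
        # return only the HTTP status code for one url
        r = requests.get(url, timeout=5)
        return r.status_code

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=4) as executor:
            # executor.map keeps the results in the same order as urls
            for url, status in zip(urls, executor.map(get_status, urls)):
                print(url, '->', status)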

  • 2020-12-16 09:12

    In the checkurlconnection function the parameter should be a single url, not the urls list: Pool.map calls the function once for every element of urls. If you keep the for url in urls: loop inside the function, urls refers to the global list, so every worker checks the entire list, which is not what you want.

    import requests
    from multiprocessing import Pool

    with open("url10.txt") as f:
        urls = f.read().splitlines()

    def checkurlconnection(url):
        # Pool.map passes one url per call, so no inner loop is needed
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception:
            print("Error", url)

    if __name__ == "__main__":
        p = Pool(processes=4)
        result = p.map(checkurlconnection, urls)
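
    As written, checkurlconnection prints its output and returns None, so result is just a list of None values. If the status codes are needed afterwards, a variant along these lines (a sketch, not part of the original answer) can return them instead:

    def checkurlconnection(url):
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            return url, resp.status_code
        except requests.RequestException:
            return url, None               # mark unreachable URLs with None

    if __name__ == "__main__":
        with Pool(processes=4) as p:
            for url, status in p.map(checkurlconnection, urls):
                print(url, '->', status)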