How to use multiprocessing to loop through a big list of URLs?

我寻月下人不归 2020-12-16 08:36

Problem: check a list of over 1000 URLs and get each URL's return code (status_code).

The script I have works, but it is very slow.

I am thinking there has to be a better, faster way to do this.
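
For reference, the slow version is presumably a plain sequential loop along these lines (a sketch only; the original script is not shown, and the url10.txt file name is borrowed from the answer below):

    import requests

    with open("url10.txt") as f:          # one URL per line, without the scheme
        urls = f.read().splitlines()

    for url in urls:
        try:
            resp = requests.get('http://' + url, timeout=1)
            print(url, '->', resp.status_code)
        except requests.RequestException:
            print("Error", url)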

2 Answers
  • 2020-12-16 09:04

    In this case your task is I/O-bound, not processor-bound: it takes far longer for a website to reply than it does for your CPU to loop once through your script (not counting the TCP request). This means you won't get any speedup from doing this task in parallel processes (which is what multiprocessing gives you). What you want is multi-threading. The way this is achieved is by using the little-documented, perhaps poorly named, multiprocessing.dummy module, which exposes the multiprocessing API backed by threads:

    import requests
    from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool with the multiprocessing API

    urls = ['https://www.python.org',
            'https://www.python.org/about/']

    def get_status(url):
        r = requests.get(url)
        return r.status_code

    if __name__ == "__main__":
        pool = ThreadPool(4)                   # make the pool of worker threads
        results = pool.map(get_status, urls)   # fetch the urls in their own threads
        pool.close()                           # close the pool and wait for the work to finish
        pool.join()

    See here for examples of multiprocessing vs multithreading in Python.
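
    For reference, the same thread-based approach can also be written with the standard library's concurrent.futures.ThreadPoolExecutor; the sketch below is an equivalent of the code above, not part of the original answer, and the 4-worker count is arbitrary:

    import requests
    from concurrent.futures import ThreadPoolExecutor

    urls = ['https://www.python.org',
            'https://www.python.org/about/']

    def get_status(url):
        # return only the HTTP status code for one url
        r = requests.get(url, timeout=5)
        return r.status_code

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=4) as executor:
            # executor.map keeps the results in the same order as urls
            for url, status in zip(urls, executor.map(get_status, urls)):
                print(url, '->', status)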

  • 2020-12-16 09:12

    In the checkurlconnection function the parameter should be a single url, not the urls list: Pool.map calls the function once for every element of urls. If you keep the for url in urls: loop inside the function, urls refers to the global list, so every worker checks the entire list, which is not what you want.

    import requests
    from multiprocessing import Pool

    with open("url10.txt") as f:
        urls = f.read().splitlines()

    def checkurlconnection(url):
        # Pool.map passes one url per call, so no inner loop is needed
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception:
            print("Error", url)

    if __name__ == "__main__":
        p = Pool(processes=4)
        result = p.map(checkurlconnection, urls)
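
    As written, checkurlconnection prints its output and returns None, so result is just a list of None values. If the status codes are needed afterwards, a variant along these lines (a sketch, not part of the original answer) can return them instead:

    def checkurlconnection(url):
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            return url, resp.status_code
        except requests.RequestException:
            return url, None               # mark unreachable URLs with None

    if __name__ == "__main__":
        with Pool(processes=4) as p:
            for url, status in p.map(checkurlconnection, urls):
                print(url, '->', status)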