asyncio web scraping 101: fetching multiple urls with aiohttp

故里飘歌 2020-12-08 00:59

In an earlier question, one of the authors of aiohttp kindly suggested a way to fetch multiple URLs with aiohttp using the new async with syntax from

2 answers
  • 2020-12-08 01:27

    I am far from an asyncio expert, but to catch the error you need to catch a socket error:

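    import socket

    import aiohttp  # aiohttp.Timeout exists only in older aiohttp releases

    async def fetch(session, url):
        with aiohttp.Timeout(10):  # give up after 10 seconds
            try:
                async with session.get(url) as response:
                    print(response.status == 200)
                    return await response.text()
            except socket.error as e:
                # a failed connection is reported here instead of propagating
                print(e.strerror)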
    

    Running the code and printing the_results:

    Cannot connect to host sdfkhskhgklhskljhgsdfksjh.com:80 ssl:False [Can not connect to sdfkhskhgklhskljhgsdfksjh.com:80 [Name or service not known]]
    True
    True
    ({<Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!DOCTYPE ht...y>\n</html>\n'>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result=None>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!doctype ht.../body></html>'>}, set())
    

    You can see that we catch the error and that the remaining calls still succeed and return the HTML.
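
The tuple printed above has the shape that `asyncio.wait` returns: a set of finished tasks and a set of still-pending ones. The sketch below reproduces that shape without any network access. `fake_fetch`, `run_all`, and the example hostnames are hypothetical stand-ins for the real `fetch`, and the connection failure is simulated by raising an `OSError` by hand.

```python
import asyncio


async def fake_fetch(url):
    """Hypothetical stand-in for fetch(): no network, failure is simulated."""
    try:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        if "bad" in url:
            raise OSError("Name or service not known")
        return "<html>{}</html>".format(url)
    except OSError as e:
        # The error is handled here, so this task finishes with result None.
        print(e)


async def run_all(urls):
    # Wrap the coroutines in tasks; newer Python versions require this for wait().
    tasks = [asyncio.create_task(fake_fetch(u)) for u in urls]
    done, pending = await asyncio.wait(tasks)
    return done, pending


done, pending = asyncio.run(run_all(["http://bad.invalid", "http://ok.example"]))
results = {task.result() for task in done}
```

Since the exception is swallowed inside the coroutine, the failed task shows up in `done` with a result of `None`, which matches the `result=None` entry in the output above.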

    We should really be catching OSError, since socket.error has been a deprecated alias of OSError since Python 3.3:

    async def fetch(session, url):
        with aiohttp.Timeout(10):
            try:
                async with session.get(url) as response:
                    return await response.text()
            except OSError as e:
                print(e)
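
A quick way to confirm the alias, and to see that name-resolution failures fall under the same `except OSError` clause, is a standard-library check like the one below. It is only a sketch: `classify` is a hypothetical helper, and the exceptions are raised by hand rather than by a real connection attempt.

```python
import socket

# Since Python 3.3, socket.error is the very same class object as OSError.
print(socket.error is OSError)


def classify(exc):
    """Hypothetical helper: report whether `except OSError` would catch exc."""
    try:
        raise exc
    except OSError:
        return "caught"
    except Exception:
        return "not caught"


outcomes = [
    classify(socket.gaierror("Name or service not known")),  # DNS failure
    classify(ConnectionRefusedError("connection refused")),
    classify(ValueError("unrelated error")),
]
```

The first two exceptions are subclasses of `OSError`, so a single `except OSError` clause handles both; the unrelated `ValueError` is not caught by it.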
    

    If you also want to check that the response status is 200, put your if inside the try as well; the reason attribute gives more information:

    async def fetch(session, url):
        with aiohttp.Timeout(10):
            try:
                async with session.get(url) as response:
                    if response.status != 200:
                        print(response.reason)
                    return await response.text()
            except OSError as e:
                print(e.strerror)
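
`aiohttp.Timeout`, as used in the snippets above, comes from older aiohttp releases. On recent releases, a roughly equivalent version might look like the sketch below. It assumes `aiohttp.ClientTimeout`, the session used as an async context manager, and `raise_for_status()`, which raises for 4xx/5xx responses rather than for everything other than 200. Returning `None` on failure is an illustrative choice, not something the answer prescribes.

```python
import asyncio

import aiohttp


async def fetch(session, url):
    """Return the body text, or None if the request fails."""
    try:
        async with session.get(url) as response:
            response.raise_for_status()  # raises ClientResponseError on 4xx/5xx
            return await response.text()
    except (aiohttp.ClientError, asyncio.TimeoutError) as e:
        # ClientError covers connection failures and bad statuses.
        print("{}: {!r}".format(url, e))
        return None


async def fetch_all(urls):
    # The session-level timeout is applied to each request made through it.
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))
```

Calling `asyncio.run(fetch_all(urls))` with a list of URLs returns one entry per URL, in order, with `None` where a request failed.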
    
  • 2020-12-08 01:46

    I would use gather instead of wait, since gather can return exceptions as objects instead of raising them. You can then check whether each result is an instance of an exception.

    import aiohttp
    import asyncio
    
    async def fetch(session, url):
        with aiohttp.Timeout(10):
            async with session.get(url) as response:
                return await response.text()
    
    async def fetch_all(session, urls, loop):
        results = await asyncio.gather(
            *[fetch(session, url) for url in urls],
            return_exceptions=True  # default is False, which would raise the first exception
        )
    
        # for testing purposes only
        # gather returns results in the order of coros
        for idx, url in enumerate(urls):
            print('{}: {}'.format(url, 'ERR' if isinstance(results[idx], Exception) else 'OK'))
        return results
    
    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        # breaks because of the first url
        urls = [
            'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
        with aiohttp.ClientSession(loop=loop) as session:
            the_results = loop.run_until_complete(
                fetch_all(session, urls, loop))
    

    Tests:

    $ python test.py
    http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
    http://google.com: OK
    http://twitter.com: OK
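
Because `gather` keeps results in input order, pairing each URL with its outcome is a matter of `zip`. The sketch below splits successes from failures that way. It runs without network access; `fake_fetch`, `split_results`, and the hostnames are hypothetical stand-ins rather than part of the answer's code.

```python
import asyncio


async def fake_fetch(url):
    """Hypothetical stand-in for fetch(): fails for hosts containing 'bad'."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    if "bad" in url:
        raise OSError("Cannot connect to host")
    return "<html>{}</html>".format(url)


async def split_results(urls):
    results = await asyncio.gather(
        *(fake_fetch(url) for url in urls),
        return_exceptions=True,  # exceptions come back as ordinary values
    )
    ok, failed = {}, {}
    for url, result in zip(urls, results):
        # Same isinstance test as in the answer's loop.
        if isinstance(result, Exception):
            failed[url] = result
        else:
            ok[url] = result
    return ok, failed


ok, failed = asyncio.run(split_results(
    ["http://bad.invalid", "http://a.example", "http://b.example"]))
```

Keeping the exception objects in `failed` makes it possible to log or retry specific URLs later, instead of only printing `ERR`.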
    