My script encounters an error when it is supposed to run asynchronously

Submitted by 偶尔善良 on 2019-12-11 07:53:32

Question


I've written a script in Python using asyncio in association with the aiohttp library to asynchronously parse the names out of the pop-up boxes that open upon clicking the contact info buttons of the different agencies listed in a table on this website. The webpage displays the tabular content across 513 pages.

I encountered the error too many file descriptors in select() when I tried with asyncio.get_event_loop(), but then I came across this thread where it was suggested to use asyncio.ProactorEventLoop() to avoid that error. However, even after complying with the suggestion, the script collects the names from only a few pages until it throws the following error. How can I fix this?

raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.tursab.org.tr:443 ssl:None [The semaphore timeout period has expired]

This is my attempt so far:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

async def get_links(url):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await process_docs(text)
            return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
                sauce = BeautifulSoup(text,"lxml")
                try:
                    name = sauce.select_one("p > b").text
                except Exception: name = ""
                print(name)

if __name__ == '__main__':
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))

In short, what the process_docs() function does is collect the data-id numbers from each page so they can be reused as the suffix of the https://www.tursab.org.tr/en/displayAcenta?AID={} link, from which the names are collected out of the pop-up boxes. One such id is 8757, and one such qualified link is therefore https://www.tursab.org.tr/en/displayAcenta?AID=8757.
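As a minimal, self-contained illustration of that extraction step, here is a sketch using a made-up HTML fragment that mimics the structure of the #acentaTbl table (html.parser is used in place of lxml so it runs without extra dependencies):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the #acentaTbl table on the search page.
html = """
<table id="acentaTbl">
  <tr data-id="8757"><td>Agency A</td></tr>
  <tr data-id="1234"><td>Agency B</td></tr>
  <tr><td>rows without a data-id are skipped</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Same selector as in the script: rows of #acentaTbl that carry a data-id.
items = [tr.get("data-id") for tr in soup.select("#acentaTbl tr[data-id]")]
pop_up_links = ["https://www.tursab.org.tr/en/displayAcenta?AID={}".format(i) for i in items]
print(items)            # → ['8757', '1234']
print(pop_up_links[0])  # → https://www.tursab.org.tr/en/displayAcenta?AID=8757
```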

By the way, if I change the highest page number used in the links variable to 20 or 30 or so, it runs smoothly.


Answer 1:


async def get_links(url):
    async with asyncio.Semaphore(10):

You can't do such a thing: it means that a new semaphore instance is created on each function call, while you need a single semaphore instance shared by all requests. Change your code this way:

sem = asyncio.Semaphore(10)  # module level

async def get_links(url):
    async with sem:
        # ...


async def fetch_again(link):
    async with sem:
        # ...

You can also return to the default event loop once you're using the semaphore correctly:

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(...)

Finally, you should alter both get_links(url) and fetch_again(link) to do the parsing outside of the semaphore, releasing it as soon as possible, before the semaphore is needed again inside process_docs(text).

Final code:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

sem = asyncio.Semaphore(10)

async def get_links(url):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
    result = await process_docs(text)
    return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
    sauce = BeautifulSoup(text,"lxml")
    try:
        name = sauce.select_one("p > b").text
    except Exception:
        name = ""
    print(name)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))


Source: https://stackoverflow.com/questions/53746967/my-script-encounters-an-error-when-it-is-supposed-to-run-asynchronously
