lxml incorrectly parsing the Doctype while looking for links


Question


I've got a BeautifulSoup4 (4.2.1) parser which collects all href attributes from our template files, and until now it has worked perfectly. But with lxml installed, one of our guys now gets:

TypeError: string indices must be integers.

I managed to replicate this on my Linux Mint VM, and the only difference appears to be lxml, so I assume the issue occurs when bs4 uses that HTML parser.

The problem function is:

import os
import re
from pathlib import Path

from bs4 import BeautifulSoup, SoupStrainer


def collecttemplateurls(templatedir, urlslist):
    """
    Uses BeautifulSoup to extract all the external URLs from the templates dir.

    @return: list of URLs
    """
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            # Path objects have no .endswith, so compare the string form
            if str(path).endswith(".html"):
                for link in BeautifulSoup(
                        open(path).read(),
                        parse_only=SoupStrainer(target="_blank")
                ):
                    if link["href"].startswith('http://'):
                        urlslist.append(link['href'])

                    elif link["href"].startswith('{{'):
                        for l in re.findall("'(http://(?:.*?))'", link["href"]):
                            urlslist.append(l)

    return urlslist

So for this one guy, the line if link["href"].startswith('http://'): raises the TypeError, because BS4 hands back the HTML Doctype as if it were a link.

Can anyone explain what the problem here might be? Nobody else has been able to recreate it.

I can't see how this could happen when using SoupStrainer like this. I assume it's somehow related to a system setup issue.

I can't see anything particularly special about our Doctype:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">

<head>

Answer 1:


SoupStrainer will not filter out the document type; it filters which elements remain in the document, but the doctype is retained because it is part of the 'container' for the filtered elements. You are looping over all elements in the document, so the first element you encounter is the Doctype object.
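A minimal sketch of the behaviour (the sample markup is made up for illustration; it assumes bs4 with lxml installed):

from bs4 import BeautifulSoup, SoupStrainer

markup = '''<!DOCTYPE html>
<html><body>
<a href="http://example.com" target="_blank">example</a>
</body></html>'''

soup = BeautifulSoup(markup, "lxml", parse_only=SoupStrainer(target="_blank"))
for element in soup:
    print(type(element).__name__)
# With lxml, the first element printed is Doctype, a str subclass;
# subscripting it with element["href"] raises
# TypeError: string indices must be integers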

Use .find_all() on the 'strained' document:

document = BeautifulSoup(open(path).read(), parse_only=SoupStrainer(target="_blank"))
for link in document.find_all(target="_blank"):
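
Since .find_all() with an attribute filter only ever returns Tag objects, every link in the loop then supports link["href"].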

or filter out the Doctype object:

from bs4 import BeautifulSoup, Doctype, SoupStrainer

for link in BeautifulSoup(
        open(path).read(),
        parse_only=SoupStrainer(target="_blank")
):
    if isinstance(link, Doctype):
        continue
    # ... handle link["href"] as before
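
Putting it together, here is a sketch of the original function with the .find_all() fix applied (it assumes pathlib-style paths; the regular expression is unchanged):

import os
import re
from pathlib import Path

from bs4 import BeautifulSoup, SoupStrainer


def collecttemplateurls(templatedir, urlslist):
    """Collect external URLs from every .html template under templatedir."""
    for dirpath, dirs, files in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.suffix != ".html":
                continue
            with open(path) as fh:
                document = BeautifulSoup(
                    fh.read(),
                    parse_only=SoupStrainer(target="_blank"))
            # find_all() skips the Doctype, so link is always a Tag
            for link in document.find_all(target="_blank"):
                href = link.get("href", "")
                if href.startswith('http://'):
                    urlslist.append(href)
                elif href.startswith('{{'):
                    urlslist.extend(re.findall("'(http://(?:.*?))'", href))
    return urlslist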


Source: https://stackoverflow.com/questions/17988884/lxml-incorrectly-parsing-the-doctype-while-looking-for-links
