lxml incorrectly parsing the Doctype while looking for links


Question


I've got a BeautifulSoup4 (4.2.1) parser which collects all href attributes from our template files, and until now it has worked perfectly. But with lxml installed, one of our guys now gets:

TypeError: string indices must be integers.

I managed to replicate this on my Linux Mint VM, and the only difference appears to be lxml, so I assume the issue occurs when bs4 uses that HTML parser.

The problem function is:

import os
import re
from pathlib import Path

from bs4 import BeautifulSoup, SoupStrainer


def collecttemplateurls(templatedir, urlslist):
    """
    Uses BeautifulSoup to extract all the external URLs from the templates dir.

    @return: list of URLs
    """
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            # Path objects have no .endswith, so compare the string form
            if str(path).endswith(".html"):
                for link in BeautifulSoup(
                        open(path).read(),
                        parse_only=SoupStrainer(target="_blank")
                ):
                    if link["href"].startswith('http://'):
                        urlslist.append(link['href'])

                    elif link["href"].startswith('{{'):
                        for l in re.findall("'(http://(?:.*?))'", link["href"]):
                            urlslist.append(l)

    return urlslist

So for this one guy, the line if link["href"].startswith('http://'): raises the TypeError, because BS4 hands back the HTML Doctype as if it were a link.

Can anyone explain what the problem here might be? Nobody else has been able to recreate it.

I can't see how this could happen when using SoupStrainer like this. I assume it's somehow related to a system setup issue.

I can't see anything particularly special about our Doctype:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">

<head>

Answer 1:


SoupStrainer will not filter out the document type; it filters which elements remain in the document, but the doctype is retained because it is part of the 'container' for the filtered elements. You are looping over all elements in the document, so the first element you encounter is the Doctype object.
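A minimal sketch of the behaviour (the sample markup is made up for illustration; it assumes bs4 with lxml installed):

from bs4 import BeautifulSoup, SoupStrainer

markup = '''<!DOCTYPE html>
<html><body>
<a href="http://example.com" target="_blank">example</a>
</body></html>'''

soup = BeautifulSoup(markup, "lxml", parse_only=SoupStrainer(target="_blank"))
for element in soup:
    print(type(element).__name__)
# With lxml, the first element printed is Doctype, a str subclass;
# subscripting it with element["href"] raises
# TypeError: string indices must be integers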

Use .find_all() on the 'strained' document:

document = BeautifulSoup(open(path).read(), parse_only=SoupStrainer(target="_blank"))
for link in document.find_all(target="_blank"):
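
Since .find_all() with an attribute filter only ever returns Tag objects, every link in the loop then supports link["href"].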

or filter out the Doctype object:

from bs4 import BeautifulSoup, Doctype, SoupStrainer

for link in BeautifulSoup(
        open(path).read(),
        parse_only=SoupStrainer(target="_blank")
):
    if isinstance(link, Doctype):
        continue
    # ... handle link["href"] as before
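
Putting it together, here is a sketch of the original function with the .find_all() fix applied (it assumes pathlib-style paths; the regular expression is unchanged):

import os
import re
from pathlib import Path

from bs4 import BeautifulSoup, SoupStrainer


def collecttemplateurls(templatedir, urlslist):
    """Collect external URLs from every .html template under templatedir."""
    for dirpath, dirs, files in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.suffix != ".html":
                continue
            with open(path) as fh:
                document = BeautifulSoup(
                    fh.read(),
                    parse_only=SoupStrainer(target="_blank"))
            # find_all() skips the Doctype, so link is always a Tag
            for link in document.find_all(target="_blank"):
                href = link.get("href", "")
                if href.startswith('http://'):
                    urlslist.append(href)
                elif href.startswith('{{'):
                    urlslist.extend(re.findall("'(http://(?:.*?))'", href))
    return urlslist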


Source: https://stackoverflow.com/questions/17988884/lxml-incorrectly-parsing-the-doctype-while-looking-for-links
