Question
I've got a BeautifulSoup4 (4.2.1) parser which collects all href attributes from our template files, and until now it has been just perfect. But with lxml installed, one of our guys is now getting:
TypeError: string indices must be integers
I managed to replicate this on my Linux Mint VM, and the only difference appears to be lxml, so I assume the issue occurs when bs4 uses that HTML parser.
The problem function is:
import os
import re

from bs4 import BeautifulSoup, SoupStrainer
# Path is a str-like path class used by the project (it is called with .endswith() below)


def collecttemplateurls(templatedir, urlslist):
    """
    Uses BeautifulSoup to extract all the external URLs from the templates dir.
    @return: list of URLs
    """
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(
                    open(path).read(),
                    parse_only=SoupStrainer(target="_blank")
                ):
                    if link["href"].startswith('http://'):
                        urlslist.append(link['href'])
                    elif link["href"].startswith('{{'):
                        for l in re.findall("'(http://(?:.*?))'", link["href"]):
                            urlslist.append(l)
    return urlslist
So for this one guy, the line if link["href"].startswith('http://'): gives the TypeError because BS4 thinks the HTML doctype is a link.
Can anyone explain what the problem here might be? Nobody else can recreate it.
I can't see how this could happen when using SoupStrainer like this, so I assume it's somehow related to a system setup issue.
I can't see anything particularly special about our Doctype:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
<head>
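Stripped right down, something like this is enough to trigger it on his machine (a sketch; "template.html" is just an example file containing the doctype above, and it only happens when lxml is installed so bs4 picks it as the default parser):

from bs4 import BeautifulSoup, SoupStrainer

# "template.html" is a placeholder for one of our template files.
soup = BeautifulSoup(open("template.html").read(),
                     parse_only=SoupStrainer(target="_blank"))
for link in soup:
    # With lxml the first item iterated is the doctype, not an <a> tag,
    # so link["href"] raises the TypeError.
    print(link["href"])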
Answer 1:
SoupStrainer will not filter out the document type; it filters which elements remain in the document, but the doctype is retained because it is part of the 'container' for the filtered elements. You are looping over all elements in the document, so the first element you encounter is the Doctype object.
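That also explains the exact error message: Doctype is a string subclass, and indexing a string with a string key is what raises it. A minimal sketch (the doctype text here is just a placeholder):

from bs4 import Doctype

# Doctype inherits from NavigableString, which is a str subclass.
doctype = Doctype('html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"')
print(isinstance(doctype, str))  # True

doctype["href"]  # TypeError: string indices must be integers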
Use .find_all() on the 'strained' document:
document = BeautifulSoup(open(path).read(), parse_only=SoupStrainer(target="_blank"))

for link in document.find_all(target="_blank"):
Or filter out the Doctype object:
from bs4 import Doctype

for link in BeautifulSoup(
    open(path).read(),
    parse_only=SoupStrainer(target="_blank")
):
    if isinstance(link, Doctype): continue
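Applied to your function, the first option would look roughly like this (a sketch; the rest of the logic is unchanged):

soup = BeautifulSoup(open(path).read(),
                     parse_only=SoupStrainer(target="_blank"))
# find_all() only returns matching tags, so the doctype never shows up.
for link in soup.find_all(target="_blank"):
    if link["href"].startswith('http://'):
        urlslist.append(link['href'])
    elif link["href"].startswith('{{'):
        for l in re.findall("'(http://(?:.*?))'", link["href"]):
            urlslist.append(l)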
Source: https://stackoverflow.com/questions/17988884/lxml-incorrectly-parsing-the-doctype-while-looking-for-links