Question
I want to run a Python script to parse HTML files and collect a list of all the links with a target="_blank" attribute.
I've tried the following but it's not getting anything from bs4. The docs say SoupStrainer takes args the same way as findAll etc., so this should work, right? Am I missing some stupid error?
import os
import sys

from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path


def main():
    ROOT = Path(os.path.realpath(__file__)).ancestor(3)
    src = ROOT.child("src")
    templatedir = src.child("templates")
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
                    print link


if __name__ == "__main__":
    sys.exit(main())
Answer 1:
I think you need something like this (note that you have to open the file itself, not the directory path):

if path.endswith(".html"):
    htmlfile = open(path)
    for link in BeautifulSoup(htmlfile, parse_only=SoupStrainer(target="_blank")):
        print link
Answer 2:
Your usage of BeautifulSoup is OK, but you should pass in the HTML string, not just the path of the HTML file. BeautifulSoup accepts an HTML string as its argument, not a file path; it will not open the file and read its content for you. You have to do that yourself. If you pass in a.html, the soup will be <html><body><p>a.html</p></body></html>. That is not the content of the file, so of course there are no links. You should use BeautifulSoup(open(path).read(), ...).
Edit:
It also accepts a file object, so BeautifulSoup(open(path), ...) is enough.
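To illustrate the difference, here is a minimal sketch (assuming bs4 is installed; the sample markup is made up). Passing the markup itself lets the SoupStrainer pick out the target="_blank" links, while passing a bare path string just parses the path as if it were HTML, so no tags are found:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup standing in for the contents of one template file
html = '<a href="/a" target="_blank">a</a><a href="/b">b</a>'

# Passing the markup: the strainer keeps only tags with target="_blank"
soup = BeautifulSoup(html, "html.parser",
                     parse_only=SoupStrainer(target="_blank"))
links = [tag.get("href") for tag in soup.find_all("a")]
print(links)  # ['/a']

# Passing a bare path string instead parses the literal text "sample.html"
# as HTML, so there are no <a> tags at all
bad = BeautifulSoup("sample.html", "html.parser")
print(bad.find_all("a"))  # []
```

In practice you would replace the `html` string with `open(path).read()` (or the open file object itself) inside the os.walk loop from the question.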
来源:https://stackoverflow.com/questions/17574119/trying-to-collect-data-from-local-files-using-beautifulsoup