Question
I want to run a Python script to parse HTML files and collect a list of all the links with a target="_blank" attribute.
I've tried the following but it's not getting anything from bs4. The docs say SoupStrainer takes args the same way as findAll etc., so this should work, right? Am I missing some stupid error?
import os
import sys

from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path


def main():
    ROOT = Path(os.path.realpath(__file__)).ancestor(3)
    src = ROOT.child("src")
    templatedir = src.child("templates")
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
                    print link


if __name__ == "__main__":
    sys.exit(main())
Answer 1:
I think you need something like this (note that you have to open the file itself, not the directory path):

if path.endswith(".html"):
    htmlfile = open(path)
    for link in BeautifulSoup(htmlfile, parse_only=SoupStrainer(target="_blank")):
        print link
Answer 2:
Your usage of BeautifulSoup is OK, but you should pass in the HTML string, not just the path of the HTML file. BeautifulSoup accepts an HTML string as its argument, not a file path; it will not open the file and read its content for you. You have to do that yourself. If you pass in a.html, the soup will be <html><body><p>a.html</p></body></html>. That is not the content of the file, so of course there are no links. You should use BeautifulSoup(open(path).read(), ...).
Edit:
It also accepts a file object, so BeautifulSoup(open(path), ...) is enough.
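To illustrate the difference, here is a minimal sketch (assuming bs4 is installed; the sample markup is made up). Passing the markup itself lets the SoupStrainer pick out the target="_blank" links, while passing a bare path string just parses the path as if it were HTML, so no tags are found:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup standing in for the contents of one template file
html = '<a href="/a" target="_blank">a</a><a href="/b">b</a>'

# Passing the markup: the strainer keeps only tags with target="_blank"
soup = BeautifulSoup(html, "html.parser",
                     parse_only=SoupStrainer(target="_blank"))
links = [tag.get("href") for tag in soup.find_all("a")]
print(links)  # ['/a']

# Passing a bare path string instead parses the literal text "sample.html"
# as HTML, so there are no <a> tags at all
bad = BeautifulSoup("sample.html", "html.parser")
print(bad.find_all("a"))  # []
```

In practice you would replace the `html` string with `open(path).read()` (or the open file object itself) inside the os.walk loop from the question.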
来源:https://stackoverflow.com/questions/17574119/trying-to-collect-data-from-local-files-using-beautifulsoup