Trying to collect data from local files using BeautifulSoup

Submitted by 旧巷老猫 on 2019-12-24 11:58:41

Question


I want to run a python script to parse html files and collect a list of all the links with a target="_blank" attribute.

I've tried the following, but it isn't getting anything back from bs4. The docs say SoupStrainer takes arguments the same way as findAll etc., so this should work, shouldn't it? Am I missing some stupid error?

import os
import sys

from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path

def main():

    ROOT = Path(os.path.realpath(__file__)).ancestor(3)
    src = ROOT.child("src")
    templatedir = src.child("templates")

    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
                    print link

if __name__ == "__main__":
    sys.exit(main())

Answer 1:


I think you need something like this:

if path.endswith(".html"):
    htmlfile = open(path)  # open the file itself, not the directory path
    for link in BeautifulSoup(htmlfile, parse_only=SoupStrainer(target="_blank")):
        print link
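
A side note: iterating over the soup object itself just walks its top-level children; soup.find_all() is the more explicit way to pull out the matching tags. Below is a minimal sketch of the same idea, assuming BeautifulSoup 4 with the standard-library html.parser (the file name is hypothetical):

from bs4 import BeautifulSoup, SoupStrainer

path = "templates/example.html"  # hypothetical file, for illustration

# Keep only tags that carry target="_blank"; everything else is
# discarded while the document is being parsed.
only_blank = SoupStrainer(target="_blank")

with open(path) as htmlfile:
    soup = BeautifulSoup(htmlfile, "html.parser", parse_only=only_blank)

# find_all() with no tag name returns every tag the strainer kept.
for link in soup.find_all(target="_blank"):
    print(link)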



Answer 2:


Your usage of BeautifulSoup is OK, but you should pass in the HTML string, not just the path of the HTML file. BeautifulSoup takes the markup itself as its argument, not a file path; it will not open the file and read its contents for you. If you pass in a.html, the resulting soup is <html><body><p>a.html</p></body></html>: that is the parse of the path string, not of the file's contents, so naturally it contains no links. You should use BeautifulSoup(open(path).read(), ...) instead.
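
To make the difference concrete, compare what each call actually parses (a.html standing in for any HTML file on disk):

from bs4 import BeautifulSoup

# Here the markup being parsed is the six-character string "a.html",
# not the file; the resulting soup contains no links at all.
soup = BeautifulSoup("a.html")

# Reading the file first parses its real contents.
soup = BeautifulSoup(open("a.html").read())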

edit:
BeautifulSoup also accepts an open file object, so BeautifulSoup(open(path), ...) is enough.
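
Applied to the script in the question, only the inner loop needs to change; a sketch keeping the original structure and names:

import os
import sys

from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path

def main():
    ROOT = Path(os.path.realpath(__file__)).ancestor(3)
    templatedir = ROOT.child("src").child("templates")

    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                # Pass the open file object; BeautifulSoup reads it itself.
                with open(path) as htmlfile:
                    soup = BeautifulSoup(htmlfile, parse_only=SoupStrainer(target="_blank"))
                for link in soup.find_all(target="_blank"):
                    print(link)

if __name__ == "__main__":
    sys.exit(main())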



Source: https://stackoverflow.com/questions/17574119/trying-to-collect-data-from-local-files-using-beautifulsoup
