Regex for links in html text

旧巷少年郎 2020-12-16 04:42

I hope this question is not an RTFM one. I am trying to write a Python script that extracts links from a standard HTML webpage (the tags). I hav

8 answers
  •  遥遥无期
    2020-12-16 05:32

    As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:

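    # Python 3 / BeautifulSoup 4 equivalents of the Python 2 urllib2 and BeautifulSoup 3 calls
    from urllib.request import urlopen
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    html = urlopen("http://www.google.com").read()
    soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
    all_links = soup.find_all("a")  # every <a> tag in the document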
    

    As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.
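    To illustrate what tolerant parsing buys you, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup (a swapped-in technique, chosen only so the example has no third-party dependency). The LinkCollector class, the extract_links function, and the sample markup are hypothetical names and data made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every <a> tag, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Invented sample: unclosed tags, an unquoted attribute, mixed-case names.
messy = '<p>Unclosed paragraph <a href=about.html>About<a HREF="http://example.com/">Example</A><b>bold'
print(extract_links(messy))
```

    The parser does not raise on the unclosed tags or the unquoted attribute value; it simply reports the start tags it can recognize, which is usually what you want when scraping real pages.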

    If you are certain you are working with well-formed XHTML, you can use a (much) faster XML parser such as expat.
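    For the XHTML case, a rough sketch using Python's built-in expat binding (xml.parsers.expat) could look like the following. The function name extract_links_xhtml and the sample document are assumptions made for this example, and the approach only works if the input really is well-formed XML: anything else raises ExpatError.

```python
import xml.parsers.expat


def extract_links_xhtml(xhtml):
    """Return href values of <a> elements in a well-formed XHTML string."""
    links = []

    def start_element(name, attrs):
        # XML is case-sensitive; XHTML element and attribute names are lowercase.
        if name == "a" and "href" in attrs:
            links.append(attrs["href"])

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xhtml, True)  # True marks this as the final chunk of input
    return links


# Invented sample document.
page = '<html><body><a href="http://example.com/">x</a><a name="n">y</a></body></html>'
print(extract_links_xhtml(page))
```

    Unlike the browser-style parsers, expat stops at the first well-formedness error, so wrap the call in a try/except for xml.parsers.expat.ExpatError if the input is not fully under your control.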

    Regex, for the reasons above (the parser must maintain state, and regex can't do that), will never be a general solution.
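    A small demonstration of the problem, using a deliberately naive pattern and invented sample markup (both are assumptions for illustration, not anything from the question): the regex has no notion of being "inside a comment", and it also breaks as soon as the attribute order or quoting style changes.

```python
import re

# Naive pattern: expects href right after the tag name, in double quotes.
NAIVE_LINK = re.compile(r'<a\s+href="([^"]*)"')

sample = (
    '<!-- <a href="http://commented.example/">old</a> -->\n'
    "<a class=\"x\" href='http://real.example/'>real</a>"
)

found = NAIVE_LINK.findall(sample)
print(found)  # ['http://commented.example/']
```

    The only match is the link inside the comment, which a browser would ignore, while the real link is missed because it has another attribute first and uses single quotes. You can keep patching the pattern for each case, but every patch is another piece of parser state squeezed into a regex.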
