Python regex look-behind requires fixed-width pattern

甜味超标 2020-12-19 03:28

When extracting the title of an HTML page, I have always used the following regex:

(?<=<title.*>)([\s\S]*)(?=</title>)
5 answers
  • 2020-12-19 03:55

    What about something like:

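     import re

     # Raw string; [^>]* and *? stop at the first closing tag instead of over-matching.
     r = re.compile(r"(<title[^>]*>)([\s\S]*?)(</title>)")
     title = r.search(page).group(2)  # group 2 is the text between the tags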
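
    A minimal sketch of why the capturing-group version works where the look-behind does not. Python's re module rejects a look-behind whose width can vary (here because of `.*`) at compile time, while a capturing group has no such restriction. The `page` string is a made-up sample, not taken from the question.

```python
import re

# Hypothetical sample document, used only for illustration.
page = "<html><head><title>Example page</title></head><body></body></html>"

# A look-behind containing ".*" has no fixed width, so compiling it fails.
try:
    re.compile(r"(?<=<title.*>)([\s\S]*)(?=</title>)")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

# Matching the tags and capturing only the text between them avoids the problem.
match = re.search(r"<title[^>]*>([\s\S]*?)</title>", page)
title = match.group(1) if match else None

print(lookbehind_error)
print(title)
```

    The tags are still matched, but only group 1 is read, so the result is the same as what the look-around version was meant to return.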
    
  • 2020-12-19 04:01

    If you just want to get the title tag,

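    from urllib.request import urlopen  # Python 3 replacement for urllib2

    html = urlopen("http://somewhere").read().decode("utf-8", errors="replace")
    for item in html.split("</title>"):
        if "<title>" in item:
            # Skip past "<title>" (7 characters) and print the rest of the chunk.
            print(item[item.find("<title>") + 7:])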
    
  • 2020-12-19 04:01

    Toss out the idea of parsing HTML with regular expressions and use an actual HTML parsing library instead. After a quick search I found this one. It's a much safer way to extract information from an HTML file.

    Remember, HTML is not a regular language, so regular expressions are fundamentally the wrong tool for extracting information from it.
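
    The library this answer linked to is not preserved here, so as a stand-in, here is a rough sketch using the standard library's html.parser module. The names `TitleParser`, `extract_title` and `sample` are illustrative and do not come from the thread.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "TITLE" arrives as "title".
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html_text):
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


# Hypothetical sample with attributes, upper-case tags and line breaks.
sample = "<html><head><TITLE lang='en'>\n  Example page\n</TITLE></head><body><p>Hi</p></body></html>"
print(extract_title(sample))
```

    The parser copes with attributes, tag case and line breaks without any special handling in the calling code, which is the kind of robustness a regex has to be patched for case by case.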

  • 2020-12-19 04:02

    The regex for extracting the content of non-nested HTML/XML tags is actually very simple:

    import re
    r = re.compile(r'<title[^>]*>(.*?)</title>')  # group 1 is the tag content
    

    However, for anything more complex, you should really use a proper HTML parser such as BeautifulSoup or lxml. (urllib only fetches pages; it does not parse them.)
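
    One caveat worth sketching for the simple regex: by default `.` does not match a newline and matching is case-sensitive, so a title that spans several lines or uses upper-case tags is missed. The version below adds the re.DOTALL and re.IGNORECASE flags; `TITLE_RE`, `find_title` and `multi` are hypothetical names for illustration.

```python
import re

# Same pattern as above, with flags for multi-line and upper-case titles.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def find_title(html_text):
    """Return the stripped title text, or None when no title tag is found."""
    m = TITLE_RE.search(html_text)
    return m.group(1).strip() if m else None


# Hypothetical sample: upper-case tags and a title split over two lines.
multi = "<head><TITLE>\nFirst line\nsecond line\n</TITLE></head>"

# The pattern without flags finds nothing in this sample.
plain_match = re.search(r"<title[^>]*>(.*?)</title>", multi)

print(plain_match)
print(find_title(multi))
```

    This still assumes a single, non-nested title element; anything beyond that is where a real parser pays off.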

  • 2020-12-19 04:04

    Here's a famous answer on parsing HTML with regular expressions that does a great job of saying, "don't use regex to parse HTML."
