Extract part of a regex match

陌路散爱 提交于 2019-11-26 00:48:21

问题


I want a regular expression to extract the title from a HTML page. Currently I have this:

title = re.search(\'<title>.*</title>\', html, re.IGNORECASE).group()
if title:
    title = title.replace(\'<title>\', \'\').replace(\'</title>\', \'\') 

Is there a regular expression to extract just the contents of <title> so I don\'t have to remove the tags?


回答1:


Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn't find the result, so don't use group() directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)



回答2:


Note that starting Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly within the if condition as a variable and re-use it in the condition's body:

# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
  title = match.group(1)
# hello



回答3:


Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)



回答4:


re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)




回答5:


The provided pieces of code do not cope with Exceptions May I suggest

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns an empty string by default if the pattern has not been found, or the first match.




回答6:


Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)



回答7:


May I recommend you to Beautiful Soup. Soup is a very good lib to parse all of your html document.

soup = BeatifulSoup(html_doc)
titleName = soup.title.name



回答8:


I'd think this should suffice:

#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)

... assuming that your text (HTML) is in a variable named "text."

This also assumes that there are not other HTML tags which can be legally embedded inside of an HTML TITLE tag and no way to legally embed any other < character within such a container/block.

However ...

Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a of extra work when various HTML, SGML and XML parsers are already in the standard libraries.

If your handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is wide recommended for this purpose.

Another option is: lxml ... which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.



来源:https://stackoverflow.com/questions/1327369/extract-part-of-a-regex-match

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!