I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:
Your number is <b>123</b>
val="Your number is <b>123</b>"
m=re.search(r'(<.*?>)(\d+)(<.*?>)',val)
m.group(2)
re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)',r'\3',val)
Given s = "Your number is <b>123</b>"
then:
import re
m = re.search(r"\d+", s)
will work and give you
m.group()
'123'
The regular expression looks for 1 or more consecutive digits in your string.
Note that in this specific case we knew that there would be a numeric sequence. Otherwise you would have to test the return value of re.search() to make sure that m is a match object and not None, because calling m.group() on None raises an AttributeError exception.
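A minimal sketch of that check, assuming you want None back when the string contains no digits. The helper name extract_number is only for illustration and is not part of the re module:

```python
import re

def extract_number(text):
    """Return the first run of digits in text as a string, or None if there is none."""
    m = re.search(r"\d+", text)
    # re.search() returns None when nothing matches, so test before calling group()
    if m is None:
        return None
    return m.group()

print(extract_number("Your number is <b>123</b>"))
print(extract_number("No digits here"))
```

The result is still a string, so wrap it in int() if you need a numeric value.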
Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
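For comparison, here is a rough sketch of the same extraction with BeautifulSoup. It assumes the third-party beautifulsoup4 package is installed and that the number is the only content of the first <b> tag, which holds for the sample string but may not hold for your real pages:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = "Your number is <b>123</b>"

# Parse with the parser bundled in the standard library
soup = BeautifulSoup(html, "html.parser")

# find() returns None when the document has no <b> tag, so guard against that
bold = soup.find("b")
number = int(bold.get_text()) if bold is not None else None

print(number)
```

Here the tag structure picks out the number, so digits elsewhere in the text would not be matched by accident.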
The simplest way is to just extract the digits (the number):
re.search(r"\d+",text)