Question
I am learning regex operations with the pandas Series string methods. I was able to extract the first number from the string, but my regex does not match the second number. How can I capture both numbers?
Note that in the second row, the second element should be NaN.
CODE:
import re
import pandas as pd

df = pd.DataFrame({'a': ["number 1.23 has 1.2 ",
                         "number 12.2 has 12 "]})
# Verbose-mode pattern (re.X): whitespace and newlines inside it are ignored.
pat = r""".+\s+
(\d+\.\d+)
.+
((?:\d+\.\d+)?)
.+"""
df['a'].str.extract(pat, flags=re.X, expand=True)
Gives:
      0 1
0  1.23
1  12.2
Expected:
      0    1
0  1.23  1.2
1  12.2  NaN
How can I fix the regex?
I am very new to regex, so please be considerate and forgive my ignorance.
Answer 1:
You may use .str.findall with the \d+\.\d+ regex:
>>> df['a'].str.findall(r"\d+\.\d+").to_frame()
             a
0  [1.23, 1.2]
1       [12.2]
Or,
>>> pd.DataFrame(df['a'].str.findall(r"\d+\.\d+").tolist())
      0     1
0  1.23   1.2
1  12.2  None
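The result above holds strings, with None where a row has fewer matches. If numeric values with NaN are wanted, as in the expected output of the question, one possible follow-up step is to pass each column through pd.to_numeric. This is only a sketch; the names found and numbers are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ["number 1.23 has 1.2 ",
                         "number 12.2 has 12 "]})

# One list of matched strings per row, expanded into columns;
# rows with fewer matches are padded with a missing value.
found = pd.DataFrame(df['a'].str.findall(r"\d+\.\d+").tolist())

# Convert each column from strings to floats; the missing entry becomes NaN.
numbers = found.apply(pd.to_numeric)
print(numbers)
```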
The pattern matches:
\d+ - 1+ digits
\. - a dot
\d+ - 1+ digits.
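To see what the pattern picks up, it can be tried outside pandas with the standard re module on the two sample strings from the question. A minimal sketch (the variable names are illustrative):

```python
import re

# 1+ digits, a literal dot, 1+ digits
pattern = re.compile(r"\d+\.\d+")

samples = ["number 1.23 has 1.2 ", "number 12.2 has 12 "]

# findall returns every non-overlapping match in each string.
matches = [pattern.findall(s) for s in samples]
print(matches)  # [['1.23', '1.2'], ['12.2']]
```

The bare 12 in the second string is skipped because it has no dot followed by digits.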
Note that str.findall does not require the whole pattern to be wrapped in a capturing group, unlike .str.extractall, which needs at least one capturing group and could also be used here.
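For comparison, here is a sketch of the .str.extractall route. It assumes the wide, one-column-per-match layout from the question is the goal; the names matches and wide are illustrative. extractall returns one row per match, indexed by the original row and a level named match, so unstacking that level gives one column per match.

```python
import pandas as pd

df = pd.DataFrame({'a': ["number 1.23 has 1.2 ",
                         "number 12.2 has 12 "]})

# extractall requires a capturing group, so the whole pattern is wrapped in one.
matches = df['a'].str.extractall(r"(\d+\.\d+)")

# Column 0 holds the captured text; pivot the 'match' index level into columns.
# Rows with fewer matches get NaN in the missing positions.
wide = matches[0].unstack('match')
print(wide)
```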
Source: https://stackoverflow.com/questions/56064849/pandas-string-extract-all-the-matches