Question
Given:
- A pandas Series src of strings;
- A complex regular expression (for simplicity, let it be '^(?:\d+ (\w+)|(\w+) \d+)$') that extracts a single substring (assume every string matches the regex).
The goal: get a pandas Series (i.e. a "column") containing the substrings extracted from the source series.
For example (note that this is a simplified example; in the real case the regular expression is more complex, i.e. it is difficult to rewrite it as a regex with a single capturing group):
src = pandas.Series(['1 s', '2 ss', 'sss 3', '4 ssss']) # source data
result = pandas.Series(['s', 'ss', 'sss', 'ssss']) # need to get
I tried the direct solution using str.extract:
result = src.str.extract(r'^(?:\d+ (\w+)|(\w+) \d+)$')
but it returns a DataFrame with 2 columns, where each row has NaN in one column and the required substring in the other:
      0    1
0     s  NaN
1    ss  NaN
2   NaN  sss
3  ssss  NaN
I tried using named capturing groups:
result = src.str.extract(r'^(?:\d+ (?P<field>\w+)|(?P<field>\w+) \d+)$')
but I got an error:
sre_constants.error: redefinition of group name 'field' as group 2
I don't know how to solve this problem when I use the alternation operator.
And the next question: how can I solve the same problem when a string in the series does not match the regex? In that case NaN must be returned.
UPDATE: I found a solution using str.cat:
result = src.str.extract(r'^(?:\d+ (\w+)|(\w+) \d+)$')
result = result[0].str.cat(result[1], na_rep='')
But it requires an additional step, so I am still looking for a more elegant solution that does not change the number of capturing groups in the regex.
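As a point of comparison, here is a minimal sketch of a variation on the two-column approach. It coalesces the columns with fillna instead of concatenating them, on the assumption that at most one group matches per row. Rows that match neither alternative then stay NaN, whereas str.cat with na_rep='' would yield an empty string for them. The 'no digits' row is a hypothetical addition to the sample data, included only to show the non-matching case.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row that does not match.
src = pd.Series(['1 s', '2 ss', 'sss 3', '4 ssss', 'no digits'])

# Two capturing groups: each matching row fills exactly one of the two columns.
parts = src.str.extract(r'^(?:\d+ (\w+)|(\w+) \d+)$')

# Take column 0 where it is present, otherwise fall back to column 1.
# A row that matched neither alternative is NaN in both columns and stays NaN.
result = parts[0].fillna(parts[1])
print(result)
```

This still needs a second step after extract, so it does not remove the extra action; it only changes how non-matching rows are reported.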
Answer 1:
Consider using a non-capturing group for the before/after part of the string, and just one capturing group. Make both non-capturing groups (before, after) optional. You won't detect erroneous lines this way, but it should get you what you need:
r"^(?:\d+\s+)?(\w+)(?:\s+\d+)?$"
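A minimal sketch of how this single-group pattern could be applied to the sample series from the question. With one capturing group and expand=False, str.extract returns a Series rather than a DataFrame. The 'lonely' string is a hypothetical input, added to show the laxness mentioned above: a bare word with no number is accepted too.

```python
import pandas as pd

# Sample data from the question.
src = pd.Series(['1 s', '2 ss', 'sss 3', '4 ssss'])

# Optional "number + whitespace" before, one capturing group,
# optional "whitespace + number" after.
pattern = r"^(?:\d+\s+)?(\w+)(?:\s+\d+)?$"

# expand=False with a single group yields a Series, not a DataFrame.
result = src.str.extract(pattern, expand=False)
print(result)

# Both the "before" and "after" parts are optional, so a word without any
# number also matches. This is the erroneous-line case that goes undetected.
lax = pd.Series(['lonely']).str.extract(pattern, expand=False)
print(lax)
```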
Now, your real data is more complex. Consider adding a look-ahead (a zero-width assertion) that defines the "correct" structure of your data. This requires the field to be valid according to whatever regex you already have. You have already done this work; you just need to put (?=...) around the regex and convert its capture groups to non-capture groups.
Next, separate all the alternate cases into before/capture/after sets. You have already done this work, you just have to organize the sets.
Now unify the before sets and the after sets with alternation, and a non-capturing group. Unify the "capture" sets with alternation and a capturing group. If possible, eliminate redundancy (two alternatives with the same pattern).
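Applied to the simplified regex from the question, these steps might look like the sketch below. The variable names (lookahead, before, capture, after) are only illustrative, and the 'lonely' row is a hypothetical non-matching input added to show that the look-ahead rejects it.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row with no number.
src = pd.Series(['1 s', '2 ss', 'sss 3', '4 ssss', 'lonely'])

# The original alternation, with its capture groups made non-capturing,
# wrapped in a look-ahead that validates the whole string.
lookahead = r"(?=(?:\d+ \w+|\w+ \d+)$)"
# "Before" and "after" sets: each one is either a number part or empty.
before = r"(?:\d+ )?"
after = r"(?: \d+)?"
# The single capturing group shared by both alternatives.
capture = r"(\w+)"

pattern = "^" + lookahead + before + capture + after + "$"

# One capturing group, so expand=False returns a Series.
result = src.str.extract(pattern, expand=False)
print(result)
```

Because the look-ahead only succeeds on strings the original alternation would accept, the 'lonely' row comes back as NaN instead of being silently accepted.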
If your regex was something like:
r"A(B)C|D(E)F|G(H)I|J(K)L"
You convert that to a look-ahead pattern:
r"(?=A(?:B)C|D(?:E)F|G(?:H)I|J(?:K)L)"
You create a non-capturing "before" alternation:
r"(?:A|D|G|J)"
And a non-capturing "after" alternation:
r"(?:C|F|I|L)"
Finally, a capturing "capture" alternation:
r"(B|E|H|K)"
Put them all together:
r"(?=A(?:B)C|D(?:E)F|G(?:H)I|J(?:K)L)(?:A|D|G|J)(B|E|H|K)(?:C|F|I|L)"
It's ugly as can be, and you'll probably want to use embedded comments to document it, but it will work.
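A quick way to check the toy pattern is to assemble it from its parts and run it with the standard re module. In this sketch the capture alternation lists the four captured letters B, E, H and K. The extract helper and the test strings are hypothetical, used only for illustration.

```python
import re

# Look-ahead: the original alternation with non-capturing groups.
lookahead = r"(?=A(?:B)C|D(?:E)F|G(?:H)I|J(?:K)L)"
# Unified "before", "capture" and "after" alternations.
before = r"(?:A|D|G|J)"
capture = r"(B|E|H|K)"
after = r"(?:C|F|I|L)"

pattern = re.compile(lookahead + before + capture + after)


def extract(text):
    """Return the single captured group, or None if text does not match."""
    match = pattern.match(text)
    return match.group(1) if match else None


for text in ["ABC", "DEF", "GHI", "JKL", "AEC"]:
    print(text, extract(text))
```

The last input, "AEC", mixes pieces from different alternatives. The loose before/capture/after part would accept it on its own, but the look-ahead rejects it, so extract returns None.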
Source: https://stackoverflow.com/questions/35096360/extract-single-substring-from-each-row-of-the-series-using-regular-expression-wi