Extract single substring from each row of the series using regular expression with named capturing groups in the alternation operator

Submitted by 。_饼干妹妹 on 2020-01-07 02:47:22

Question


Given:

  1. A pandas Series src of strings;
  2. A complex regular expression (for simplicity, take r'^(?:\d+ (\w+)|(\w+) \d+)$') that extracts a single substring from each string (assume every string matches the regex).

The goal: get a pandas Series (i.e. a "column") containing the substrings extracted from the source series.

For example (note that this is a simplified example; in the real case the regular expression is more complex, so it is difficult to rewrite it as a regex with a single capturing group):

import pandas

src = pandas.Series(['1 s', '2 ss', 'sss 3', '4 ssss'])  # source data
result = pandas.Series(['s', 'ss', 'sss', 'ssss'])  # desired result

I tried the direct solution using str.extract:

result = src.str.extract(r'^(?:\d+ (\w+)|(\w+) \d+)$')

but it returns a DataFrame with 2 columns, where each row holds NaN in one column and the required substring in the other:

      0    1
0     s  NaN
1    ss  NaN
2   NaN  sss
3  ssss  NaN

I tried using named capturing groups:

result = src.str.extract(r'^(?:\d+ (?P<field>\w+)|(?P<field>\w+) \d+)$')

but I got an error:

sre_constants.error: redefinition of group name 'field' as group 2

I don't know how to solve this problem when I use the alternation operator...

And the next question: how can I solve the same problem when a string in the series does not match the regex? In that case NaN must be returned.

UPDATE: I found a solution using str.cat:

result = src.str.extract(r'^(?:\d+ (\w+)|(\w+) \d+)$')
result = result[0].str.cat(result[1], na_rep='')

But it requires additional steps... So I am still looking for a more elegant solution that does not change the number of capturing groups in the regex.
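
A variation on the str.cat workaround, offered only as a sketch and not taken from the original post: with na_rep='', a row where neither group matches comes out as an empty string rather than NaN. Filling the first group's column from the second with fillna should keep NaN for such rows. The 'no match' row below is made up for illustration.

```python
import pandas as pd

# Source data from the question, plus one made-up row that does not match.
src = pd.Series(['1 s', '2 ss', 'sss 3', '4 ssss', 'no match'])

# Two capturing groups: for a matching row exactly one of them is non-NaN.
groups = src.str.extract(r'^(?:\d+ (\w+)|(\w+) \d+)$')

# Take group 0 where present, otherwise fall back to group 1.
# Rows where both groups are NaN stay NaN.
result = groups[0].fillna(groups[1])
print(result.tolist())
```

This still needs a second step after str.extract, so it does not remove the extra work the question complains about; it only changes what non-matching rows look like.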


Answer 1:


Consider using a non-capturing group for the before/after part of the string, and just one capturing group. Make both non-capturing groups (before, after) optional. You won't detect erroneous lines this way, but it should get you what you need:

r"^(?:\d+\s+)?(\w+)(?:\s+\d+)?$"
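
As a minimal sketch of how this pattern could be applied to the sample data from the question (the variable names and the extra 'sss' row are assumptions for illustration): with a single capturing group, passing expand=False to str.extract returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

src = pd.Series(['1 s', '2 ss', 'sss 3', '4 ssss'])

# One capturing group; the numeric prefix and suffix are optional
# non-capturing groups.
pattern = r'^(?:\d+\s+)?(\w+)(?:\s+\d+)?$'

# expand=False returns a Series when the pattern has one capturing group.
result = src.str.extract(pattern, expand=False)
print(result.tolist())

# The caveat from the text: a string with no number at all still matches,
# so malformed rows are not reported as NaN.
loose = pd.Series(['sss']).str.extract(pattern, expand=False)
print(loose.tolist())
```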

Now, your true data is more complex. Consider adding a lookahead, i.e. a zero-width assertion written (?=...), that defines the "correct" structure of your data. This requires the field to be valid according to whatever regex you already have. You have already done this work; you just need to put (?=...) around the regex and convert its capture groups to non-capture groups.

Next, separate all the alternate cases into before/capture/after sets. You have already done this work, you just have to organize the sets.

Now unify the before sets and the after sets with alternation, and a non-capturing group. Unify the "capture" sets with alternation and a capturing group. If possible, eliminate redundancy (two alternatives with the same pattern).

If your regex was something like:

r"A(B)C|D(E)F|G(H)I|J(K)L"

You convert that to a look-ahead pattern:

r"(?=A(?:B)C|D(?:E)F|G(?:H)I|J(?:K)L)"

You create a non-capturing "before" alternation:

r"(?:A|D|G|J)"

And a non-capturing "after" alternation:

r"(?:C|F|I|L)"

Finally, a capturing "capture" alternation:

r"(B|E|H|K)"

Put them all together:

r"(?=A(?:B)C|D(?:E)F|G(?:H)I|J(?:K)L)(?:A|D|G|J)(B|E|H|K)(?:C|F|I|L)"
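
To see the effect of the lookahead, here is a small sketch using only the standard re module. The capture alternation lists the captured parts B, E, H and K; the helper name extract and the test strings are made up for illustration. A mixed string such as "AEC" would satisfy the before/capture/after groups on their own, but the lookahead rejects it because it is not one of the original alternatives.

```python
import re

# Lookahead validates the overall shape; the body has one capturing group.
combined = re.compile(
    r"(?=A(?:B)C|D(?:E)F|G(?:H)I|J(?:K)L)"
    r"(?:A|D|G|J)(B|E|H|K)(?:C|F|I|L)"
)

def extract(text):
    """Return the captured middle part, or None if text does not match."""
    m = combined.search(text)
    return m.group(1) if m else None

for text in ["ABC", "DEF", "GHI", "JKL", "AEC", "xyz"]:
    print(text, extract(text))
```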

It's ugly as can be, and you'll probably want to use embedded comments to document it, but it will work.
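
Applied to the simplified regex from the question, the same recipe might look like the sketch below, written with re.VERBOSE so that each part can carry a comment. This is one possible reading of the approach, not a drop-in for the real, more complex regex; the rows 'sss' and '12' are made up to show that non-matching strings come out as NaN. In verbose mode a literal space has to be written as [ ].

```python
import re
import pandas as pd

# Source data from the question, plus made-up rows that should not match.
src = pd.Series(['1 s', '2 ss', 'sss 3', '4 ssss', 'sss', '12'])

pattern = r"""
    ^
    (?= (?: \d+[ ]\w+ | \w+[ ]\d+ ) $ )  # lookahead: original regex without capture groups
    (?: \d+[ ] )?                        # "before" part (first alternative only)
    ( \w+ )                              # the single capturing group
    (?: [ ]\d+ )?                        # "after" part (second alternative only)
    $
"""

# One capturing group plus expand=False gives a Series;
# rows rejected by the lookahead become NaN.
result = src.str.extract(pattern, flags=re.VERBOSE, expand=False)
print(result.tolist())
```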



Source: https://stackoverflow.com/questions/35096360/extract-single-substring-from-each-row-of-the-series-using-regular-expression-wi
