Extract regex matches, and not groups, in data frame rows in Python

Submitted by ∥☆過路亽.° on 2021-01-29 01:44:16

Question


I am a novice in coding and I generally use R for this (stringr), but I have started learning Python's syntax as well.

I have a data frame with one column generated from an imported Excel file. The values in this column contain both uppercase and lowercase characters, symbols and numbers.

I would like to generate a second column in the data frame containing only some of these words included in the first column according to a regex pattern.

import pandas as pd

df = pd.DataFrame(["THIS IS A TEST 123123. s.m.", "THIS IS A Test test 123 .s.c.e", "TESTING T'TEST 123 da."], columns=['Test'])

df

Now, to extract what I want (words in capital case), in R I would generally use:

df <- str_extract_all(df$Test, "\\b[A-Z]{1,}\\b", simplify = FALSE)

to extract the matches of the regular expression in different data frame rows, which are:

* THIS IS A TEST
* THIS IS A
* TESTING T TEST

I couldn't find a similar solution for Python, and the closest I've got is the following:

df["Name"] = df["Test"].str.extract(r"(\b[A-Z]{1,}\b)", expand = True)

Unfortunately, this does not work, as it returns only the groups rather than all the matches of the regex. I've tried multiple strategies, but str.extractall does not seem to work either ("TypeError: incompatible index of inserted column with frame index").

How can I extract the information I want with Python?

Thanks!


Answer 1:


If I understand correctly, you can try:

df["Name"] = (df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)")
                        .unstack().fillna('').apply(' '.join, 1))

[EDIT]: Here is a shorter version I found by looking at the documentation:

 df["Name"] = df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)").unstack(fill_value='').apply(' '.join, 1)
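To make the reshaping step easier to follow, here is a minimal, self-contained sketch of the same idea. It assumes the sample data from the question, and the trailing `.str.strip()` is my own addition: rows with fewer matches are padded with empty strings before joining, which would otherwise leave spaces at the end.

```python
import pandas as pd

df = pd.DataFrame(
    ["THIS IS A TEST 123123. s.m.",
     "THIS IS A Test test 123 .s.c.e",
     "TESTING T'TEST 123 da."],
    columns=["Test"],
)

# extractall returns one row per match, indexed by (original row, match number).
# That two-level index is why the result cannot be assigned to df directly.
matches = df["Test"].str.extractall(r"(\b[A-Z]+\b)")

# Pivot the match number into columns, pad missing matches with '',
# join each row with spaces, then strip the padding left by the empty strings.
df["Name"] = (
    matches[0]
    .unstack(fill_value="")
    .apply(" ".join, axis=1)
    .str.strip()
)
print(df)
```

Note that a row with no match at all does not appear in the `extractall` result, so it would end up as NaN in the new column.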



Answer 2:


You are on the right track with the pattern. This solution uses a regular expression together with join and map.

import re
df['Name'] = df['Test'].map(lambda x: ' '.join(re.findall(r"\b[A-Z\s]+\b", x)))

Result:

   Test                            Name
0  THIS IS A TEST 123123. s.m.     THIS IS A TEST
1  THIS IS A Test test 123 .s.c.e  THIS IS A
2  TESTING T'TEST 123 da.          TESTING T TEST
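As a possible variant, pandas' own `Series.str.findall` and `Series.str.join` can do the same job without `map`. The sketch below assumes the sample data from the question and uses the question's pattern without `\s`; with `\s` inside the character class a single match can swallow neighbouring spaces, so the joined string may carry extra whitespace.

```python
import re
import pandas as pd

df = pd.DataFrame(
    ["THIS IS A TEST 123123. s.m.",
     "THIS IS A Test test 123 .s.c.e",
     "TESTING T'TEST 123 da."],
    columns=["Test"],
)

# Assumption: "capital" means ASCII A-Z only.
pattern = r"\b[A-Z]+\b"

# str.findall gives a list of all matches per row;
# str.join concatenates each list with the given separator.
df["Name"] = df["Test"].str.findall(pattern).str.join(" ")

# Equivalent plain-Python version built on re.findall.
names = [" ".join(re.findall(pattern, s)) for s in df["Test"]]

# For comparison: the pattern with \s keeps the space before "123123".
with_space = re.findall(r"\b[A-Z\s]+\b", df["Test"][0])
print(df["Name"].tolist(), with_space)
```

If the `\s` version is preferred, adding `.strip()` after the join removes the whitespace at the ends.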


Source: https://stackoverflow.com/questions/55797875/extract-regex-matches-and-not-groups-in-data-frames-rows-in-python
