Using Pandas to Filter String In Cell with Multiple Values


Question


I am using pandas to filter a DataFrame with str.contains(), but my logic drops rows that I actually want to keep, depending on what the string contains. I don't know how to sort this out with pandas.

A sample cell in the Excel sheet I am working with looks like this:

Case #1: Don't flag this because there is a different recipient, bob@gmail.com

Recipient
---------
joe@work.com, bob@gmail.com, sally@work.com

Case #2: Flag this because every recipient contains @work.com

Recipient
---------
mike@work.com, taylor@work.com, barbra@work.com

I have a situation where I only want to filter a row when a specific condition holds for every value in the cell. For example, if 'Recipient' contains only joe@work.com, drop that row; but if it contains 'joe@work.com, bob@gmail.com' (yes, multiple addresses are comma-separated inside a single cell), keep it. Eventually this DataFrame will be dropped from a final report, so I want to drop rows where every address is @work.com, but not rows that also contain a @gmail.com address alongside the @work.com ones.

The query below drops every row, even when the Recipient column also contains 'gmail.com':

df['EMAIL10'] = (df['Type'].str.contains('Email')
                 & df['Type'].str.contains('Tracking | Data')
                 & df['Recipient'].str.contains('@work.com'))

Let me know if I need to clarify anything.


Answer 1:


You can create a Boolean mask that indicates whether all of the separate addresses contain '@work'.

First, split so that each address becomes a separate element; explode then turns this into one long Series whose index is duplicated and points back to the index of your original DataFrame. .str.contains checks the condition, and all(level=0) checks whether it is True for every address in a given row of the original DataFrame.

import pandas as pd

df = pd.DataFrame({'col': ['joe@work.com, bob@gmail.com, sally@work.com', 
                           'mike@work.com, taylor@work.com, barbra@work.com']})

df['all_work'] = df['col'].str.split(', ').explode().str.contains('@work').all(level=0)

print(df)
                                               col  all_work
0      joe@work.com, bob@gmail.com, sally@work.com     False
1  mike@work.com, taylor@work.com, barbra@work.com      True

For explanation, after split and explode we have:

df['col'].str.split(', ').explode()

 0       joe@work.com
 0      bob@gmail.com    # each address becomes its own element
 0     sally@work.com
 1      mike@work.com
 1    taylor@work.com
 1    barbra@work.com
 # ^ the index still points back to the row of the original DataFrame
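Note that all(level=0) was later deprecated in pandas and removed in pandas 2.0. A minimal sketch of the same idea using a groupby on the duplicated index instead, assuming the same df as above:

# Group the exploded Boolean Series by the original row index and
# require every address in the row to match.
exploded = df['col'].str.split(', ').explode()
df['all_work'] = exploded.str.contains('@work').groupby(level=0).all()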



Answer 2:


I think you can use explode and then groupby to filter out the @work emails:

print(df)

                                         Recipient
0      joe@work.com, bob@gmail.com, sally@work.com
1  mike@work.com, taylor@work.com, barbra@work.com

s = df['Recipient'].str.split(',').explode()
df['flag removed'] = s[~s.str.contains('@work')].groupby(level=0).agg(','.join)

print(df)

                                         Recipient    flag removed
0      joe@work.com, bob@gmail.com, sally@work.com   bob@gmail.com
1  mike@work.com, taylor@work.com, barbra@work.com             NaN

You can then use .dropna() to remove the rows with no matches.
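For instance, building on the 'flag removed' column above, a minimal sketch of producing the final report by keeping only rows that still have at least one non-@work recipient (report is just an assumed name for the result):

# Rows where every recipient was @work.com have NaN in 'flag removed'
report = df.dropna(subset=['flag removed'])
print(report)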




Answer 3:


You should get significant speed benefits if you run the string processing in plain Python:

df["all_work"] = [all("@work" in text for text in ent.split(","))
                  for ent in df.col ]

                                                col  all_work
0       joe@work.com, bob@gmail.com, sally@work.com     False
1  mike@work.com, taylor@work.com, barbra@work.com      True
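If the cells might have inconsistent spacing or capitalisation, a slightly more defensive variation of the same comprehension (a hypothetical tweak, not part of the original answer):

# Strip whitespace and lowercase each address before checking it
df["all_work"] = [
    all("@work" in addr.strip().lower() for addr in ent.split(","))
    for ent in df.col
]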



Answer 4:


Let us try something with str.count:

df.col.str.count('@work.com')==df.col.str.count(',').add(1)
Out[148]: 
0    False
1     True
Name: col, dtype: bool
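One caveat (not in the original answer): Series.str.count treats its pattern as a regular expression, so the '.' in '@work.com' matches any character. It happens to work here, but escaping the pattern is safer:

import re

# Escape the pattern so '.' is matched literally rather than as a regex wildcard
mask = df.col.str.count(re.escape('@work.com')) == df.col.str.count(',').add(1)
print(mask)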


Source: https://stackoverflow.com/questions/62199414/using-pandas-to-filter-string-in-cell-with-multiple-values
