Question
I have a list of about 1000 words, which I call negative words.
['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']
I'll soon be making a dataframe out of this list of words.
I also have a dataframe which looks like -
FileName PageNo LineNo GOODS_DESC
1 17668620 TM000004 36 CAST ARTICLES IRON SANITARY
59 17668620 TM000014 41 CRATES
60 17668620 TM000014 42 CAST ARTICLES IRON
61 17668620 TM000014 49 JAN ANIMAL AND VEGETABLE
63 17668620 TM000016 49 SETTLING AGENT
65 17668620 TM000016 29 JAN
66 17668620 TM000016 32 CLAUSES SPECIAL CONDITIONS WARRANTIES
67 17668620 TM000016 37 CARGO ISM ENDORSEMENT
69 17668620 TM000017 113 QUANTITY DECLARED IRON CRATES
I want to remove the negative words from the dataframe (as fast as possible) and get the refined dataframe, so that it looks like this -
FileName PageNo LineNo GOODS_DESC
1 17668620 TM000004 36 IRON
59 17668620 TM000014 41 CRATES
60 17668620 TM000014 42 IRON
61 17668620 TM000014 49 ANIMAL AND VEGETABLE
63 17668620 TM000016 49 SETTLING AGENT
65 17668620 TM000016 29 NaN
66 17668620 TM000016 32 CONDITIONS WARRANTIES
67 17668620 TM000016 37 CARGO ISM
69 17668620 TM000017 113 QUANTITY DECLARED IRON CRATES
Currently my approach is to iterate over the dataframe, take each row, split it, and check whether each word is in the negative words list. If it is not, I build a new string by joining the remaining words and write it back to the dataframe.
for rows in df.itertuples():
    a = []
    flat_list = []
    a.append(rows.GOODS_DESC)
    flat_list = [item.strip() for sublist in a for item in sublist.split(' ') if item.strip()]
    flat_list = list(sorted(set(flat_list), key=flat_list.index))
    flat_list = [i for i in flat_list if i.lower() not in negative_words_list]
    if(not flat_list):
        df.drop(rows.Index,inplace=True)
        continue
    s=' '.join(flat_list)
    df.loc[rows.Index,'GOODS_DESC']=s
df['GOODS_DESC'] = df['GOODS_DESC'].str.upper()
The only problem with this approach is that it's too slow.
If you have any hints or ideas, please share them. Can someone show me how to do this with a pandas dataframe in less time?
Answer 1:
Because the .str accessor in pandas is slow and loops internally, it may be better to just use a list comprehension:
import re
import numpy as np
l=['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']
df['GOODS_DESC'] = [re.sub('|'.join(l),'',i).strip() if re.sub('|'.join(l),'',i).strip() != '' else np.nan for i in df.GOODS_DESC]
Output:
FileName PageNo LineNo GOODS_DESC
1 17668620 TM000004 36 IRON
59 17668620 TM000014 41 CRATES
60 17668620 TM000014 42 IRON
61 17668620 TM000014 49 ANIMAL AND VEGETABLE
63 17668620 TM000016 49 SETTLING AGENT
65 17668620 TM000016 29 NaN
66 17668620 TM000016 32 CONDITIONS WARRANTIES
67 17668620 TM000016 37 CARGO ISM
69 17668620 TM000017 113 QUANTITY DECLARED IRON CRATES
Timings
%timeit[re.sub('|'.join(l),'',i).strip() if re.sub('|'.join(l),'',i).strip() != '' else np.nan for i in df.GOODS_DESC]
89.6 µs ± 667 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using .str accessor
%timeit df['GOODS_DESC'].str.replace('|'.join(l),'',regex=True).str.strip()
466 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
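One caveat with the joined pattern above is that it matches substrings, so 'JAN' would also be removed from inside a longer word. A minimal sketch of a whole-word variant follows, assuming the descriptions are plain whitespace-separated strings. The names `pattern` and `strip_negative_words` and the 'JANUARY CRATES' row are illustrative and not from the original post.

```python
import re

import numpy as np
import pandas as pd

negative_words = ['CAST', 'ARTICLES', 'SANITARY', 'JAN', 'CLAUSES', 'SPECIAL', 'ENDORSEMENT']

# Compile once. \b anchors each alternative to whole words, and re.escape
# guards against regex metacharacters in the word list.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, negative_words)) + r')\b')


def strip_negative_words(text):
    """Remove whole-word matches, collapse leftover spaces, return NaN if nothing remains."""
    cleaned = ' '.join(pattern.sub('', text).split())
    return cleaned if cleaned else np.nan


# Hypothetical sample rows; the last one checks that 'JANUARY' is left intact.
df = pd.DataFrame({'GOODS_DESC': ['CAST ARTICLES IRON SANITARY', 'JAN', 'JANUARY CRATES']})
df['GOODS_DESC'] = [strip_negative_words(s) for s in df['GOODS_DESC']]
```

Compiling the pattern outside the comprehension also avoids rebuilding the joined string for every row.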
Answer 2:
This should be fairly fast.
import re
import numpy as np
neg = ['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']
pat = re.compile('|'.join(neg))
df['GOODS_DESC'] = [re.sub(r'\s+', ' ', pat.sub('', s)).strip() for s in df.GOODS_DESC]
df.loc[df.GOODS_DESC=='', 'GOODS_DESC'] = np.nan
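Since the question already splits each description into words, a set-based filter may be worth timing against the regex versions, because membership tests on a set stay cheap even with around 1000 negative words. This is only a sketch under the assumption that words are separated by whitespace and share the same case as the list. The names `negative_set` and `drop_negative_tokens` are hypothetical.

```python
import numpy as np
import pandas as pd

# A set gives constant-time lookups, unlike a list of ~1000 words.
negative_set = {'CAST', 'ARTICLES', 'SANITARY', 'JAN', 'CLAUSES', 'SPECIAL', 'ENDORSEMENT'}


def drop_negative_tokens(text):
    """Keep only the words not in negative_set; return NaN when none are left."""
    kept = [tok for tok in text.split() if tok not in negative_set]
    return ' '.join(kept) if kept else np.nan


# Sample rows taken from the question's dataframe.
df = pd.DataFrame({'GOODS_DESC': ['CLAUSES SPECIAL CONDITIONS WARRANTIES',
                                  'CARGO ISM ENDORSEMENT',
                                  'JAN']})
df['GOODS_DESC'] = [drop_negative_tokens(s) for s in df['GOODS_DESC']]

# Optional: mirror the question's loop, which drops rows that end up empty.
refined = df.dropna(subset=['GOODS_DESC'])
```

Because it compares whole tokens, this variant never removes part of a longer word, and `str.split()` with no argument already discards repeated spaces.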
Answer 3:
Try this:
l=['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']
df['GOODS_DESC']=df['GOODS_DESC'].str.replace('|'.join(l),'',regex=True).str.strip()
Output:
GOODS_DESC
0 IRON
1 CRATES
2 IRON
3 ANIMAL AND VEGETABLE
4 SETTLING AGENT
5
6 CONDITIONS WARRANTIES
7 CARGO ISM
8 QUANTITY DECLARED IRON CRATES
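The output above leaves an empty string in row 5, whereas the question expects NaN there. One possible way to close that gap while staying with the .str accessor is sketched below. The small sample frame and the name `pat` are assumptions for illustration, and `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import numpy as np
import pandas as pd

l = ['CAST', 'ARTICLES', 'SANITARY', 'JAN', 'CLAUSES', 'SPECIAL', 'ENDORSEMENT']
pat = '|'.join(l)

# Sample rows taken from the question's dataframe.
df = pd.DataFrame({'GOODS_DESC': ['CAST ARTICLES IRON', 'JAN', 'SETTLING AGENT']})

df['GOODS_DESC'] = (
    df['GOODS_DESC']
    .str.replace(pat, '', regex=True)  # remove every negative word
    .str.strip()                       # trim leading and trailing spaces
    .replace('', np.nan)               # rows left empty become NaN
)
```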
Answer 4:
Try TextBlob and compute the polarity of each string. Polarity ranges from -1 to 1. If the value for a string is less than 0.5, target those strings and replace them.
Source: https://stackoverflow.com/questions/50584596/removal-of-substring-from-all-dataframe-columns