Removal of substring from all dataframe columns

Submitted by 只谈情不闲聊 on 2019-12-07 22:30:46

Question


I have a list of around 1,000 words, which I call negative words.

['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']

I'll soon be making a dataframe out of this list of words.

I also have a dataframe that looks like this:

    FileName    PageNo     LineNo   GOODS_DESC                   
1   17668620    TM000004    36      CAST ARTICLES IRON SANITARY  
59  17668620    TM000014    41      CRATES                       
60  17668620    TM000014    42      CAST ARTICLES IRON           
61  17668620    TM000014    49      JAN ANIMAL AND VEGETABLE     
63  17668620    TM000016    49      SETTLING AGENT               
65  17668620    TM000016    29      JAN 
66  17668620    TM000016    32      CLAUSES SPECIAL CONDITIONS WARRANTIES   
67  17668620    TM000016    37      CARGO ISM ENDORSEMENT
69  17668620    TM000017    113     QUANTITY DECLARED IRON CRATES   
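
For anyone who wants to reproduce the examples below, here is a minimal setup sketch. The names negative_words and df are just illustrative (the question's own code calls the list negative_words_list), and the frame is rebuilt by hand from the table above:

import pandas as pd

negative_words = ['CAST', 'ARTICLES', 'SANITARY', 'JAN', 'CLAUSES', 'SPECIAL', 'ENDORSEMENT']

# Rebuild the sample dataframe shown above
df = pd.DataFrame(
    {'FileName': 17668620,
     'PageNo': ['TM000004', 'TM000014', 'TM000014', 'TM000014', 'TM000016',
                'TM000016', 'TM000016', 'TM000016', 'TM000017'],
     'LineNo': [36, 41, 42, 49, 49, 29, 32, 37, 113],
     'GOODS_DESC': ['CAST ARTICLES IRON SANITARY', 'CRATES', 'CAST ARTICLES IRON',
                    'JAN ANIMAL AND VEGETABLE', 'SETTLING AGENT', 'JAN',
                    'CLAUSES SPECIAL CONDITIONS WARRANTIES', 'CARGO ISM ENDORSEMENT',
                    'QUANTITY DECLARED IRON CRATES']},
    index=[1, 59, 60, 61, 63, 65, 66, 67, 69])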

I want to remove the negative words from the dataframe (as fast as possible) and get a refined dataframe, so that it looks like this:

    FileName    PageNo     LineNo   GOODS_DESC                   
1   17668620    TM000004    36      IRON 
59  17668620    TM000014    41      CRATES                       
60  17668620    TM000014    42      IRON             
61  17668620    TM000014    49      ANIMAL AND VEGETABLE     
63  17668620    TM000016    49      SETTLING AGENT               
65  17668620    TM000016    29      NaN
66  17668620    TM000016    32      CONDITIONS WARRANTIES   
67  17668620    TM000016    37      CARGO ISM
69  17668620    TM000017    113     QUANTITY DECLARED IRON CRATES   

Currently my approach is to iterate over the dataframe, split each row's GOODS_DESC, and check whether each split word is in the negative-words list. If it is not, I build a new string by joining the remaining words and write it back into the dataframe.

for rows in df.itertuples():
    a = []
    flat_list = []
    a.append(rows.GOODS_DESC)
    flat_list = [item.strip() for sublist in a for item in sublist.split(' ') if item.strip()]
    flat_list = list(sorted(set(flat_list), key=flat_list.index))
    flat_list = [i for i in flat_list if i.lower() not in negative_words_list]

    if(not flat_list):
        df.drop(rows.Index,inplace=True)
        continue
    s=' '.join(flat_list)
    df.loc[rows.Index,'GOODS_DESC']=s
df['GOODS_DESC'] = df['GOODS_DESC'].str.upper()

The only problem with this approach is that it's too slow.

If you have any hints or logic, please share. Can someone show me how to do this with a pandas DataFrame in less time?


Answer 1:


Due to the slowness and loopiness of the .str accessor in pandas, it may be better to just use a list comprehension:

import re
import numpy as np  # needed for np.nan

l = ['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']
df['GOODS_DESC'] = [re.sub('|'.join(l), '', i).strip() if re.sub('|'.join(l), '', i).strip() != '' else np.nan for i in df.GOODS_DESC]

Output:

    FileName    PageNo  LineNo                     GOODS_DESC
1   17668620  TM000004      36                           IRON
59  17668620  TM000014      41                         CRATES
60  17668620  TM000014      42                           IRON
61  17668620  TM000014      49           ANIMAL AND VEGETABLE
63  17668620  TM000016      49                 SETTLING AGENT
65  17668620  TM000016      29                            NaN
66  17668620  TM000016      32          CONDITIONS WARRANTIES
67  17668620  TM000016      37                      CARGO ISM
69  17668620  TM000017     113  QUANTITY DECLARED IRON CRATES

Timings

%timeit [re.sub('|'.join(l),'',i).strip() if re.sub('|'.join(l),'',i).strip() != '' else np.nan for i in df.GOODS_DESC]

89.6 µs ± 667 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Using the .str accessor:

%timeit df['GOODS_DESC'].str.replace('|'.join(l),'').str.strip()

466 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)




Answer 2:


This should be fairly fast.

import re
import numpy as np

neg = ['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']
pat = re.compile('|'.join(neg))
df['GOODS_DESC'] = [re.sub(r'\s+', ' ', re.sub(pat, '', s)).strip() for s in df.GOODS_DESC]
df.loc[df.GOODS_DESC == '', 'GOODS_DESC'] = np.nan



Answer 3:


Try this:

l = ['CAST','ARTICLES','SANITARY','JAN','CLAUSES','SPECIAL','ENDORSEMENT']

# regex=True keeps the joined pattern treated as a regular expression on newer pandas versions
df['GOODS_DESC'] = df['GOODS_DESC'].str.replace('|'.join(l), '', regex=True).str.strip()

Output:

                      GOODS_DESC
0                           IRON
1                         CRATES
2                           IRON
3           ANIMAL AND VEGETABLE
4                 SETTLING AGENT
5                               
6          CONDITIONS WARRANTIES
7                     CARGO ISM 
8  QUANTITY DECLARED IRON CRATES



Answer 4:


Try TextBlob and compute the sentiment polarity of each string (TextBlob's polarity score ranges from -1.0 to 1.0). Target the strings whose polarity falls below your chosen threshold and replace them.
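
This answer gives an idea rather than a substring-removal recipe. A minimal sketch of what it might look like, assuming the textblob package is installed and treating the threshold value as a purely illustrative choice:

from textblob import TextBlob  # pip install textblob
import numpy as np

# TextBlob's polarity is a float in [-1.0, 1.0]; the threshold below is only an example
polarity_threshold = 0.0
polarity = df['GOODS_DESC'].apply(lambda s: TextBlob(str(s)).sentiment.polarity)
df.loc[polarity < polarity_threshold, 'GOODS_DESC'] = np.nan  # blank out descriptions judged negative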



Source: https://stackoverflow.com/questions/50584596/removal-of-substring-from-all-dataframe-columns
