How to clean a string to get value_counts for words of interest by date?

Submitted by 牧云@^-^@ on 2020-12-08 05:08:43

Question


I have the following data generated from a groupby('Datetime') and value_counts()

Datetime        0          
01/01/2020  Paul            8
            03              2
01/02/2020  Paul            2
            10982360967     1
01/03/2020  religion        3
                           ..
02/28/2020  l              18
02/29/2020  Paul           78
            march          22
03/01/2020  church         63
            l              21

I would like to remove a specific name (in this case, 'Paul') and all the numbers (03 and 10982360967 in this example). I do not know why the single character 'l' appears, since I tried to remove stopwords, including single letters and numbers. How could I further clean this selection?

Expected output to avoid confusion:

Datetime        0          
01/03/2020  religion        3
                           ..
02/29/2020  march          22
03/01/2020  church         63

I removed Paul, 03, 109..., and l.

Raw data:

Datetime        Corpus          
01/03/2020      Paul: examples of religion
01/03/2020      Paul:shinto is a religion 03
01/03/2020      don't talk to me about religion, Paul 03
...
02/29/2020     march is the third month of the year 10982360967
02/29/2020     during march, there are some cold days.
...
03/01/2020     she is at church right now
...

I cannot include all the raw data, as I have more than 100 sentences.

The code I used is:

df.Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)

Since I got a KeyError, I had to edit the code as follows:

df.set_index('Datetime').Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)

To extract the words, I used str.extractall.


Answer 1:


Cleaning strings is a multi-step process

Create dataframe

import pandas as pd
from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')
import string

# data and dataframe
data = {'Datetime': ['01/03/2020', '01/03/2020', '01/03/2020', '02/29/2020', '02/29/2020', '03/01/2020'],
        'Corpus': ['Paul: Examples of religion',
                   'Paul:shinto is a religion 03',
                   "don't talk to me about religion, Paul 03",
                   'march is the third month of the year 10982360967',
                   'during march, there are some cold days.',
                   'she is at church right now']}

test = pd.DataFrame(data)
test.Datetime = pd.to_datetime(test.Datetime)

|    | Datetime            | Corpus                                           |
|---:|:--------------------|:-------------------------------------------------|
|  0 | 2020-01-03 00:00:00 | Paul: Examples of religion                       |
|  1 | 2020-01-03 00:00:00 | Paul:shinto is a religion 03                     |
|  2 | 2020-01-03 00:00:00 | don't talk to me about religion, Paul 03         |
|  3 | 2020-02-29 00:00:00 | march is the third month of the year 10982360967 |
|  4 | 2020-02-29 00:00:00 | during march, there are some cold days.          |
|  5 | 2020-03-01 00:00:00 | she is at church right now                       |

Clean Corpus

  • Add extra words to the remove_words list
    • They should be lowercase
  • Some cleaning steps could be combined, but I do not recommend that
    • Step-by-step makes it easier to determine if you've made a mistake
  • This is a small example of text cleaning.
    • There are entire books on the subject.
      • There's no context analysis
      • example = 'We march to the church in March.'
      • value_count for 'march' in example.lower() is 2
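The last point can be checked with a short sketch. It uses only the standard library, and the tokenizing pattern `[a-z]+` is an assumption chosen for illustration, not part of the cleaning steps in this answer:

```python
import re
from collections import Counter

example = 'We march to the church in March.'

# Lowercase first, then keep runs of letters as tokens (assumed tokenizer)
tokens = re.findall(r'[a-z]+', example.lower())

counts = Counter(tokens)
print(counts['march'])  # 2: the verb and the month are counted together
```

Telling the two meanings apart would need part-of-speech or context information, which plain string cleaning does not provide.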

import re

# words to remove
remove_words = list(stopwords.words('english'))
# extra words to remove
additional_remove_words = ['paul', 'shinto', 'examples', 'talk', 'third', 'month', 'year', 'cold', 'days', 'right']
remove_words.extend(additional_remove_words)  # add other words to exclude in lowercase

# punctuation to remove
# re.escape keeps characters such as ] and \ literal inside the character class
punctuation = string.punctuation
punc = '[{}]'.format(re.escape(punctuation))

test.dropna(inplace=True)  # drop any na rows

# clean text now
# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
test.Corpus = test.Corpus.str.replace(r'\d+', '', regex=True)  # remove numbers

test.Corpus = test.Corpus.str.replace(punc, ' ', regex=True)  # replace punctuation with a space

test.Corpus = test.Corpus.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into one space

test.Corpus = test.Corpus.str.strip()  # remove whitespace from beginning and end of string

test.Corpus = test.Corpus.str.lower()  # convert all to lowercase

test.Corpus = test.Corpus.apply(lambda x: [word for word in x.split() if word not in remove_words])  # remove words

|    | Datetime            | Corpus       |
|---:|:--------------------|:-------------|
|  0 | 2020-01-03 00:00:00 | ['religion'] |
|  1 | 2020-01-03 00:00:00 | ['religion'] |
|  2 | 2020-01-03 00:00:00 | ['religion'] |
|  3 | 2020-02-29 00:00:00 | ['march']    |
|  4 | 2020-02-29 00:00:00 | ['march']    |
|  5 | 2020-03-01 00:00:00 | ['church']   |
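
The question also mentions a stray 'l' token, whose origin is not clear from the data shown. A minimal sketch, assuming such tokens are single characters that should always be discarded, filters each token list by length. The `tokens` Series is hypothetical stand-in data, not the real Corpus column:

```python
import pandas as pd

# Hypothetical token lists, as they might look after the word-removal step
tokens = pd.Series([['religion', 'l'], ['march'], ['l', 'church']])

# Keep only tokens longer than one character
filtered = tokens.apply(lambda words: [w for w in words if len(w) > 1])

print(filtered.tolist())  # [['religion'], ['march'], ['church']]
```

If single letters are the only problem, adding them to remove_words (for example via string.ascii_lowercase) would have the same effect.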

Explode Corpus & groupby

# explode list
test = test.explode('Corpus')

# dropna in case there are empty rows from filtering
test.dropna(inplace=True)

# groupby
test.groupby('Datetime').agg({'Corpus': 'value_counts'}).rename(columns={'Corpus': 'word_count'})

                     word_count
Datetime   Corpus              
2020-01-03 religion           3
2020-02-29 march              2
2020-03-01 church             1
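
The question kept only the top words per date with head(2). A sketch of how that might be done on the exploded frame follows. Here `words` is made-up data for illustration, and grouping the frame before selecting Corpus is one way to avoid the KeyError from the question, which comes from calling groupby('Datetime') on a Series that no longer carries that column:

```python
import pandas as pd

# Made-up exploded frame: one cleaned word per row
words = pd.DataFrame({
    'Datetime': pd.to_datetime(['01/03/2020', '01/03/2020', '01/03/2020',
                                '02/29/2020', '02/29/2020', '02/29/2020'],
                               format='%m/%d/%Y'),
    'Corpus': ['religion', 'religion', 'faith', 'march', 'march', 'cold'],
})

# Group the frame first, then select the column and count words per date
counts = words.groupby('Datetime')['Corpus'].value_counts()

# value_counts sorts each group in descending order, so head(1) keeps the top word per date
top1 = counts.groupby(level=0).head(1)

print(top1)
```

Replacing head(1) with head(2) gives the two most frequent words per date, as in the question.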


Source: https://stackoverflow.com/questions/62236140/how-to-clean-a-string-to-get-value-counts-for-words-of-interest-by-date
