Question
I have the following data generated from a groupby('Datetime') and value_counts():
Datetime 0
01/01/2020 Paul 8
03 2
01/02/2020 Paul 2
10982360967 1
01/03/2020 religion 3
..
02/28/2020 l 18
02/29/2020 Paul 78
march 22
03/01/2020 church 63
l 21
I would like to remove a specific name (in this case 'Paul') and all the numbers (03 and 10982360967 in this example). I do not know why the character 'l' is there, as I had already tried to remove stopwords, including single letters and numbers. Do you know how I could further clean this selection?
Expected output to avoid confusion:
Datetime 0
01/03/2020 religion 3
..
02/29/2020 march 22
03/01/2020 church 63
I removed Paul, 03, 109..., and l.
Raw data:
Datetime Corpus
01/03/2020 Paul: examples of religion
01/03/2020 Paul:shinto is a religion 03
01/03/2020 don't talk to me about religion, Paul 03
...
02/29/2020 march is the third month of the year 10982360967
02/29/2020 during march, there are some cold days.
...
03/01/2020 she is at church right now
...
I cannot put all the raw data as I have more than 100 sentences.
The code I used is:
df.Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
Since I got a Key error, I had to edit the code as follows:
df.set_index('Datetime').Corpus.groupby('Datetime').value_counts().groupby('Datetime').head(2)
To extract the words I used str.extractall
Answer 1:
Cleaning strings is a multi-step process
Create dataframe
import pandas as pd
from nltk.corpus import stopwords
import string
# data and dataframe
data = {'Datetime': ['01/03/2020', '01/03/2020', '01/03/2020', '02/29/2020', '02/29/2020', '03/01/2020'],
'Corpus': ['Paul: Examples of religion',
'Paul:shinto is a religion 03',
"don't talk to me about religion, Paul 03",
'march is the third month of the year 10982360967',
'during march, there are some cold days.',
'she is at church right now']}
test = pd.DataFrame(data)
test.Datetime = pd.to_datetime(test.Datetime)
| | Datetime | Corpus |
|---:|:--------------------|:-------------------------------------------------|
| 0 | 2020-01-03 00:00:00 | Paul: Examples of religion |
| 1 | 2020-01-03 00:00:00 | Paul:shinto is a religion 03 |
| 2 | 2020-01-03 00:00:00 | don't talk to me about religion, Paul 03 |
| 3 | 2020-02-29 00:00:00 | march is the third month of the year 10982360967 |
| 4 | 2020-02-29 00:00:00 | during march, there are some cold days. |
| 5 | 2020-03-01 00:00:00 | she is at church right now |
Clean Corpus
- Add extra words to the remove_words list
- They should be lowercase
- Some cleaning steps could be combined, but I do not recommend that
- Step-by-step makes it easier to determine if you've made a mistake
- This is a small example of text cleaning.
- There are entire books on the subject.
- There's no context analysis
- For example, with example = 'We march to the church in March.', the value_count for 'march' in example.lower() is 2, even though one occurrence is the verb and the other is the month
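The lack of context analysis can be checked with a few lines of plain Python. This is only a minimal sketch using collections.Counter rather than pandas; the variable names words and counts are assumptions made for illustration.

```python
from collections import Counter

example = 'We march to the church in March.'

# lowercase, drop the trailing period, then split on whitespace
words = example.lower().replace('.', '').split()

# count every token; the verb 'march' and the month 'March' are now the same token
counts = Counter(words)
print(counts['march'])
```

Once the text is lowercased, the two meanings cannot be told apart, so both are counted under the same key.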
# words to remove
remove_words = list(stopwords.words('english'))
# extra words to remove
additional_remove_words = ['paul', 'shinto', 'examples', 'talk', 'third', 'month', 'year', 'cold', 'days', 'right']
remove_words.extend(additional_remove_words) # add other words to exclude in lowercase
# punctuation to remove
punctuation = string.punctuation
punc = r'[{}]'.format(punctuation)
test.dropna(inplace=True) # drop any na rows
# clean text now
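# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
test.Corpus = test.Corpus.str.replace(r'\d+', '', regex=True) # remove numbers
test.Corpus = test.Corpus.str.replace(punc, ' ', regex=True) # remove punctuation
test.Corpus = test.Corpus.str.replace(r'\s+', ' ', regex=True) # remove occurrences of more than one whitespace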
test.Corpus = test.Corpus.str.strip() # remove whitespace from beginning and end of string
test.Corpus = test.Corpus.str.lower() # convert all to lowercase
test.Corpus = test.Corpus.apply(lambda x: list(word for word in x.split() if word not in remove_words)) # remove words
| | Datetime | Corpus |
|---:|:--------------------|:-------------|
| 0 | 2020-01-03 00:00:00 | ['religion'] |
| 1 | 2020-01-03 00:00:00 | ['religion'] |
| 2 | 2020-01-03 00:00:00 | ['religion'] |
| 3 | 2020-02-29 00:00:00 | ['march'] |
| 4 | 2020-02-29 00:00:00 | ['march'] |
| 5 | 2020-03-01 00:00:00 | ['church'] |
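To see what the cleaning steps above do to a single string, they can be sketched as one helper function using only the standard library. This is an illustration, not a replacement for the pandas code: clean_text, STOP_WORDS and EXTRA_WORDS are hypothetical names, and the stopword set is a short hand-written stand-in for the full nltk English list.

```python
import re
import string

# Hypothetical, abbreviated stopword set; the answer uses nltk's full English list
STOP_WORDS = {'is', 'a', 'of', 'to', 'me', 'about', 'the', 'at', 'she', 'now', 'don', 't'}
# extra words to remove, in lowercase
EXTRA_WORDS = {'paul', 'shinto', 'examples', 'talk'}

# character class matching any single punctuation character
PUNC = '[{}]'.format(re.escape(string.punctuation))


def clean_text(text, remove_words=STOP_WORDS | EXTRA_WORDS):
    text = re.sub(r'\d+', '', text)      # remove numbers
    text = re.sub(PUNC, ' ', text)       # replace punctuation with a space
    text = re.sub(r'\s+', ' ', text)     # collapse repeated whitespace
    text = text.strip().lower()          # trim, then convert to lowercase
    # keep only the words that are not in the removal set
    return [word for word in text.split() if word not in remove_words]


print(clean_text("don't talk to me about religion, Paul 03"))
print(clean_text('Paul:shinto is a religion 03'))
```

Replacing punctuation with a space rather than an empty string matters here: 'Paul:shinto' becomes two words instead of 'Paulshinto', so both can be filtered out.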
Explode Corpus & groupby
# explode list
test = test.explode('Corpus')
# dropna in case there are empty rows from filtering
test.dropna(inplace=True)
# groupby
test.groupby('Datetime').agg({'Corpus': 'value_counts'}).rename(columns={'Corpus': 'word_count'})
word_count
Datetime Corpus
2020-01-03 religion 3
2020-02-29 march 2
2020-03-01 church 1
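The question also used .head(2) to keep only the two most frequent words per date. A possible way to apply that to exploded tokens is sketched below. The tokens frame is made-up data for illustration only (it is not the asker's corpus), and calling value_counts directly on the grouped column is a swapped-in alternative to the agg call above.

```python
import pandas as pd

# Hypothetical, already-cleaned tokens (one word per row), for illustration only
tokens = pd.DataFrame({
    'Datetime': pd.to_datetime(['01/03/2020'] * 6 + ['02/29/2020'] * 3),
    'Corpus': ['religion', 'religion', 'religion', 'faith', 'faith', 'temple',
               'march', 'march', 'cold'],
})

# count words per date; within each date the counts are sorted in descending order
counts = tokens.groupby('Datetime')['Corpus'].value_counts()

# keep the two most frequent words for each date
top2 = counts.groupby(level='Datetime').head(2)
print(top2)
```

With this sample data, 'temple' is dropped because it is only the third most frequent word on 2020-01-03.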
Source: https://stackoverflow.com/questions/62236140/how-to-clean-a-string-to-get-value-counts-for-words-of-interest-by-date