Why is the number of stems from the NLTK stemmer output different from the expected output?


Question


I have to perform stemming on a text. The task steps are as follows:

  1. Tokenize all the words given in tc. A word should contain only alphabets, numbers, or underscores. Store the tokenized list of words in tw.
  2. Convert all the words to lowercase. Store the result in the variable tw.
  3. Remove all the stop words from the unique set of tw. Store the result in the variable fw.
  4. Stem each word present in fw with PorterStemmer, and store the result in the list psw.

Below is my code:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Step 1: tokenize on word characters (letters, digits, underscore)
pattern = r'\w+'
tw = nltk.regexp_tokenize(tc, pattern)
# Step 2: lowercase every token
tw = [word.lower() for word in tw]
# Step 3: remove stop words
stop_word = set(stopwords.words('english'))
fw = [w for w in tw if w not in stop_word]
# Step 4: stem each remaining word with PorterStemmer
porter = PorterStemmer()
psw = [porter.stem(word) for word in fw]
print(sorted(psw))
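
(Side note, not part of the original question: this assumes tc already holds the input text and that the NLTK stopwords corpus is installed locally. If it is not, a one-time download along these lines is needed.)

import nltk

# One-time setup: fetch the stop word list that stopwords.words('english') reads.
nltk.download('stopwords')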

My code passes all the provided test cases in the hands-on exercise, but it fails only for the test case below, where

tc = "I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon."

My output is:

['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']

Expected output is:

['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']

The difference is the number of occurrences of 'candi'.

I am looking for help troubleshooting this issue.


Answer 1:


Firstly, don't iterate through the text multiple times; see Why is my NLTK function slow when processing the DataFrame?

Do this instead, so you iterate through your data/text only once:

from nltk import regexp_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_word = set(stopwords.words('english'))
porter = PorterStemmer()

text = "I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon."

signature = [porter.stem(word.lower())
             for word in regexp_tokenize(text, r'\w+')
             if word.lower() not in stop_word]
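
A quick side check of my own, not part of the original answer: the r'\w+' pattern splits hyphenated and apostrophized words into separate tokens, so e.g. "dollar-per-candy" contributes the tokens dollar, per, and candy:

from nltk import regexp_tokenize

# \w+ does not match punctuation, so hyphens and apostrophes act as separators:
print(regexp_tokenize("chocolate-caramel-almond", r'\w+'))
# ['chocolate', 'caramel', 'almond']
print(regexp_tokenize("See's", r'\w+'))
# ['See', 's']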

Next, let's check against the expected output:

expected = ['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']

sorted(signature) == expected  # -> False

[out]:

False

That's not a good sign; let's find which terms are missing:

# If item in signature but not in expected.
len(set(signature).difference(expected)) == 0  # -> True
# If item in expected but not in signature. 
len(set(expected).difference(signature)) == 0  # -> True
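
Both differences are empty, so the two lists contain exactly the same unique stems. Sets ignore multiplicity, which is why these checks pass even though the list comparison above failed. A minimal illustration:

# Equal as sets, unequal as lists: duplicates are invisible to set comparison.
a, b = ['candi', 'candi'], ['candi']
print(set(a) == set(b))        # True
print(sorted(a) == sorted(b))  # False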

In that case, let's check the counts:

print(len(signature), len(expected))

[out]:

57 49

It seems your expected output is missing quite a few items. Checking through:

from collections import Counter
counter_signature = Counter(signature)
counter_expected = Counter(expected)


for word, count in counter_signature.items():
    # If the count in expected is different.
    expected_count = counter_expected[word]
    if count != expected_count: 
        print(word, count, expected_count)

It seems candi is not the only word with a different count!

[out]:

see 3 1
candi 5 3
dollar 3 1
two 2 1
chocol 2 1
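
As an aside (my addition, not in the original answer), Counter subtraction yields the same per-word diff in a single expression:

# Counter subtraction keeps only the positive differences:
print(counter_signature - counter_expected)
# Counter({'see': 2, 'candi': 2, 'dollar': 2, 'two': 1, 'chocol': 1})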

It looks like the signature (i.e. the processed text) contains higher counts than the expected output given in the question. So most probably the test you have is not counting things right =)




Answer 2:


Try using:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

pattern = r'\w+'
tw = nltk.regexp_tokenize(tc, pattern)
tw = [word.lower() for word in tw]
unique_tw = set(tw)  # unique set of tokenized words (see your step 3)
stop_word = set(stopwords.words('english'))
fw = [w for w in unique_tw if w not in stop_word]  # remove stop words from unique_tw
porter = PorterStemmer()
psw = [porter.stem(word) for word in fw]
print(sorted(psw))

This is because step 3 says: "Remove all the stop words from the unique set of tw."
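
Combining this fix with the single-pass style from Answer 1 gives a more compact version; a minimal sketch, assuming tc holds the input text and stop_word and porter are defined as above:

# Dedupe the lowercased tokens first, then drop stop words and stem in one pass.
unique_tokens = set(word.lower() for word in nltk.regexp_tokenize(tc, r'\w+'))
psw = [porter.stem(w) for w in unique_tokens if w not in stop_word]
print(sorted(psw))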



Source: https://stackoverflow.com/questions/62626878/why-is-the-number-of-stem-from-nltk-stemmer-outputs-different-from-expected-outp
