Why are stop words not being excluded from the word cloud when using Python's wordcloud library?

妖精的绣舞 submitted on 2020-06-28 04:04:43

Question


I want to exclude 'The', 'They' and 'My' from being displayed in my word cloud. I'm using the python library 'wordcloud' as below, and updating the STOPWORDS list with these 3 additional stopwords, but the wordcloud is still including them. What do I need to change so that these 3 words are excluded?

The libraries I imported are:

import numpy as np
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

I've tried adding elements to the STOPWORDS set as follows. Even though the words are added successfully, the wordcloud still shows the 3 words I added to the STOPWORDS set:

len(STOPWORDS)  # Outputs: 192

Then I ran:

STOPWORDS.add('The')
STOPWORDS.add('They')
STOPWORDS.add('My')

Then I ran:

len(STOPWORDS)  # Outputs: 195

I'm running Python 3.7.3.

I know I could amend the text input to remove the 3 words (rather than trying to amend WordCloud's STOPWORDS set) before running the wordcloud but I was wondering if there's a bug with WordCloud or whether I'm not updating/using STOPWORDS correctly?


Answer 1:


By default, WordCloud uses collocations=True, so frequent two-word phrases (pairs of adjacent words) are included in the cloud. Importantly for your issue, stopword removal works differently with collocations: for example, “Thank you” is a valid collocation and may appear in the generated cloud even though “you” is in the default stopwords. Only collocations that consist entirely of stopwords are removed.

The rationale sounds reasonable: if stopwords were removed before building the list of collocations, then e.g. “thank you very much” would yield “thank very” as a collocation, which I definitely wouldn’t want.
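To make that rationale concrete, here is a minimal pure-Python sketch. It is not the wordcloud library's own code: the `bigrams` helper and the tiny stopword set are made up for illustration. It compares the two-word phrases you get when pairing words from the original text with those you get when stopwords are dropped first.

```python
def bigrams(words):
    """Return adjacent word pairs, each joined by a space."""
    return [" ".join(pair) for pair in zip(words, words[1:])]


# Illustrative stopword set only; far smaller than wordcloud's STOPWORDS.
stopwords = {"you", "much"}
words = "thank you very much".split()

# Pairs built from the original text are real adjacent phrases.
from_original = bigrams(words)

# Dropping stopwords first pairs up words that were never adjacent.
filtered = [w for w in words if w not in stopwords]
from_filtered = bigrams(filtered)

print(from_original)  # ['thank you', 'you very', 'very much']
print(from_filtered)  # ['thank very']
```

Building the pairs first and filtering afterwards keeps genuine phrases such as "thank you", at the cost of letting individual stopwords show up inside a collocation.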

So to get stopwords to behave the way you probably expect, i.e. no stopwords at all appear in the cloud, you could use collocations=False like this:

my_wordcloud = WordCloud(
    stopwords=my_stopwords,
    background_color='white', 
    collocations=False, 
    max_words=10).generate(all_tweets_as_one_string)

UPDATE:

  • With collocations=False, stopwords are all lowercased and compared with the lowercased text when removing them, so there is no need to add 'The' etc.
  • With collocations=True (the default), stopwords are still lowercased, but when looking for all-stopword collocations to remove, the source text isn't lowercased. So e.g. The in the text isn't removed while the is removed. That's why @Balaji Ambresh's code works, and you'll see that there are no caps in the cloud. This might be a defect in WordCloud, not sure. However, adding e.g. The to stopwords won't affect this, because stopwords are always lowercased regardless of whether collocations is True or False.
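The case mismatch described in these two points can be reproduced with a few lines of plain Python. This is only a sketch of the described behaviour, not wordcloud's actual implementation; `keep_case_sensitive` and `keep_case_insensitive` are hypothetical helper names, and the lowercasing of the stopword set is an assumption taken from the explanation above.

```python
# Stopwords as a user might supply them, including capitalised entries.
user_stopwords = {"the", "The", "They", "My"}

# Assumption from the answer: the stopword set is lowercased before use,
# so adding 'The' makes no difference once 'the' is already present.
lowered = {w.lower() for w in user_stopwords}

tokens = ["The", "bear", "sat", "with", "the", "cat"]


def keep_case_sensitive(tokens, stopwords):
    """Compare tokens as-is: capitalised tokens slip past lowercase stopwords."""
    return [t for t in tokens if t not in stopwords]


def keep_case_insensitive(tokens, stopwords):
    """Lowercase each token before comparing, so 'The' matches 'the'."""
    return [t for t in tokens if t.lower() not in stopwords]


print(keep_case_sensitive(tokens, lowered))    # ['The', 'bear', 'sat', 'with', 'cat']
print(keep_case_insensitive(tokens, lowered))  # ['bear', 'sat', 'with', 'cat']
```

Lowercasing the input text before generating the cloud sidesteps the mismatch entirely, because every token then already matches the lowercased stopwords.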

This is all visible in the source code :-)

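For example, with the default collocations=True I get:

[image: word cloud generated with collocations=True]

And with collocations=False I get:

[image: word cloud generated with collocations=False]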

Code:

from wordcloud import WordCloud
from matplotlib import pyplot as plt


text = (
    "The bear sat with the cat. They were good friends. "
    "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat "
    "there enjoying the view. You should have seen it. The view was absolutely lovely. "
    "It was such a lovely day. The bear was loving it too."
)

cloud = WordCloud(collocations=False,
                  background_color='white',
                  max_words=10).generate(text)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()



Answer 2:


pip install nltk

Don't forget to download the NLTK stopwords corpus:

python
>>> import nltk
>>> nltk.download('stopwords')

Give this a shot:

from wordcloud import WordCloud
from matplotlib import pyplot as plt

from nltk.corpus import stopwords

# Use a distinct name so the imported nltk `stopwords` object is not shadowed.
stop_words = set(stopwords.words('english'))

text = (
    "The bear sat with the cat. They were good friends. "
    "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat "
    "there enjoying the view. You should have seen it. The view was absolutely lovely. "
    "It was such a lovely day. The bear was loving it too."
)

# Lowercase the text so capitalised words such as "The" match the lowercase stopwords.
cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  max_words=10).generate(text.lower())
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()


Source: https://stackoverflow.com/questions/61953788/why-are-stop-words-not-being-excluded-from-the-word-cloud-when-using-pythons-wo
