How do I find the frequency count of a word in English using WordNet?

爱一瞬间的悲伤 2020-12-07 23:27

Is there a way to find the frequency of the usage of a word in the English language using WordNet or NLTK using Python?

NOTE: I do not want the frequency count of a word in a particular file or corpus, but rather how frequently the word is used in the language in general.

6 Answers
  • 2020-12-08 00:04

    In WordNet, every Lemma has a frequency count, returned by the method lemma.count() and stored in the file nltk_data/corpora/wordnet/cntlist.rev.

    Code example:

    from nltk.corpus import wordnet
    syns = wordnet.synsets('stack')
    for s in syns:
        for l in s.lemmas():
            print l.name + " " + str(l.count())
    

    Result:

    stack 2
    batch 0
    deal 1
    flock 1
    good_deal 13
    great_deal 10
    hatful 0
    heap 2
    lot 13
    mass 14
    mess 0
    ...
    

    However, many counts are zero, and neither the source file nor the documentation says which corpus was used to create this data. According to the book Speech and Language Processing by Daniel Jurafsky and James H. Martin, the sense frequencies come from the SemCor corpus, which is a subset of the already small and outdated Brown Corpus.

    So it's probably best to choose the corpus that best fits your application and create the data yourself, as Christopher suggested (a rough sketch of this follows the Python 3 example below).

    To make this Python 3.x compatible, just do:

    Code example:

    from nltk.corpus import wordnet
    syns = wordnet.synsets('stack')
    for s in syns:
        for l in s.lemmas():
            print(l.name() + " " + str(l.count()))
    
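    As an illustration of building the counts yourself, here is a minimal sketch that counts surface forms in a corpus of your choice (the Brown corpus via NLTK is only an example) and looks those counts up for the lemmas of a synset:

    from nltk.corpus import brown, wordnet
    from nltk.probability import FreqDist

    # Count lowercased surface forms in the chosen corpus
    freq = FreqDist(w.lower() for w in brown.words())

    # Look up the corpus count for each lemma name of the 'stack' synsets;
    # multiword lemmas like 'good_deal' will simply get a count of 0 here
    for s in wordnet.synsets('stack'):
        for l in s.lemmas():
            print(l.name(), freq[l.name().lower()])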
  • 2020-12-08 00:05

    You can sort of do it using the Brown corpus, though it's out of date (last revised in 1979), so it's missing lots of current words.

    from nltk.corpus import brown
    from nltk.probability import FreqDist
    
    words = FreqDist()
    
    for sentence in brown.sents():
        for word in sentence:
            # FreqDist.inc() no longer exists in NLTK 3; plain indexing works
            words[word.lower()] += 1
    
    print(words["and"])        # raw count of "and" in the Brown corpus
    print(words.freq("and"))   # relative frequency
    

    You could then pickle the FreqDist to a file for faster loading later.
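
    A minimal sketch of that caching step (the file name is arbitrary):

    import pickle

    # Save the FreqDist once...
    with open("brown_freqdist.pickle", "wb") as f:
        pickle.dump(words, f)

    # ...and load it back later without re-reading the corpus
    with open("brown_freqdist.pickle", "rb") as f:
        words = pickle.load(f)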

    A corpus is basically just a file full of sentences, one per line, and there are lots of other corpora out there, so you could probably find one that fits your purpose. A couple of other sources of more current corpora: Google, American National Corpus.

    You can also supposedly get a current list of the top 60,000 words and their frequencies from the Corpus of Contemporary American English (COCA).

  • 2020-12-08 00:07

    You can't really do this, because it depends so much on the context. Not only that, for less frequent words the frequency will be wildly dependent on the sample.

    Your best bet is probably to find a large corpus of text of the given genre (e.g. download a hundred books from Project Gutenberg) and count the words yourself.
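
    As a rough sketch of that approach, here is a count over NLTK's small bundled sample of Project Gutenberg texts (a real setup would download far more books):

    from nltk.corpus import gutenberg
    from nltk.probability import FreqDist

    # Count alphabetic tokens, lowercased, across the bundled Gutenberg sample
    words = FreqDist(w.lower() for w in gutenberg.words() if w.isalpha())

    print(words["whale"])       # raw count
    print(words.freq("whale"))  # relative frequency within the sample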

  • 2020-12-08 00:15

    The Wiktionary project has a few frequency lists based on TV scripts and Project Gutenberg, but their format is not particularly nice for parsing.

  • 2020-12-08 00:17

    Take a look at the Information Content section of the WordNet Similarity project at http://wn-similarity.sourceforge.net/. There you will find databases of word frequencies (or, rather, information content, which is derived from word frequency) of WordNet lemmas, calculated from several different corpora. The source code is in Perl, but the databases are provided separately and can easily be used with NLTK.
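
    NLTK also ships some of these information-content files itself; a minimal sketch, assuming you have run nltk.download('wordnet_ic'):

    from nltk.corpus import wordnet, wordnet_ic

    # Information-content counts derived from the Brown corpus
    brown_ic = wordnet_ic.ic('ic-brown.dat')

    dog = wordnet.synset('dog.n.01')
    cat = wordnet.synset('cat.n.01')
    # Resnik similarity is computed from those frequency-based IC values
    print(dog.res_similarity(cat, brown_ic))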

  • 2020-12-08 00:19

    Check out this site for word frequencies: http://corpus.byu.edu/coca/

    Somebody compiled a list of words taken from opensubtitles.org (movie scripts). A free plain-text file formatted like this is available for download, in many different languages (a small loading sketch follows the link below).

    you 6281002
    i 5685306
    the 4768490
    to 3453407
    a 3048287
    it 2879962
    

    http://invokeit.wordpress.com/frequency-word-lists/
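
    A minimal sketch of loading such a "word count" list into a dictionary (the file name is hypothetical; use whichever language file you downloaded):

    freqs = {}
    with open("en_full.txt", encoding="utf-8") as f:
        for line in f:
            word, count = line.split()
            freqs[word] = int(count)

    print(freqs.get("you", 0))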
