How do I find the frequency count of a word in English using WordNet?

爱一瞬间的悲伤 2020-12-07 23:27

Is there a way to find the frequency of the usage of a word in the English language using WordNet or NLTK using Python?

NOTE: I do not want the frequency count of a word in a particular file or corpus, but rather how frequently the word is used in the language in general.

6 Answers
  • 2020-12-08 00:04

    In WordNet, every Lemma has a frequency count, returned by the method lemma.count() and stored in the file nltk_data/corpora/wordnet/cntlist.rev.

    Code example:

    from nltk.corpus import wordnet
    syns = wordnet.synsets('stack')
    for s in syns:
        for l in s.lemmas():
            print l.name + " " + str(l.count())
    

    Result:

    stack 2
    batch 0
    deal 1
    flock 1
    good_deal 13
    great_deal 10
    hatful 0
    heap 2
    lot 13
    mass 14
    mess 0
    ...
    

    However, many counts are zero, and neither the source file nor the documentation says which corpus was used to create this data. According to the book Speech and Language Processing by Daniel Jurafsky and James H. Martin, the sense frequencies come from the SemCor corpus, which is a subset of the already small and outdated Brown Corpus.

    So it's probably best to choose the corpus that best fits your application and create the data yourself, as Christopher suggested (a rough sketch of this follows the Python 3 example below).

    To make this Python 3.x compatible, just do:

    Code example:

    from nltk.corpus import wordnet
    syns = wordnet.synsets('stack')
    for s in syns:
        for l in s.lemmas():
            print(l.name() + " " + str(l.count()))
    
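    As an illustration of building the counts yourself, here is a minimal sketch that counts surface forms in a corpus of your choice (the Brown corpus via NLTK is only an example) and looks those counts up for the lemmas of a synset:

    from nltk.corpus import brown, wordnet
    from nltk.probability import FreqDist

    # Count lowercased surface forms in the chosen corpus
    freq = FreqDist(w.lower() for w in brown.words())

    # Look up the corpus count for each lemma name of the 'stack' synsets;
    # multiword lemmas like 'good_deal' will simply get a count of 0 here
    for s in wordnet.synsets('stack'):
        for l in s.lemmas():
            print(l.name(), freq[l.name().lower()])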
  • 2020-12-08 00:05

    You can sort of do it using the Brown corpus, though it's out of date (last revised in 1979), so it's missing lots of current words.

    from nltk.corpus import brown
    from nltk.probability import FreqDist
    
    words = FreqDist()
    
    for sentence in brown.sents():
        for word in sentence:
            # FreqDist.inc() no longer exists in NLTK 3; plain indexing works
            words[word.lower()] += 1
    
    print(words["and"])        # raw count of "and" in the Brown corpus
    print(words.freq("and"))   # relative frequency
    

    You could then pickle the FreqDist to a file for faster loading later.
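
    A minimal sketch of that caching step (the file name is arbitrary):

    import pickle

    # Save the FreqDist once...
    with open("brown_freqdist.pickle", "wb") as f:
        pickle.dump(words, f)

    # ...and load it back later without re-reading the corpus
    with open("brown_freqdist.pickle", "rb") as f:
        words = pickle.load(f)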

    A corpus is basically just a file full of sentences, one per line, and there are lots of other corpora out there, so you could probably find one that fits your purpose. A couple of other sources of more current corpora: Google, American National Corpus.

    You can also supposedly get a current list of the top 60,000 words and their frequencies from the Corpus of Contemporary American English (COCA).

  • 2020-12-08 00:07

    You can't really do this, because it depends so much on the context. Not only that, for less frequent words the frequency will be wildly dependent on the sample.

    Your best bet is probably to find a large corpus of text of the given genre (e.g. download a hundred books from Project Gutenberg) and count the words yourself.
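
    As a rough sketch of that approach, here is a count over NLTK's small bundled sample of Project Gutenberg texts (a real setup would download far more books):

    from nltk.corpus import gutenberg
    from nltk.probability import FreqDist

    # Count alphabetic tokens, lowercased, across the bundled Gutenberg sample
    words = FreqDist(w.lower() for w in gutenberg.words() if w.isalpha())

    print(words["whale"])       # raw count
    print(words.freq("whale"))  # relative frequency within the sample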

  • 2020-12-08 00:15

    The Wiktionary project has a few frequency lists based on TV scripts and Project Gutenberg, but their format is not particularly nice for parsing.

  • 2020-12-08 00:17

    Take a look at the Information Content section of the WordNet Similarity project at http://wn-similarity.sourceforge.net/. There you will find databases of word frequencies (or, rather, information content, which is derived from word frequency) of WordNet lemmas, calculated from several different corpora. The source code is in Perl, but the databases are provided separately and can easily be used with NLTK.
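
    NLTK also ships some of these information-content files itself; a minimal sketch, assuming you have run nltk.download('wordnet_ic'):

    from nltk.corpus import wordnet, wordnet_ic

    # Information-content counts derived from the Brown corpus
    brown_ic = wordnet_ic.ic('ic-brown.dat')

    dog = wordnet.synset('dog.n.01')
    cat = wordnet.synset('cat.n.01')
    # Resnik similarity is computed from those frequency-based IC values
    print(dog.res_similarity(cat, brown_ic))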

  • 2020-12-08 00:19

    Check out this site for word frequencies: http://corpus.byu.edu/coca/

    Somebody compiled a list of words taken from opensubtitles.org (movie scripts). A free plain-text file formatted like this is available for download, in many different languages (a small loading sketch follows the link below).

    you 6281002
    i 5685306
    the 4768490
    to 3453407
    a 3048287
    it 2879962
    

    http://invokeit.wordpress.com/frequency-word-lists/
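
    A minimal sketch of loading such a "word count" list into a dictionary (the file name is hypothetical; use whichever language file you downloaded):

    freqs = {}
    with open("en_full.txt", encoding="utf-8") as f:
        for line in f:
            word, count = line.split()
            freqs[word] = int(count)

    print(freqs.get("you", 0))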
