Why is perplexity infinite for a padded-vocabulary nltk.lm bigram model?

情歌与酒 · 2021-01-26 23:23

I am testing the perplexity measure for a language model for a text:

  train_sentences = nltk.sent_tokenize(train_text)
  test_sentences = nltk.sent_tokenize(test_text)
1 Answer
  • 2021-01-26 23:52

    The input to perplexity must be text as ngrams, not a list of strings. You can verify this by running:

    for x in test_text:
        print([((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in x])
    

    You should see that the tokens (ngrams) are all wrong.
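
    To see why, note that iterating over a Python string yields single characters, so when test_text is a list of plain strings, model.score ends up being called on characters with empty contexts rather than on word ngrams. A minimal sketch of that behaviour:

    sent = "an apple"                 # one entry of a string-based test_text
    print([ngram for ngram in sent])  # ['a', 'n', ' ', 'a', 'p', 'p', 'l', 'e']
    # For a one-character string, ngram[-1] is the character itself and
    # ngram[:-1] is the empty string, which is not a valid ngram/context pair.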

    You will still get inf perplexity if words in your test data are out of the training vocabulary:

    import nltk
    from nltk.lm.preprocessing import padded_everygram_pipeline
    from nltk.lm import MLE, Laplace
    
    # Toy corpora; in the question these come from
    # nltk.sent_tokenize(train_text) and nltk.sent_tokenize(test_text)
    train_sentences = ['an apple', 'an orange']
    test_sentences = ['an apple']
    
    train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                            for sent in train_sentences]
    test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                           for sent in test_sentences]
    
    n = 1
    train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
    model = MLE(n)
    # Fit on the padded vocab so the model knows the tokens added by padding
    # (<s>, </s>, <UNK>, etc.)
    model.fit(train_data, padded_vocab)
    
    test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
    for test in test_data:
        print("per all", model.perplexity(test))
    
    # Out-of-vocabulary test data: 'ant' never occurs in training, so the
    # MLE model assigns it probability 0 and perplexity comes out inf
    test_sentences = ['an ant']
    test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                           for sent in test_sentences]
    test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
    for test in test_data:
        print("per all [oov]", model.perplexity(test))
    