NLTK: corpus-level BLEU vs sentence-level BLEU score


I have imported nltk in Python to calculate the BLEU score on Ubuntu. I understand how the sentence-level BLEU score works, but I don't understand how the corpus-level BLEU score works.

2 Answers
  • 2020-12-14 10:48

    Let's take a look:

    >>> help(nltk.translate.bleu_score.corpus_bleu)
    Help on function corpus_bleu in module nltk.translate.bleu_score:
    
    corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=None)
        Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all 
        the hypotheses and their respective references.  
    
        Instead of averaging the sentence level BLEU scores (i.e. macro-average 
        precision), the original BLEU metric (Papineni et al. 2002) accounts for 
        the micro-average precision (i.e. summing the numerators and denominators
        for each hypothesis-reference(s) pairs before the division).
        ...
    

    You're in a better position than me to understand the description of the algorithm, so I won't try to "explain" it to you. If the docstring does not clear things up enough, take a look at the source itself. Or find it locally:

    >>> nltk.translate.bleu_score.__file__
    '.../lib/python3.4/site-packages/nltk/translate/bleu_score.py'
    
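    To make the docstring's point concrete: corpus_bleu() pools the n-gram counts across the whole corpus before dividing (micro-average), rather than averaging the per-sentence scores (macro-average). Here is a minimal sketch with made-up sentences showing that the two quantities generally differ:

    from nltk.translate.bleu_score import corpus_bleu, sentence_bleu

    # Two hypothetical hypothesis/reference pairs (sentences made up for illustration).
    ref1 = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    hyp1 = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    ref2 = ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']
    hyp2 = ['there', 'is', 'a', 'cat', 'on', 'mat']

    # Micro-average: sum the n-gram numerators and denominators over all
    # sentences, then divide -- this is what corpus_bleu() computes.
    micro = corpus_bleu([[ref1], [ref2]], [hyp1, hyp2])

    # Macro-average: score each sentence separately, then average.
    # This is NOT what corpus_bleu() computes.
    macro = (sentence_bleu([ref1], hyp1) + sentence_bleu([ref2], hyp2)) / 2

    print(micro, macro)  # the two values differ in general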
  • 2020-12-14 11:11

    TL;DR:

    >>> import nltk
    >>> hypothesis = ['This', 'is', 'cat'] 
    >>> reference = ['This', 'is', 'a', 'cat']
    >>> references = [reference] # list of references for 1 sentence.
    >>> list_of_references = [references] # list of references for all sentences in corpus.
    >>> list_of_hypotheses = [hypothesis] # list of hypotheses that corresponds to list of references.
    >>> nltk.translate.bleu_score.corpus_bleu(list_of_references, list_of_hypotheses)
    0.6025286104785453
    >>> nltk.translate.bleu_score.sentence_bleu(references, hypothesis)
    0.6025286104785453
    

    (Note: You have to pull the latest version of NLTK on the develop branch in order to get a stable version of the BLEU score implementation)


    In Long:

    Actually, if there's only one reference and one hypothesis in your whole corpus, both corpus_bleu() and sentence_bleu() should return the same value as shown in the example above.

    In the code, we see that sentence_bleu is really just a thin wrapper around corpus_bleu:

    def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=None):
        return corpus_bleu([references], [hypothesis], weights, smoothing_function)
    

    And if we look at the parameters for sentence_bleu:

    def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=None):
        """
        :param references: reference sentences
        :type references: list(list(str))
        :param hypothesis: a hypothesis sentence
        :type hypothesis: list(str)
        :param weights: weights for unigrams, bigrams, trigrams and so on
        :type weights: list(float)
        :return: The sentence-level BLEU score.
        :rtype: float
        """
    

    The input for sentence_bleu's references is a list(list(str)).

    So if you have a sentence string, e.g. "This is a cat", you have to tokenize it to get a list of strings, ["This", "is", "a", "cat"]. And since multiple references are allowed, references has to be a list of lists of strings; e.g. if you have a second reference, "This is a feline", your input to sentence_bleu() would be:

    references = [ ["This", "is", "a", "cat"], ["This", "is", "a", "feline"] ]
    hypothesis = ["This", "is", "cat"]
    sentence_bleu(references, hypothesis)
    
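    One caveat: the hypothesis above has only three tokens, so with the default 4-gram weights there are no 3-gram or 4-gram matches, and recent NLTK releases warn that the unsmoothed score can collapse to zero. The smoothing_function parameter shown in the signatures above addresses this; a sketch using NLTK's SmoothingFunction (assuming a reasonably recent NLTK release):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = [["This", "is", "a", "cat"], ["This", "is", "a", "feline"]]
    hypothesis = ["This", "is", "cat"]

    # method1 adds a small epsilon to zero n-gram counts so that the geometric
    # mean over the four n-gram precisions does not become zero.
    chencherry = SmoothingFunction()
    print(sentence_bleu(references, hypothesis,
                        smoothing_function=chencherry.method1))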

    When it comes to corpus_bleu()'s list_of_references parameter, it's basically a list of whatever sentence_bleu() takes as references, one entry per hypothesis:

    def corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=None):
        """
        :param list_of_references: a corpus of lists of reference sentences, w.r.t. hypotheses
        :type list_of_references: list(list(list(str)))
        :param hypotheses: a list of hypothesis sentences
        :type hypotheses: list(list(str))
        :param weights: weights for unigrams, bigrams, trigrams and so on
        :type weights: list(float)
        :return: The corpus-level BLEU score.
        :rtype: float
        """
    
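    To make the nesting concrete, here is a minimal sketch of a two-sentence corpus (sentences made up for illustration). list_of_references[i] holds the references for hypotheses[i], so the two lists must align by index:

    from nltk.translate.bleu_score import corpus_bleu

    hypotheses = [
        ['This', 'is', 'a', 'cat'],
        ['It', 'sat', 'on', 'the', 'mat'],
    ]
    list_of_references = [
        [['This', 'is', 'a', 'cat'], ['This', 'is', 'a', 'feline']],  # 2 refs for sentence 1
        [['It', 'sat', 'on', 'the', 'mat']],                          # 1 ref for sentence 2
    ]

    # Both hypotheses match one of their references exactly, so the score is 1.0.
    print(corpus_bleu(list_of_references, hypotheses))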

    Besides looking at the doctests within nltk/translate/bleu_score.py, you can also take a look at the unit tests in nltk/test/unit/translate/test_bleu_score.py to see how to use each of the components within bleu_score.py.

    By the way, since sentence_bleu is imported as bleu in [nltk.translate.__init__.py](https://github.com/nltk/nltk/blob/develop/nltk/translate/__init__.py#L21), using

    from nltk.translate import bleu 
    

    would be the same as:

    from nltk.translate.bleu_score import sentence_bleu
    

    and in code:

    >>> from nltk.translate import bleu
    >>> from nltk.translate.bleu_score import sentence_bleu
    >>> from nltk.translate.bleu_score import corpus_bleu
    >>> bleu == sentence_bleu
    True
    >>> bleu == corpus_bleu
    False
    