bleu | 易学教程

Text Summarization Evaluation - BLEU vs ROUGE

Question: Given the outputs of two different summarization systems (sys1 and sys2) and the same reference summaries, I evaluated both systems with BLEU and ROUGE. The problem is that all ROUGE scores of sys1 were higher than those of sys2 (ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-L, ROUGE-SU4, ...), but the BLEU score of sys1 was considerably lower than the BLEU score of sys2. So my question is: both ROUGE and BLEU use n-grams to measure the similarity between system summaries and human summaries. So why
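One possible source of such a disagreement is that BLEU is built on n-gram precision (plus a brevity penalty), while ROUGE-N is usually reported as n-gram recall. The sketch below is a simplified illustration, not either metric's full definition: it uses unigrams only, leaves out BLEU's brevity penalty, and the function names and toy sentences are invented for this example.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Count candidate tokens that also occur in the reference (clipped counts)."""
    return sum((Counter(candidate) & Counter(reference)).values())


def unigram_precision(candidate, reference):
    """BLEU-style view: share of the candidate that is found in the reference."""
    return unigram_overlap(candidate, reference) / len(candidate)


def unigram_recall(candidate, reference):
    """ROUGE-1-style view: share of the reference that the candidate covers."""
    return unigram_overlap(candidate, reference) / len(reference)


# Toy data (made up for illustration).
reference = "the cat sat on the mat near the door".split()
short_summary = "the cat sat".split()
long_summary = "the cat sat on the mat near the big red door today".split()

short_scores = (unigram_precision(short_summary, reference),
                unigram_recall(short_summary, reference))
long_scores = (unigram_precision(long_summary, reference),
               unigram_recall(long_summary, reference))

print("short: precision=%.2f recall=%.2f" % short_scores)
print("long:  precision=%.2f recall=%.2f" % long_scores)
```

In this toy case the short summary wins on precision and the long summary wins on recall, so a precision-oriented score and a recall-oriented score can rank the same two systems in opposite order.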

NLTK: corpus-level bleu vs sentence-level BLEU score

I have imported nltk in Python to calculate the BLEU score on Ubuntu. I understand how the sentence-level BLEU score works, but I don't understand how the corpus-level BLEU score works. Below is my code for the corpus-level BLEU score:

    import nltk
    hypothesis = ['This', 'is', 'cat']
    reference = ['This', 'is', 'a', 'cat']
    BLEUscore = nltk.translate.bleu_score.corpus_bleu([reference], [hypothesis], weights = [1])
    print(BLEUscore)

For some reason, the BLEU score is 0 for the above code. I was expecting a corpus-level BLEU score of at least 0.5. Here is my code for the sentence-level BLEU score:

    import nltk
    hypothesis =
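A likely cause of the 0 in the corpus-level call is the nesting of the references. NLTK's `corpus_bleu` takes, for each hypothesis, a list of reference token lists, so a single reference would be passed as `[[reference]]` rather than `[reference]`. With `[reference]`, each word of the reference is treated as a separate reference "sentence" and nothing matches. The sketch below is a simplified pure-Python re-implementation for unigrams only, not NLTK's code; the name `corpus_bleu1` is hypothetical. It also shows how a corpus-level score pools match counts and lengths over all sentences before dividing.

```python
import math
from collections import Counter


def corpus_bleu1(list_of_references, hypotheses):
    """Unigram-only corpus-level BLEU (simplified sketch, not NLTK's implementation).

    list_of_references: for each hypothesis, a list of reference token lists.
    hypotheses: one token list per hypothesis.
    """
    matched = total = hyp_len = ref_len = 0
    for references, hypothesis in zip(list_of_references, hypotheses):
        # Clip each hypothesis token count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for token, count in Counter(ref).items():
                max_ref_counts[token] = max(max_ref_counts[token], count)
        matched += sum(min(count, max_ref_counts[token])
                       for token, count in Counter(hypothesis).items())
        total += len(hypothesis)
        hyp_len += len(hypothesis)
        # Pick the reference length closest to the hypothesis length.
        ref_len += min((len(ref) for ref in references),
                       key=lambda n: (abs(n - len(hypothesis)), n))
    if matched == 0:
        return 0.0
    # Counts and lengths are pooled over the corpus before the final division.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * matched / total


hypothesis = ['This', 'is', 'cat']
reference = ['This', 'is', 'a', 'cat']

score_nested = corpus_bleu1([[reference]], [hypothesis])  # one reference, correctly nested
score_flat = corpus_bleu1([reference], [hypothesis])      # the nesting used in the question

print(score_nested, score_flat)  # about 0.717 for the nested call, 0.0 for the flat one
```

With the nested form, all three hypothesis tokens match, so the unigram precision is 1 and the score comes entirely from the brevity penalty, exp(1 - 4/3). Under the same assumption about the argument shape, the corresponding NLTK call would be `corpus_bleu([[reference]], [hypothesis], weights=(1,))`.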