What is a baseline and what is a benchmark? What is the best definition of each, and how do you baseline one set of numbers and benchmark another?
In scientific research, a benchmark is a kind of test and a baseline is a kind of result.
Let's look at an example of a benchmark test: we might take a collection of 5,000 sentences in English and use the lab's four-core Dell machine to translate them into Spanish using various algorithms. Because we've kept the data and the machine constant, we can meaningfully compare the time taken by the different algorithms to complete the task, as well as their relative accuracy (measured against gold-standard human translations).
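To make that concrete, here is a minimal sketch of what such a benchmark harness might look like in Python. The dataset, metric, and machine stay fixed, and only the algorithm under test varies. The translation functions passed in (e.g. `baseline_translate`, `neural_translate`) and the exact-match accuracy metric are placeholder assumptions, not a real evaluation setup:

```python
import time

def accuracy(outputs, gold):
    """Fraction of output sentences that exactly match the gold translation.
    (A real benchmark would use a softer metric such as BLEU.)"""
    matches = sum(1 for out, ref in zip(outputs, gold) if out == ref)
    return matches / len(gold)

def run_benchmark(algorithms, sentences, gold):
    """Run each algorithm over the same fixed test set and record
    both wall-clock time and accuracy against the gold references."""
    results = {}
    for name, translate in algorithms.items():
        start = time.perf_counter()
        outputs = [translate(s) for s in sentences]
        elapsed = time.perf_counter() - start
        results[name] = {"seconds": elapsed, "accuracy": accuracy(outputs, gold)}
    return results

# Hypothetical usage, with english_sentences being the fixed 5,000-sentence
# test set and spanish_gold the human reference translations:
# results = run_benchmark(
#     {"naive": baseline_translate, "neural": neural_translate},
#     english_sentences,
#     spanish_gold,
# )
```

Because every algorithm sees the same sentences, the same metric, and the same hardware, the numbers in `results` are directly comparable.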
To find a baseline for this benchmark test, we might write a very naive translation algorithm that just finds the commonest translation for each individual word, with no regard for the context. Measuring the accuracy of this algorithm against our human translations gives us an idea of the minimum score - the baseline - that the other algorithms must beat, and a feel for what level of accuracy counts as "good".
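A sketch of that naive baseline might look like the following. The `lexicon` argument is an assumed mapping from each English word to its single most frequent Spanish translation (however it was built); context is ignored entirely, which is exactly what makes this a baseline rather than a serious contender:

```python
def baseline_translate(sentence, lexicon):
    """Translate word by word using the commonest translation of each word,
    falling back to the original word when the lexicon has no entry."""
    words = sentence.lower().split()
    return " ".join(lexicon.get(w, w) for w in words)

# Hypothetical usage:
# lexicon = {"the": "el", "cat": "gato", "sleeps": "duerme"}
# baseline_translate("The cat sleeps", lexicon)  # -> "el gato duerme"
```

Scoring this function with the same benchmark harness gives the baseline number the more sophisticated systems have to beat.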
At the other end of the scale from a baseline, an upper bound is a useful yardstick too. In the translation example, we might find the upper bound by measuring the accuracy of one of our human translations with respect to the others. This gives us an idea of how high it's possible to get on our "accuracy" measure before hitting the ceiling of human disagreement. We expect our machine translation algorithms to perform at a level between the baseline and the upper bound.
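Estimating that upper bound can reuse the same accuracy metric. In the sketch below, `references` is assumed to be a list of human translation sets over the same sentences, one per annotator; we hold one out and score it against the rest:

```python
def upper_bound(references, accuracy):
    """Score one human reference against the remaining references using the
    same metric applied to the machine systems; the average is the ceiling
    imposed by human disagreement."""
    held_out, *others = references
    scores = [accuracy(held_out, other) for other in others]
    return sum(scores) / len(scores)
```

Any machine system is then expected to land somewhere between the naive baseline's score and this human-vs-human ceiling.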