Usage scenario
We have implemented a webservice that our web frontend developers use (via a php api) internally to display product data. On the webs
I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.
To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:
If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.
(I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)