I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can only look at the content. Based on t
Isn't the problem far easier if you've only got two languages (English and German) to choose from? In that case, your approach of using a list of stop words might be good enough.
Obviously you'd need to consider a rewrite if you added more languages to your list.
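
For what it's worth, here is a minimal sketch of the stop-word idea for the two-language case. The word lists are short, hand-picked samples I chose for illustration (not a vetted stop-word list), and the names `guess_language`, `ENGLISH_STOP_WORDS` and `GERMAN_STOP_WORDS` are made up for this example. It just counts how many tokens fall into each list and picks the larger count.

```python
import re

# Illustrative, hand-picked samples; a real list would be longer.
# The two sets are kept disjoint so that no word votes for both languages.
ENGLISH_STOP_WORDS = frozenset({
    "the", "and", "of", "to", "is", "that", "it", "with", "for", "this", "are", "not",
})
GERMAN_STOP_WORDS = frozenset({
    "der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "für",
    "den", "von", "zu", "auf", "sich",
})

# Runs of letters only (Unicode-aware, so "für" stays one token).
WORD_RE = re.compile(r"[^\W\d_]+")


def guess_language(text):
    """Return "en", "de", or "unknown" based on stop-word counts."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    english_hits = sum(w in ENGLISH_STOP_WORDS for w in words)
    german_hits = sum(w in GERMAN_STOP_WORDS for w in words)
    if english_hits > german_hits:
        return "en"
    if german_hits > english_hits:
        return "de"
    # Tie or no hits at all: not enough evidence either way.
    return "unknown"


if __name__ == "__main__":
    print(guess_language("The cat is on the mat and it is not happy with this."))
    print(guess_language("Der Hund ist nicht mit der Katze auf dem Sofa und das ist gut."))
```

Very short documents or ones that mix both languages can easily come out as "unknown" or wrong, so you would probably want to check it against a sample of your real data before trusting it.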