I'd recommend that, instead of reinventing the wheel, you use Apache Solr for search and analysis. It has almost everything you might need, including stop-word detection for 30+ languages (as far as I can remember; it might be even more), and it can do tons of stuff with the data stored in it.
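
To give a rough idea of what talking to it looks like, here is a minimal sketch that only builds the URL for a query against Solr's standard `/select` handler. The host and port (`localhost:8983`), the core name `articles`, and the field name `text` are assumptions for illustration, not something from your setup; the helper `build_select_url` is hypothetical as well.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's /select handler (hypothetical helper)."""
    # q = query string, rows = max results, wt = response format
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Assumed values: a local Solr instance with a core named "articles"
# and an indexed field named "text".
url = build_select_url("http://localhost:8983", "articles", "text:wheel")
print(url)
```

You could then fetch that URL with any HTTP client. Stop-word removal is not something you do in the client; it is applied on the Solr side by whatever analyzer is configured for the field.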