We use Lucene to process live streams from the internet. It has a native Java API.
http://lucene.apache.org/java/docs/ 
You can then use Mahout, a collection of machine learning algorithms that can work on top of Lucene indexes.
http://lucene.apache.org/mahout/