I need to find connected components for a huge dataset. (Graph being Undirected)
One obvious choice is MapReduce. But i\'m a newbie to MapReduce and am quiet short of ti
I blogged about it for myself:
http://codingwiththomas.blogspot.de/2011/04/graph-exploration-with-hadoop-mapreduce.html
But MapReduce isn't a good fit for these Graph analysis things. Better use BSP (bulk synchronous parallel) for that, Apache Hama provides a good graph API on top of Hadoop HDFS.
I've written a connected components algorithm with MapReduce here: (Mindist search)
https://github.com/thomasjungblut/tjungblut-graph/tree/master/src/de/jungblut/graph/mapreduce
Also a BSP version for Apache Hama can be found here:
https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/MindistSearch.java
The implementation isn't as difficult as in MapReduce and it is at least 10 times faster. If you're interested, checkout the latest version in TRUNK and visit our mailing list.
http://hama.apache.org/
http://apache.org/hama/mail-lists.html