Hey all, just getting started on Hadoop and curious what the best way in MapReduce would be to count unique visitors if your logfiles looked like this...
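Use a secondary sort on user id: emit a composite key of (site id, user id), partition and group on site id only, and sort on the full key. Each reducer then sees one site's records already ordered by user id, so you don't need to hold anything in memory. Just stream the data through and increment your distinct counter every time the user id changes within a particular site id.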
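Here's a rough sketch of the reduce-side logic, in plain Python rather than the actual Hadoop API. It assumes each log line has already been parsed into a (site_id, user_id) pair, which is a guess since the real log format isn't shown. The names `count_unique_visitors` and `records` are made up for illustration, and `sorted()` just stands in for the ordering the shuffle and secondary sort would give you.

```python
from itertools import groupby
from operator import itemgetter


def count_unique_visitors(sorted_records):
    """Yield (site_id, distinct_user_count) pairs.

    Expects an iterable of (site_id, user_id) tuples already sorted by
    site_id and then user_id, as a secondary sort would deliver them.
    Only the previous user id is kept in memory, never a set of users.
    """
    for site_id, group in groupby(sorted_records, key=itemgetter(0)):
        distinct = 0
        previous_user = object()  # sentinel that never equals a real user id
        for _, user_id in group:
            if user_id != previous_user:
                # The user id changed, so this is a visitor we have not counted.
                distinct += 1
                previous_user = user_id
        yield site_id, distinct


# Hypothetical parsed log records: (site_id, user_id)
records = [
    ("site_a", "u1"),
    ("site_b", "u2"),
    ("site_a", "u2"),
    ("site_a", "u1"),
    ("site_b", "u2"),
]

# sorted() plays the role of the shuffle plus secondary sort here.
result = dict(count_unique_visitors(sorted(records)))
print(result)
```

In a real job the same loop would sit inside the reducer, with a custom partitioner and grouping comparator on site id doing what `sorted()` and `groupby` do in this sketch.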