This is an interview question. Suppose there are a few computers and each computer keeps a very large log file of visited URLs. Find the top ten most visited URLs.
Assuming the conditions below are true:
I would take the approach below:
Each node reads a portion of its file (i.e. MAX URLs, where MAX can be, let's say, 1000) and keeps an array arr[MAX]={url,hits}.
When a node has read MAX URLs from the file, it sends the list to the master node and resumes reading until MAX URLs is reached again.
When a node reaches EOF, it sends the remaining list of URLs and an EOF flag to the master node.
When the master node receives a list of URLs, it merges it with its current list and generates a new, updated one.
When the master node has received the EOF flag from every node and has finished reading its own file, the top n URLs of the last version of its list are the ones we're looking for.
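A minimal single-process sketch of the steps above, to make the data flow concrete. It assumes that "MAX URLs" means MAX log lines read, that each log is an iterable of URL strings, and that the master keeps the full merged table of counts. The names node_batches and master_top_n are hypothetical, and the network transfer is replaced by a plain generator.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

MAX = 1000  # batch size; only an example value, as in the text


def node_batches(urls: Iterable[str], max_urls: int = MAX) -> Iterator[Counter]:
    """Yield a {url: hits} table each time max_urls log lines have been read."""
    batch: Counter = Counter()
    read = 0
    for url in urls:
        batch[url] += 1
        read += 1
        if read == max_urls:
            yield batch  # "send the list to the master"
            batch, read = Counter(), 0
    if batch:
        yield batch  # remaining URLs at EOF


def master_top_n(node_logs: Iterable[Iterable[str]], n: int = 10,
                 max_urls: int = MAX) -> List[Tuple[str, int]]:
    """Merge every batch from every node, then return the n most visited URLs."""
    totals: Counter = Counter()
    for log in node_logs:  # the generator ending stands in for the EOF flag
        for batch in node_batches(log, max_urls):
            totals.update(batch)  # adds hit counts for URLs already seen
    return totals.most_common(n)
```

For example, master_top_n([["a", "b", "a"], ["b", "c", "a", "a"]], n=2, max_urls=2) merges four batches and returns the two URLs with the highest total hits.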
Or
A different approach that would relieve the master of doing all the work could be:
Every node reads its whole file, up to EOF, and stores the same {url,hits} array as above.
On reaching EOF, the node sends the first URL of its list and its number of hits to the master.
When the master has collected the first URL and number of hits from each node, it generates a list. If the master node has fewer than n URLs, it asks the nodes to send their second one, and so on, until the master has n URLs sorted.
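A literal sketch of this round-based exchange, under the assumptions that each node's list is sorted by hits (highest first) and that hits for the same URL reported by different nodes are added together. The names node_ranking and master_collect are hypothetical, and the request/response messages are replaced by list indexing.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Ranking = List[Tuple[str, int]]


def node_ranking(urls: Iterable[str]) -> Ranking:
    """Count every URL in one node's log and sort by hits, highest first."""
    return Counter(urls).most_common()


def master_collect(rankings: List[Ranking], n: int = 10) -> Ranking:
    """Ask every node for its next-best entry, round by round, until n URLs are known."""
    merged: Counter = Counter()
    rank = 0  # 0 = each node's first URL, 1 = its second, and so on
    while len(merged) < n and any(rank < len(r) for r in rankings):
        for r in rankings:
            if rank < len(r):
                url, hits = r[rank]
                merged[url] += hits  # assumption: hits from different nodes are summed
        rank += 1
    return merged.most_common(n)
```

Because the master only sees the entries it has asked for, this version can miss a URL whose hits are spread across many nodes without being near the top of any single node's list.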