This is an interview question. Suppose there are a few computers and each computer keeps a very large log file of visited URLs. Find the top ten most visited URLs.
The below description is the idea for the solution. it is not a pseudocode.
Consider you have a collection of systems.
1.for each A: Collections(systems)
1.1) Run a daemonA in each computer which probes on the log file for changes.
1.2) When a change is noticed, wakeup AnalyzerThreadA
1.3) If AnalyzerThreadA finds a URL using some regex, then update localHashMapA with count++.
(key = URL, value = count ).
2) Push topTen entries of localHashMapA to ComputerA where AnalyzeAll daemon will be running.
The above step will be the last step in each system, which will push topTen entries to a master system, say for example: computerA.
3) AnalyzeAll running in computerA will resolve duplicates and update count in masterHashMap of URLs.
4) Print the topTen from masterHashMap.