Multi-stage data storage, processing, and retrieval
Efficient distribution of the above tasks across hundreds of thousands of machines
A good framework for storing the raw data and the processed results
A good framework for retrieving the results
Exactly how all of this is done is summarized in the links you already have in your question summary.
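
To make the points above a bit more concrete, here is a toy sketch of the store / process / retrieve stages in Python. It is only an illustration under heavy assumptions: the in-memory `RAW_STORE`, the helper names (`shard`, `process_shard`, `merge`, `build_index`, `lookup`), and the use of a thread pool as a stand-in for many machines are all hypothetical, and it does not describe how any real system from those links is implemented.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Stage 1 (storage): a hypothetical raw store, document id -> raw text.
RAW_STORE = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog",
    "doc3": "quick quick dog",
}


def shard(keys, n_shards):
    """Split keys into n_shards groups; each group stands in for one machine."""
    keys = sorted(keys)
    return [keys[i::n_shards] for i in range(n_shards)]


def process_shard(doc_ids):
    """Stage 2 (processing): build a partial word -> {doc ids} index."""
    partial = defaultdict(set)
    for doc_id in doc_ids:
        for word in RAW_STORE[doc_id].split():
            partial[word].add(doc_id)
    return partial


def merge(partials):
    """Combine the partial indexes into one processed-results store."""
    index = defaultdict(set)
    for partial in partials:
        for word, ids in partial.items():
            index[word] |= ids
    return dict(index)


def build_index(n_workers=2):
    """Distribute the processing over n_workers and merge what they return."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(process_shard, shard(RAW_STORE, n_workers)))
    return merge(partials)


def lookup(index, word):
    """Stage 3 (retrieval): read the answer from the processed results."""
    return sorted(index.get(word, ()))
```

Calling `index = build_index()` once and then `lookup(index, "quick")` as often as needed reflects the split in the list: the expensive work runs ahead of time across workers, and retrieval only reads the stored results.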