Sort a file with a huge volume of data given a memory constraint

暖寄归人 2020-11-28 21:47

Points:

  • We process thousands of flat files in a day, concurrently.
  • Memory constraint is a major issue.
  • We use one thread per file, roughly as sketched below.
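
A minimal sketch of that setup (Java; processFile is a hypothetical stand-in for the per-file work): a fixed-size pool caps how many files are in flight at once, so memory use scales with the pool size rather than with the number of files.

    import java.nio.file.Path;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class FileWorkers {
        public static void main(String[] args) throws InterruptedException {
            // Bound concurrency so (pool size x per-file memory) stays inside the heap.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Path> files = List.of(Path.of("a.dat"), Path.of("b.dat")); // placeholder inputs
            for (Path file : files) {
                pool.submit(() -> processFile(file));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        // Hypothetical per-file work: parse, sort, and write out one flat file.
        static void processFile(Path file) {
            // ...
        }
    }
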
12 Answers
  •  粉色の甜心 (2020-11-28 22:17)

    I would spin up an EC2 cluster and run Hadoop's MergeSort.

    Edit: not sure how much detail you would like, or on what. EC2 is Amazon's Elastic Compute Cloud; it lets you rent virtual servers by the hour at low cost. Their website is https://aws.amazon.com/ec2/.

    Hadoop is an open-source MapReduce framework designed for parallel processing of large data sets. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (i.e., the divide-and-conquer strategy). Its website is https://hadoop.apache.org/.
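
    To make that concrete, here is a minimal sketch of a Hadoop sort job (class and path names are placeholders): the mapper emits each line as a key, and the shuffle phase sorts keys before they reach the default identity reducer, so a single reducer writes one globally sorted file.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class SortJob {
            // Emit each input line as the key; Hadoop sorts map output keys in the shuffle.
            public static class LineMapper extends Mapper<Object, Text, Text, NullWritable> {
                @Override
                protected void map(Object offset, Text line, Context context)
                        throws java.io.IOException, InterruptedException {
                    context.write(line, NullWritable.get());
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "sort");
                job.setJarByClass(SortJob.class);
                job.setMapperClass(LineMapper.class);
                // One reducer yields a single globally sorted output file;
                // with more reducers you would add a TotalOrderPartitioner.
                job.setNumReduceTasks(1);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(NullWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }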

    As mentioned by the other posters, external sorting is also a good strategy. I would decide between the two based on the size of the data and the speed requirements. A single machine is likely limited to processing one file at a time, since sorting it will use up the available memory, so look into something like EC2 only if you need to process files faster than that.
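
    For reference, a minimal external merge sort sketch (plain Java, assuming text lines sorted lexicographically; the chunk size is a placeholder to tune against your heap). Phase 1 sorts memory-sized chunks and spills each as a temp-file "run"; phase 2 streams the runs back through a k-way merge.

        import java.io.BufferedReader;
        import java.io.BufferedWriter;
        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.Comparator;
        import java.util.List;
        import java.util.PriorityQueue;

        public class ExternalSort {
            static final int MAX_LINES = 100_000; // placeholder: tune to the heap you can spare

            // Phase 1: sort memory-sized chunks and spill each as a sorted run on disk.
            static List<Path> splitAndSort(Path input) throws IOException {
                List<Path> runs = new ArrayList<>();
                try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                    List<String> chunk = new ArrayList<>();
                    String line;
                    while ((line = in.readLine()) != null) {
                        chunk.add(line);
                        if (chunk.size() >= MAX_LINES) { runs.add(writeRun(chunk)); chunk.clear(); }
                    }
                    if (!chunk.isEmpty()) runs.add(writeRun(chunk));
                }
                return runs;
            }

            static Path writeRun(List<String> chunk) throws IOException {
                Collections.sort(chunk);
                Path run = Files.createTempFile("sort-run-", ".tmp");
                Files.write(run, chunk, StandardCharsets.UTF_8);
                return run;
            }

            // Phase 2: k-way merge; the heap holds one pending line per run.
            static void merge(List<Path> runs, Path output) throws IOException {
                List<BufferedReader> readers = new ArrayList<>();
                PriorityQueue<Object[]> heap =
                        new PriorityQueue<>(Comparator.comparing((Object[] e) -> (String) e[0]));
                try {
                    for (int i = 0; i < runs.size(); i++) {
                        readers.add(Files.newBufferedReader(runs.get(i), StandardCharsets.UTF_8));
                        String first = readers.get(i).readLine();
                        if (first != null) heap.add(new Object[] { first, i });
                    }
                    try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                        while (!heap.isEmpty()) {
                            Object[] top = heap.poll(); // smallest pending line across all runs
                            out.write((String) top[0]);
                            out.newLine();
                            String next = readers.get((int) top[1]).readLine();
                            if (next != null) heap.add(new Object[] { next, top[1] });
                        }
                    }
                } finally {
                    for (BufferedReader r : readers) r.close();
                    for (Path run : runs) Files.deleteIfExists(run);
                }
            }

            public static void main(String[] args) throws IOException {
                merge(splitAndSort(Paths.get(args[0])), Paths.get(args[1]));
            }
        }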
