Sort a file with a huge volume of data given a memory constraint

暖寄归人 2020-11-28 21:47

Points:

  • We process thousands of flat files in a day, concurrently.
  • Memory constraint is a major issue.
  • We use one thread per file, roughly as sketched below.
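
A minimal sketch of that setup (Java; processFile is a hypothetical stand-in for the per-file work): a fixed-size pool caps how many files are in flight at once, so memory use scales with the pool size rather than with the number of files.

    import java.nio.file.Path;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class FileWorkers {
        public static void main(String[] args) throws InterruptedException {
            // Bound concurrency so (pool size x per-file memory) stays inside the heap.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Path> files = List.of(Path.of("a.dat"), Path.of("b.dat")); // placeholder inputs
            for (Path file : files) {
                pool.submit(() -> processFile(file));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        // Hypothetical per-file work: parse, sort, and write out one flat file.
        static void processFile(Path file) {
            // ...
        }
    }
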
12 Answers
  •  粉色の甜心 (2020-11-28 22:17)

    I would spin up an EC2 cluster and run Hadoop's MergeSort.

    Edit: not sure how much detail you would like, or on what. EC2 is Amazon's Elastic Compute Cloud; it lets you rent virtual servers by the hour at low cost. Their website is https://aws.amazon.com/ec2/.

    Hadoop is an open-source MapReduce framework designed for parallel processing of large data sets. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (i.e., the divide-and-conquer strategy). Its website is https://hadoop.apache.org/.
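
    To make that concrete, here is a minimal sketch of a Hadoop sort job (class and path names are placeholders): the mapper emits each line as a key, and the shuffle phase sorts keys before they reach the default identity reducer, so a single reducer writes one globally sorted file.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class SortJob {
            // Emit each input line as the key; Hadoop sorts map output keys in the shuffle.
            public static class LineMapper extends Mapper<Object, Text, Text, NullWritable> {
                @Override
                protected void map(Object offset, Text line, Context context)
                        throws java.io.IOException, InterruptedException {
                    context.write(line, NullWritable.get());
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "sort");
                job.setJarByClass(SortJob.class);
                job.setMapperClass(LineMapper.class);
                // One reducer yields a single globally sorted output file;
                // with more reducers you would add a TotalOrderPartitioner.
                job.setNumReduceTasks(1);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(NullWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }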

    As mentioned by the other posters, external sorting is also a good strategy. I would decide between the two based on the size of the data and the speed requirements. A single machine is likely limited to processing one file at a time, since sorting it will use up the available memory, so look into something like EC2 only if you need to process files faster than that.
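
    For reference, a minimal external merge sort sketch (plain Java, assuming text lines sorted lexicographically; the chunk size is a placeholder to tune against your heap). Phase 1 sorts memory-sized chunks and spills each as a temp-file "run"; phase 2 streams the runs back through a k-way merge.

        import java.io.BufferedReader;
        import java.io.BufferedWriter;
        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.Comparator;
        import java.util.List;
        import java.util.PriorityQueue;

        public class ExternalSort {
            static final int MAX_LINES = 100_000; // placeholder: tune to the heap you can spare

            // Phase 1: sort memory-sized chunks and spill each as a sorted run on disk.
            static List<Path> splitAndSort(Path input) throws IOException {
                List<Path> runs = new ArrayList<>();
                try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                    List<String> chunk = new ArrayList<>();
                    String line;
                    while ((line = in.readLine()) != null) {
                        chunk.add(line);
                        if (chunk.size() >= MAX_LINES) { runs.add(writeRun(chunk)); chunk.clear(); }
                    }
                    if (!chunk.isEmpty()) runs.add(writeRun(chunk));
                }
                return runs;
            }

            static Path writeRun(List<String> chunk) throws IOException {
                Collections.sort(chunk);
                Path run = Files.createTempFile("sort-run-", ".tmp");
                Files.write(run, chunk, StandardCharsets.UTF_8);
                return run;
            }

            // Phase 2: k-way merge; the heap holds one pending line per run.
            static void merge(List<Path> runs, Path output) throws IOException {
                List<BufferedReader> readers = new ArrayList<>();
                PriorityQueue<Object[]> heap =
                        new PriorityQueue<>(Comparator.comparing((Object[] e) -> (String) e[0]));
                try {
                    for (int i = 0; i < runs.size(); i++) {
                        readers.add(Files.newBufferedReader(runs.get(i), StandardCharsets.UTF_8));
                        String first = readers.get(i).readLine();
                        if (first != null) heap.add(new Object[] { first, i });
                    }
                    try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                        while (!heap.isEmpty()) {
                            Object[] top = heap.poll(); // smallest pending line across all runs
                            out.write((String) top[0]);
                            out.newLine();
                            String next = readers.get((int) top[1]).readLine();
                            if (next != null) heap.add(new Object[] { next, top[1] });
                        }
                    }
                } finally {
                    for (BufferedReader r : readers) r.close();
                    for (Path run : runs) Files.deleteIfExists(run);
                }
            }

            public static void main(String[] args) throws IOException {
                merge(splitAndSort(Paths.get(args[0])), Paths.get(args[1]));
            }
        }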
