Binary Search - How to load +5M records from a file into Range<int>[] array?

Submitted by 痴心易碎 on 2019-12-13 09:07:48

Question


This question is a follow-up to my previous question about binary search (Fast, in-memory range lookup against +5M record table).

I have a sequential text file with over 5M records/lines in the format below. I need to load it into a Range<int>[] array. How would one do that in a timely fashion?

File format:

start int64,end int64,result int
start int64,end int64,result int
start int64,end int64,result int
start int64,end int64,result int
...

Answer 1:


I'm going to assume you have a good disk. Scan through the file once and count the number of entries. If you can guarantee your file has no blank lines, then you can just count the number of newlines in it -- don't actually parse each line.

Now you can allocate your array once with exactly that many entries. This avoids excessive re-allocations of the array:

// Requires: using System.IO; using System.Linq;
var numEntries = File.ReadLines(filepath).Count();
var result = new Range<int>[numEntries];
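
If the file really has no blank lines, the "just count the newlines" idea could look roughly like the sketch below. The class and method names (LineCounter, CountNewlines) are made up for illustration, and the code assumes an ASCII or UTF-8 file, where the byte 0x0A only ever means a line break.

```csharp
using System.IO;

static class LineCounter
{
    // Counts '\n' bytes without decoding text or splitting lines.
    // Assumption: ASCII/UTF-8 content and no blank lines in the file.
    public static int CountNewlines(string path)
    {
        var buffer = new byte[1 << 16]; // 64 KB read buffer
        var count = 0;
        using (var stream = File.OpenRead(path))
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (var i = 0; i < read; i++)
                {
                    if (buffer[i] == (byte)'\n') count++;
                }
            }
        }
        return count;
    }
}
```

Note that if the last line does not end with a newline, this returns one less than the number of entries, so you would need to add one in that case.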

Now read the file again and create your range objects with code something like:

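var i = 0;
foreach (var line in File.ReadLines(filepath))
{
   // Each line has the form "start,end,result".
   var parts = line.Split(',');
   // Range<int> comes from the previous question; its constructor is assumed to take (start, end, result).
   result[i++] = new Range<int>(long.Parse(parts[0]), long.Parse(parts[1]), int.Parse(parts[2]));
}

return result;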

Sprinkle in some error handling as desired. This code is easy to understand. Try it out in your target environment. If it is too slow, then you can start optimizing it. I wouldn't optimize prematurely though because that will lead to much more complex code that might not be needed.
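
As one possible shape for that error handling, here is a minimal sketch of a parser that reports malformed lines instead of throwing. LineParser and TryParseLine are hypothetical names, and the start <= end check is an assumption about what counts as a valid range in your data.

```csharp
using System.Globalization;

static class LineParser
{
    // Parses "start,end,result". Returns false on malformed input rather than throwing.
    public static bool TryParseLine(string line, out long start, out long end, out int result)
    {
        start = 0; end = 0; result = 0;
        if (string.IsNullOrWhiteSpace(line)) return false;

        var parts = line.Split(',');
        if (parts.Length != 3) return false;

        // Invariant culture keeps parsing independent of the machine's locale.
        return long.TryParse(parts[0], NumberStyles.Integer, CultureInfo.InvariantCulture, out start)
            && long.TryParse(parts[1], NumberStyles.Integer, CultureInfo.InvariantCulture, out end)
            && int.TryParse(parts[2], NumberStyles.Integer, CultureInfo.InvariantCulture, out result)
            && start <= end; // assumed validity rule
    }
}
```

Inside the loop you could then skip or log any line for which TryParseLine returns false, keeping in mind that skipped lines leave the pre-sized array with unused slots at the end.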




Answer 2:


This is a typical (?) producer-consumer problem which can be solved using multiple threads. In your case the producer is reading data from disk and the consumer is parsing the lines and populating the array. I can see two different cases:

  • Producer is (much) faster than the consumer: in this case you should try using more consumer threads;
  • Consumer is (much) faster than the producer: you can't do much to speed things up other than changing your hardware configuration, such as buying a faster hard disk or using RAID 0. In this case I wouldn't even use a multithreaded solution because it's not worth the added complexity.

This question might help you implement that in C#.
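
For the first case (a fast producer with several consumers), a rough sketch using BlockingCollection from the .NET class library might look like this. ParallelLoader, Load, and the tuple standing in for Range<int> are illustrative assumptions, not code from the linked question; count is assumed to be the entry count determined beforehand, and the tuple syntax needs C# 7 or later.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class ParallelLoader
{
    // One producer enqueues (index, line) pairs; several consumers parse them.
    // 'lines' would typically be File.ReadLines(path).
    public static (long Start, long End, int Result)[] Load(
        IEnumerable<string> lines, int count, int consumers)
    {
        var output = new (long Start, long End, int Result)[count];
        using (var queue = new BlockingCollection<KeyValuePair<int, string>>(10000))
        {
            var producer = Task.Run(() =>
            {
                try
                {
                    var i = 0;
                    foreach (var line in lines)
                        queue.Add(new KeyValuePair<int, string>(i++, line));
                }
                finally
                {
                    queue.CompleteAdding(); // lets the consumers finish
                }
            });

            var workers = Enumerable.Range(0, consumers).Select(_ => Task.Run(() =>
            {
                foreach (var item in queue.GetConsumingEnumerable())
                {
                    var parts = item.Value.Split(',');
                    // The index keeps results in file order regardless of which worker parses.
                    output[item.Key] = (long.Parse(parts[0]), long.Parse(parts[1]), int.Parse(parts[2]));
                }
            })).ToArray();

            Task.WaitAll(workers);
            producer.Wait();
        }
        return output;
    }
}
```

Queueing one line at a time adds synchronization overhead that can outweigh the parsing cost, so it is worth measuring against the single-threaded loop first; handing lines over in batches is a common way to reduce that overhead. The sketch also does not handle a consumer failing mid-run, which would need cancellation in real code.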



Source: https://stackoverflow.com/questions/15276164/binary-search-how-to-load-5m-records-from-a-file-into-rangeint-array
