Binary search over a huge file with unknown line length

元气小坏坏 提交于 2019-12-06 01:20:33

You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:

bytes = 24935502 # number of entries
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, bytes - 1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid)
    # now realign to a record
    if mid:
        fin.readline()
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search
Googol

To resolve it, you also can use binary search, but you need to change it a bit:

  1. Get the file size.
  2. Seek to the middle of size, with File.seek.
  3. And search the first EOL character. Then you find a new line.
  4. Check this line's key and if not what you want, update size and go to 2.

Here is a sample code:

fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()

while (begin < end):
    fp.seek((end + begin) / 2, 0)
    fp.readline()
    line_key = get_key(fp.readline())
    if (key == line_key):
        pass # find what you want
    elif (key > line_key):
        begin = fp.tell()
    else:
        end = fp.tell()

Maybe the code has bugs. Verify yourself. And please check the performance if you really want a fastest way.

The answer on the referenced question that says binary search only works with fixed-length records is wrong. And you don't need to do a search at all, since you have multiple items to look up. Just walk through the entire file one line at a time, build a dictionary of key:offset for each line, and then for each of your search items jump to the record of interest using os.lseek on the offset corresponding to each key.

Of course, if you don't want to read the entire file even once, you'll have to do a binary search. But if building the index can be amortized over several lookups, perhaps saving the index if you only do one lookup per day, then a search is unnecessary.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!