Binary search over a huge file with unknown line length

荒凉一梦 提交于 2019-12-10 10:38:28

问题


I'm working with huge data CSV file. Each file contains milions of record , each record has a key. The records are sorted by thier key. I dont want to go over the whole file when searching for certian data. I've seen this solution : Reading Huge File in Python

But it suggests that you use the same length of lines on the file - which is not supported in my case.

I thought about adding a padding to each line and then keeping a fixed line length , but I'd like to know if there is a better way to do it.

I'm working with python


回答1:


You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:

bytes = 24935502 # number of entries
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, bytes - 1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid)
    # now realign to a record
    if mid:
        fin.readline()
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search



回答2:


To resolve it, you also can use binary search, but you need to change it a bit:

  1. Get the file size.
  2. Seek to the middle of size, with File.seek.
  3. And search the first EOL character. Then you find a new line.
  4. Check this line's key and if not what you want, update size and go to 2.

Here is a sample code:

fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()

while (begin < end):
    fp.seek((end + begin) / 2, 0)
    fp.readline()
    line_key = get_key(fp.readline())
    if (key == line_key):
        pass # find what you want
    elif (key > line_key):
        begin = fp.tell()
    else:
        end = fp.tell()

Maybe the code has bugs. Verify yourself. And please check the performance if you really want a fastest way.




回答3:


The answer on the referenced question that says binary search only works with fixed-length records is wrong. And you don't need to do a search at all, since you have multiple items to look up. Just walk through the entire file one line at a time, build a dictionary of key:offset for each line, and then for each of your search items jump to the record of interest using os.lseek on the offset corresponding to each key.

Of course, if you don't want to read the entire file even once, you'll have to do a binary search. But if building the index can be amortized over several lookups, perhaps saving the index if you only do one lookup per day, then a search is unnecessary.



来源:https://stackoverflow.com/questions/8369175/binary-search-over-a-huge-file-with-unknown-line-length

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!