Question
I'm trying to search rather big files for a certain string and return its offset. I'm new to Lua, and my current approach looks like this:
linenumber = 0
for line in io.lines(filepath) do
  result = string.find(line, "ABC", 1)
  linenumber = linenumber + 1
  if result ~= nil then
    offset = linenumber*4096 + result
    break -- stop at the first match
  end
end
I realize that this way is rather primitive and certainly slow. How could I do this more efficiently?
Thanks in advance.
Answer 1:
If the file is not too big and you can spare the memory, it's faster to just slurp in the whole file and use string.find. If not, you can search the file block by block.
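A minimal sketch of the slurp approach, assuming the file fits comfortably in memory. The helper name `find_in_file` is hypothetical; the third and fourth arguments to `find` request a plain-text search, so characters such as `.` or `%` in the search string are not treated as Lua pattern syntax.

```lua
-- Hypothetical helper: returns the 1-based byte offset of `needle` in the
-- file at `path`, or nil if it does not occur.
local function find_in_file(path, needle)
  local fh = assert(io.open(path, "rb"))  -- binary mode, so bytes are not translated
  local data = fh:read("*a")               -- read the whole file into one string
  fh:close()
  -- init = 1, plain = true: treat `needle` as literal text, not as a pattern
  return (data:find(needle, 1, true))
end
```

Calling `find_in_file(filepath, "ABC")` would then give the offset directly, with no block arithmetic to get wrong.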
Your approach isn't all that bad. I'd suggest loading the file in overlapping blocks though. The overlap avoids having the pattern split just between the blocks and going unnoticed like:
".... ...A BC.. ...."
My implementation goes like this:
size=4096 -- note, size should be bigger than the length of pat to work.
pat="ABC"
overlap=#pat
fh=assert(io.open(filepath,'rb')) -- On Windows, do NOT forget the b
block=fh:read(size+overlap)
n=0
while block do
  block_offset=block:find(pat,1,true) -- plain search, pat is not treated as a Lua pattern
  if block_offset then
    offset=block_offset+size*n -- block n starts at byte size*n of the file
    break
  end
  if #block<size+overlap then break end -- short read: end of file, stop instead of looping forever
  fh:seek('cur',-overlap) -- step back so the next block overlaps this one
  block=fh:read(size+overlap)
  n=n+1
end
fh:close()
if offset then
  print('found pattern at', offset, 'after reading',n+1,'blocks')
else
  print('did not find pattern')
end
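To see why the overlap in the block search above matters, here is a small sketch that works on an in-memory string instead of a file. The function `chunked_find` and the sample data are made up for illustration; the chunk size of 8 is chosen so that the first chunk ends between "B" and "C".

```lua
local data = "......ABC..."   -- "ABC" occupies bytes 7-9
local pat, size = "ABC", 8    -- with size 8, the first chunk ends after "AB"

-- Search `s` in chunks of `chunk_size` bytes, each extended by `overlap`
-- extra bytes. Returns the 1-based offset of the first match, or nil.
local function chunked_find(s, needle, chunk_size, overlap)
  for start = 1, #s, chunk_size do
    local chunk = s:sub(start, start + chunk_size - 1 + overlap)
    local pos = chunk:find(needle, 1, true)  -- plain search
    if pos then return start + pos - 1 end
  end
  return nil
end

print(chunked_find(data, pat, size, 0))         -- prints nil: the match is split and missed
print(chunked_find(data, pat, size, #pat - 1))  -- prints 7
```

An overlap of `#pat - 1` bytes is the smallest that guarantees a match cannot straddle two chunks; the `#pat` used in the answer is one byte more than needed and works equally well.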
If your file really has lines, you can also use the trick explained here. This section in the Programming in Lua book explains some performance considerations when reading files.
Answer 2:
Unless your lines all have the same length (4096), I don't see how your code can work.
Instead of using io.lines, read blocks with io.read(4096). The rest of your code can be used as is, except that you need to handle the case where your string is not fully inside a block. If the file is composed of lines, then a trick mentioned in Programming in Lua is to do io.read(4096,"*l"), which reads blocks that end at line boundaries. Then you don't have to worry about strings not fully inside a block, but you need to adjust the offset calculation to use the actual length of each block, not just 4096.
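The line-boundary trick could be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's code: `find_by_line_blocks` is a hypothetical name, the file is assumed to be line-oriented, and the search string is assumed to contain no newline, so a match can never span two blocks. It uses a file handle rather than the default input, and keeps a running byte count so the offset reflects each block's real length.

```lua
-- Hypothetical helper: 1-based byte offset of `needle` in the file, or nil.
-- Assumes `needle` contains no newline.
local function find_by_line_blocks(path, needle, size)
  size = size or 4096
  local fh = assert(io.open(path, "rb"))
  local base = 0                             -- bytes consumed before the current block
  local offset
  while true do
    -- read `size` bytes, then the remainder of the line the block ended in
    local block, rest = fh:read(size, "*l")
    if not block then break end              -- end of file
    if rest then block = block .. rest .. "\n" end  -- "*l" drops the newline; restore it
    local pos = block:find(needle, 1, true)  -- plain search
    if pos then
      offset = base + pos
      break
    end
    base = base + #block                     -- advance by the real block length, not `size`
  end
  fh:close()
  return offset
end
```

Because each block is extended to the end of its last line, block lengths vary, which is why the offset is built from the accumulated `#block` values rather than from a multiple of 4096.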
Source: https://stackoverflow.com/questions/8907949/return-offset-of-a-string-with-lua