I have a Python script which reads a file line by line and checks whether each line matches a regular expression. I would like to improve the performance of that script by using memory mapping (mmap).
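For reference, this is roughly what the script does now (the file name and pattern here are just placeholders):

import re

# Placeholder pattern and file name, for illustration only
pattern = re.compile(r'foo bar')

with open('huge_file.log', 'r') as fileh:
    for line in iter(fileh.readline, ''):  # readline() returns '' at EOF
        if pattern.search(line):
            print(line, end='')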
I came to try mmap because I was using fileh.readline() on a file dozens of GB in size and wanted to make it faster. The Unix strace utility reveals that the file is currently read in 4 kB chunks; the output from strace scrolls by slowly, and I know parsing the file takes many hours.
$ strace -v -f -p 32495
Process 32495 attached
read(5, "blah blah blah foo bar xxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
^CProcess 32495 detached
$
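For comparison, this is roughly what I tried with mmap (pattern and file name are again placeholders); mmap objects expose readline() too, so the loop barely changes:

import mmap
import re

pattern = re.compile(rb'foo bar')  # bytes pattern, since mmap yields bytes

with open('huge_file.log', 'rb') as f:  # mmap works on the raw fd, hence binary mode
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(mm.readline, b''):  # readline() returns b'' at EOF
        if pattern.search(line):
            print(line)
    mm.close()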
This thread is so far the only one explaining to me that I should not try to mmap a file that is too large. I do not understand why there isn't already a helper function like mmap_for_dummies(filename) which would internally call os.path.getsize(filename) and then either do a normal open(filename, 'r', buffering=10*1024*1024) or do mmap.mmap(open(filename).fileno(), 0). I certainly want to avoid fiddling with a sliding-window approach myself; a function making the simple decision whether to mmap or not would be enough for me.
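To make the idea concrete, here is a rough sketch of the helper I have in mind; the 1 GiB cutoff is an arbitrary guess, not a recommendation:

import mmap
import os

def mmap_for_dummies(filename, threshold=1024**3):
    # Hypothetical helper, not a stdlib function
    if os.path.getsize(filename) > threshold:
        # File too large to map comfortably: plain file object, big buffer
        return open(filename, 'rb', buffering=10 * 1024 * 1024)
    with open(filename, 'rb') as f:
        # Map the whole file read-only; the mapping stays valid
        # after the file object itself is closed
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)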
Finally, it is still not clear to me why some examples on the internet mention open(filename, 'rb') without explanation (e.g. https://docs.python.org/2/library/mmap.html). Given that one often wants to use the file in a for loop with .readline() calls, I do not know whether I should open in 'rb' or just 'r' mode (I guess the 'b' is necessary to preserve the '\n').
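As far as I understand, the 'b' matters because the mmap yields raw bytes anyway, and binary mode avoids any newline translation the platform might do in text mode; e.g.:

import mmap

with open('huge_file.log', 'rb') as f:  # 'rb': no newline translation
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    line = mm.readline()  # bytes, with the trailing b'\n' preserved
    print(repr(line))
    mm.close()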
Thanks for mentioning the buffering=10*1024*1024 argument; it is probably more helpful than restructuring my code to gain some speed.
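That is, leaving the loop untouched and only enlarging the read buffer, something like:

# Bigger userspace buffer: far fewer read() syscalls than the small default
with open('huge_file.log', 'r', buffering=10 * 1024 * 1024) as fileh:
    for line in fileh:
        pass  # same regex matching as before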