Why doesn't Python's mmap work with large files?

Submitted by 落花浮王杯 on 2019-11-27 07:11:47

From IEEE 1003.1:

The mmap() function shall establish a mapping between a process' address space and a file, shared memory object, or [TYM] typed memory object.

It needs all the virtual address space because that's exactly what mmap() does.

The fact that it isn't really running out of memory doesn't matter - you can't map more address space than you have available. Since you then take the result and access as if it were memory, how exactly do you propose to access more than 2^32 bytes into the file? Even if mmap() didn't fail, you could still only read the first 4GB before you ran out of space in a 32-bit address space. You can, of course, mmap() a sliding 32-bit window over the file, but that won't necessarily net you any benefit unless you can optimize your access pattern such that you limit how many times you have to visit previous windows.

Sorry to answer my own question, but I think the real problem I had was not realising that mmap is a standard POSIX system call with particular characteristics and limitations, and that the Python mmap module is just supposed to expose its functionality.

The Python documentation doesn't mention the POSIX mmap and so if you come at it as a Python programmer without much knowledge of POSIX (as I did) then the address space problem appears quite arbitrary and badly designed!

Thanks to the other posters for teaching me the true meaning of mmap. Unfortunately no one has suggested a better alternative to my hand-crafted class for treating large files as strings, so I shall have to stick with it for now. Perhaps I will clean it up and make it part of my module's public interface when I get the chance.

A 32-bit program can only address a maximum of 2^32 bytes, i.e. 4GB. There are other factors that make the usable total even smaller; for example, 32-bit Windows by default reserves 2GB of each process's address space for the kernel, and of course your program's own code and data are going to take some space as well.

Edit: The obvious thing you're missing is an understanding of the mechanics of mmap, on any operating system. It allows you to map a portion of a file to a range of memory. Once you've done that, any access to that portion of the file happens with the least possible overhead. It's low overhead because the mapping is done once, and doesn't have to change every time you access a different range. The drawback is that you need a free address range large enough for the portion you're trying to map. If you're mapping the whole file at once, you'll need a hole in the memory map large enough to fit the entire file. If no such hole exists, or the file is bigger than your entire address space, the mapping fails.

The mmap module provides all the tools you need to poke around in your large file, but due to the limitations other folks have mentioned, you can't map it all at once. You can map a good-sized chunk at once, do some processing, then unmap that and map another. The key arguments to the mmap class are length and offset, which do exactly what they sound like: they let you map length bytes, starting at byte offset in the file. Any time you wish to read a section of the file that is outside the mapped window, you have to map in a new window.
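To make the map/process/unmap cycle concrete, here is a minimal sketch of it. The names iter_windows and count_byte and the 64 MiB default window are my own choices for illustration, not anything from the mmap module. The one real constraint it works around is that offset must be a multiple of mmap.ALLOCATIONGRANULARITY, so the window size is rounded down to a multiple of that.

```python
import mmap
import os


def iter_windows(path, window_size=64 * 1024 * 1024):
    """Yield (offset, window) pairs that cover the file one mapping at a time."""
    # offset has to be a multiple of the allocation granularity, so keep the
    # window size a multiple of it too (and never smaller than one granule).
    gran = mmap.ALLOCATIONGRANULARITY
    window_size = max(gran, window_size - window_size % gran)
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < file_size:
            length = min(window_size, file_size - offset)
            window = mmap.mmap(f.fileno(), length,
                               access=mmap.ACCESS_READ, offset=offset)
            try:
                yield offset, window
            finally:
                window.close()  # unmap before mapping the next window
            offset += length


def count_byte(path, byte, window_size=64 * 1024 * 1024):
    """Illustrative use: count occurrences of a single byte in the whole file."""
    total = 0
    for _offset, window in iter_windows(path, window_size):
        pos = window.find(byte)
        while pos != -1:
            total += 1
            pos = window.find(byte, pos + 1)
    return total
```

A single-byte search is used on purpose: a longer pattern could straddle two windows, and handling that (for example by overlapping the windows) is left out of this sketch.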

The point you are missing is that mmap is a memory mapping function that maps a file into memory for arbitrary access across the requested data range by any means.

What you are looking for sounds more like some sort of data window class that presents an API allowing you to look at small windows of a large data structure at any one time. Access beyond the bounds of this window would not be possible other than by calling the data window's own API.

This is fine, but it is not a memory map. It is something that offers the advantage of a wider data range at the cost of a more restrictive API.
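For what it's worth, a data window class along those lines might look roughly like the sketch below. LargeFileView is a made-up name and this is not the asker's hand-crafted class (which isn't shown in the thread). It is just one way to get bytes-like indexing with plain seek() and read(), so it never needs more address space than the slice you ask for.

```python
import os


class LargeFileView:
    """Hypothetical read-only view that lets you index a file like bytes."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # Kept as an attribute rather than __len__, because len() cannot
        # return values above sys.maxsize on a 32-bit build.
        self.size = os.path.getsize(path)

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(self.size)
            if step != 1:
                raise ValueError("only contiguous slices are supported")
            self._file.seek(start)
            return self._file.read(max(0, stop - start))
        if key < 0:
            key += self.size
        if not 0 <= key < self.size:
            raise IndexError("index out of range")
        self._file.seek(key)
        return self._file.read(1)[0]  # an int, as with bytes indexing

    def close(self):
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
```

Usage would be something like `with LargeFileView(path) as view: chunk = view[start:stop]`. The trade-off is the one described above: every access goes through the class's own API and costs a seek and a read, instead of being a plain memory access.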

You're setting the length parameter to zero, which means mapping the entire file. On a 32-bit build, this won't be possible if the file length is more than 2GB (possibly 4GB).
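If you want to check up front whether mapping with length 0 even has a chance, something like the following sketch works. The function names are mine, and the test is only a necessary condition: a file that passes can still fail to map if there is no contiguous free range of that size.

```python
import os
import sys


def is_64bit_python():
    """True when the interpreter itself is a 64-bit build."""
    # sys.maxsize is 2**31 - 1 on a 32-bit build and 2**63 - 1 on a 64-bit one.
    return sys.maxsize > 2**32


def can_map_whole_file(path):
    """Rough check: does the file size fit in what this build can index?"""
    return os.path.getsize(path) <= sys.maxsize
```

If can_map_whole_file() returns False, pass an explicit length and offset and map the file a piece at a time instead.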

RGD2

Use a 64-bit computer with a 64-bit OS and a 64-bit Python implementation, or avoid mmap().

mmap() needs hardware support from the CPU to make sense with files larger than a few GiB.

It uses the CPU's MMU and interrupt subsystems to expose the data as if it were already loaded into RAM.

The MMU is hardware which will generate an interrupt whenever an address corresponding to data not in physical RAM is accessed, and the OS will handle the interrupt in a way that makes sense at runtime, so the accessing code never knows (or needs to know) that the data doesn't fit in RAM.

This makes your accessing code simple to write. However, to use mmap() this way, everything involved will need to handle 64-bit addresses.

Or else it may be preferable to avoid mmap() altogether and do your own memory management.

You ask the OS to map the entire file into a memory range. Nothing is read until you trigger page faults by reading or writing, but the OS still needs to make sure the entire range is available to your process, and if that range is too big, the mapping will fail.
