numpy.memmap: bogus memory allocation


Question


I have a python3 script that operates on numpy.memmap arrays. It writes an array to a newly generated temporary file located in /tmp:

import numpy, tempfile

size = 2 ** 37 * 10
tmp = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp.name, dtype = 'i8', mode = 'w+', shape = size)
array[0] = 666
array[size-1] = 777
del array
array2 = numpy.memmap(tmp.name, dtype = 'i8', mode = 'r+', shape = size)
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.\
      format(tmp.name, len(array2), array2[0], array2[size-1]))
while True:
    pass

The size of the HDD is only 250G. Nevertheless, the script somehow manages to generate a 10T file in /tmp, and the corresponding array still seems to be accessible. The output of the script is the following:

File: /tmp/tmptjfwy8nr. Array size: 1374389534720. First cell value: 666. Last cell value: 777

The file really exists and is displayed as being 10T large:

$ ls -l /tmp/tmptjfwy8nr
-rw------- 1 user user 10995116277760 Dec  1 15:50 /tmp/tmptjfwy8nr

However, the whole size of /tmp is much smaller:

$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       235G  5.3G  218G   3% /

The process also appears to be using 10T of virtual memory, which should not be possible either. The output of the top command:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND 
31622 user      20   0 10.000t  16592   4600 R 100.0  0.0   0:45.63 python3

As far as I understand, this means that during the call to numpy.memmap the memory needed for the whole array is not actually allocated, and therefore the displayed file size is bogus. This in turn means that when I start gradually filling the whole array with my data, at some point my program will crash or my data will be corrupted.

Indeed, if I introduce the following in my code:

for i in range(size):
    array[i] = i

I get the error after a while:

Bus error (core dumped)

Therefore, the question: how can I check at the beginning whether there is really enough space for the data, and then actually reserve that space for the whole array?


Answer 1:


There's nothing 'bogus' about the fact that you are generating 10 TB files

You are asking for arrays of size

2 ** 37 * 10 = 1374389534720 elements

A dtype of 'i8' means an 8 byte (64 bit) integer, therefore your final array will have a size of

1374389534720 * 8 = 10995116277760 bytes

or

10995116277760 / 1E12 = 10.99511627776 TB


If you only have 250 GB of free disk space then how are you able to create a "10 TB" file?

Assuming that you are using a reasonably modern filesystem, your OS will be capable of generating almost arbitrarily large sparse files, regardless of whether or not you actually have enough physical disk space to back them.

For example, on my Linux machine I'm allowed to do something like this:

# I only have about 50GB of free space...
~$ df -h /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/sdb1      ext4  459G  383G   53G  88% /

~$ dd if=/dev/zero of=sparsefile bs=1 count=0 seek=10T
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000236933 s, 0.0 kB/s

# ...but I can still generate a sparse file that reports its size as 10 TB
~$ ls -lah sparsefile
-rw-rw-r-- 1 alistair alistair 10T Dec  1 21:17 sparsefile

# however, this file uses zero bytes of "actual" disk space
~$ du -h sparsefile
0       sparsefile

Try calling du -h on your np.memmap file after it has been initialized to see how much actual disk space it uses.
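You can make the same comparison from Python: os.stat reports both the apparent file size (st_size) and the number of blocks actually backed by disk (st_blocks, conventionally in 512-byte units on Linux). The following is only a minimal sketch; the file name example.mmap is a placeholder:

import os
import numpy as np

# Create a memmap-backed file; numpy only writes a single byte at the end,
# so on most filesystems the file starts out sparse.
fname = 'example.mmap'
arr = np.memmap(fname, dtype='i8', mode='w+', shape=2 ** 20)  # ~8 MB apparent size

st = os.stat(fname)
print('Apparent size (ls -l):  ', st.st_size, 'bytes')
print('Actually allocated (du):', st.st_blocks * 512, 'bytes')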

As you start actually writing data to your np.memmap file, everything will be OK until you exceed the physical capacity of your storage, at which point the process will terminate with a Bus error. This means that if you needed to write < 250GB of data to your np.memmap array then there might be no problem (in practice this would probably also depend on where you are writing within the array, and on whether it is row or column major).


How is it possible for a process to use 10 TB of virtual memory?

When you create a memory map, the kernel allocates a new block of addresses within the virtual address space of the calling process and maps them to a file on your disk. The amount of virtual memory that your Python process is using will therefore increase by the size of the file that has just been created. Since the file can also be sparse, then not only can the virtual memory exceed the total amount of RAM available, but it can also exceed the total physical disk space on your machine.
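You can observe this directly. The sketch below is Linux-only (it parses /proc/self/status, which is not available on other platforms) and uses an arbitrary file name:

import numpy as np

def vm_size_kb():
    # Linux-specific: read the process's virtual memory size from /proc.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmSize:'):
                return int(line.split()[1])  # reported in kB

print('VmSize before mapping:', vm_size_kb(), 'kB')
arr = np.memmap('bigmap.mmap', dtype='i8', mode='w+', shape=2 ** 30)  # ~8 GB mapping
print('VmSize after mapping: ', vm_size_kb(), 'kB')  # roughly 8 GB larger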


How can you check whether you have enough disk space to store the full np.memmap array?

I'm assuming that you want to do this programmatically in Python.

  1. Get the amount of free disk space available. There are various methods given in the answers to this previous SO question. One option is os.statvfs:

    import os
    
    def get_free_bytes(path='/'):
        st = os.statvfs(path)
        return st.f_bavail * st.f_bsize
    
    print(get_free_bytes())
    # 56224485376
    
  2. Work out the size of your array in bytes:

    import numpy as np
    
    def check_asize_bytes(shape, dtype):
        return np.prod(shape) * np.dtype(dtype).itemsize
    
    print(check_asize_bytes((2 ** 37 * 10,), 'i8'))
    # 10995116277760
    
  3. Check whether 2. > 1.
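For example, putting 1. and 2. together into a single guard (a sketch only; the path, file name and safety margin are arbitrary choices):

import os
import numpy as np

def enough_space_for(shape, dtype, path='/tmp', margin=1.1):
    # Free bytes on the filesystem containing `path` (step 1)...
    st = os.statvfs(path)
    free = st.f_bavail * st.f_bsize
    # ...compared with the array size in bytes (step 2), plus a safety margin.
    needed = int(np.prod(shape) * np.dtype(dtype).itemsize * margin)
    return free >= needed

shape = (2 ** 27,)  # 1 GB of 'i8' -- small enough to test safely
if enough_space_for(shape, 'i8', path='/tmp'):
    array = np.memmap('/tmp/data.mmap', dtype='i8', mode='w+', shape=shape)
else:
    raise MemoryError('not enough free disk space for the array')

Note that this only checks that the space is available at creation time; it does not reserve it, which is what the approach below addresses.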


Update: Is there a 'safe' way to allocate an np.memmap file, which guarantees that sufficient disk space is reserved to store the full array?

One possibility might be to use fallocate to pre-allocate the disk space, e.g.:

~$ fallocate -l 1G bigfile

~$ du -h bigfile
1.1G    bigfile

You could call this from Python, for example using subprocess.check_call:

import numpy as np
import subprocess

def fallocate(fname, length):
    return subprocess.check_call(['fallocate', '-l', str(length), fname])

def safe_memmap_alloc(fname, dtype, shape, *args, **kwargs):
    nbytes = np.prod(shape) * np.dtype(dtype).itemsize
    fallocate(fname, nbytes)
    return np.memmap(fname, dtype, *args, shape=shape, **kwargs)

mmap = safe_memmap_alloc('test.mmap', np.int64, (1024, 1024))

print(mmap.nbytes / 1E6)
# 8.388608

print(subprocess.check_output(['du', '-h', 'test.mmap']))
# 8.0M    test.mmap

I'm not aware of a platform-independent way to do this using the standard library, but there is a fallocate Python module on PyPI that should work for any Posix-based OS.
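As an alternative to shelling out, on POSIX systems the standard library's os.posix_fallocate (Python 3.3+) can reserve the space directly; a minimal sketch under that assumption:

import os
import numpy as np

def safe_memmap_alloc_posix(fname, dtype, shape, mode='r+'):
    nbytes = int(np.prod(shape) * np.dtype(dtype).itemsize)
    # Create the file and ask the OS to reserve nbytes up front;
    # raises OSError (e.g. ENOSPC) if the space cannot be allocated.
    with open(fname, 'wb') as f:
        os.posix_fallocate(f.fileno(), 0, nbytes)
    return np.memmap(fname, dtype=dtype, mode=mode, shape=shape)

mmap = safe_memmap_alloc_posix('test_posix.mmap', np.int64, (1024, 1024))
print(mmap.nbytes / 1E6)
# 8.388608

On filesystems without native preallocation support this may fall back to writing zeros, which is slower but still reserves the space; in any case it is POSIX-only, so it is not a platform-independent solution either.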




Answer 2:


Based on the answer of @ali_m I finally came to this solution:

# must be called with an argument giving the array size in GB
import sys, numpy, tempfile, subprocess

size = (2 ** 27) * int(sys.argv[1])
tmp_primary = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp_primary.name, dtype = 'i8', mode = 'w+', shape = size)
tmp = tempfile.NamedTemporaryFile('w+')
# capture stderr so that cp failures can be detected below
check = subprocess.Popen(['cp', '--sparse=never', tmp_primary.name, tmp.name],
                         stderr=subprocess.PIPE)
stdout, stderr = check.communicate()
if stderr:
    sys.stderr.write(stderr.decode('utf-8'))
    sys.exit(1)
del array
tmp_primary.close()
array = numpy.memmap(tmp.name, dtype = 'i8', mode = 'r+', shape = size)
array[0] = 666
array[size-1] = 777
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.\
      format(tmp.name, len(array), array[0], array[size-1]))
while True:
    pass

The idea is to copy the initially generated sparse file to a new, non-sparse one. For this, cp with the option --sparse=never is employed.

When the script is called with a manageable size parameter (say, 1 GB), the array gets mapped to a non-sparse file. This is confirmed by the output of the du -h command, which now shows a size of ~1 GB. If there is not enough space, the script exits with the error:

cp: ‘/tmp/tmps_thxud2’: write failed: No space left on device


Source: https://stackoverflow.com/questions/34023665/numpy-memmap-bogus-memory-allocation
