Garbage collection of shared data in multiprocessing via fork

Submitted by a 夏天 on 2019-12-11 17:50:05

Question


I am doing some multiprocessing on Linux, using shared memory that is currently not passed explicitly to the child processes (that is, not via an argument).

In the "Programming guidelines" of the official Python multiprocessing documentation, the section "Explicitly pass resources to child processes" says:

On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.... this ... ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.

This explanation seems a bit lacking to me.

  1. When should I be worried about garbage collection?
  2. Should I always pass data to a child because otherwise sometimes there will be unexpected results, or is this only a best practice?

Right now I am not experiencing any unexpected garbage collection; however, this situation seems precarious to me.


Answer 1:


This depends strongly on A) your data and B) your multiprocessing start method.

TLDR:

  • With spawn, objects are cloned and each copy is finalised in its own process.
  • With fork/forkserver, objects are shared and finalised in the main process.
  • Some objects respond badly to being finalised in the main process while still in use by child processes.
  • The docs on args are wrong: the content of args is not kept alive by args alone (CPython 3.7.0).

Note: Full code is available as a gist. All output is from CPython 3.7.0 on macOS 10.13.

We start with a simple object that reports where and when it is finalised:

import multiprocessing
import os


def print_pid(*args, **kwargs):  # Process aware print helper
    print('[%s]' % os.getpid(), *args, **kwargs)


class Finalisable:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return '<Finalisable object %s at 0x%x>' % (getattr(self, 'name', 'unknown'), id(self))

    def __del__(self):
        print_pid('finalising', self)

Early collection from args

To test how args works for GC, we can build a process and immediately release its argument reference:

def drop_early():
    payload = Finalisable('early')  # the name shows up in the output below
    child = multiprocessing.Process(target=print_pid, args=('child sees', payload))
    print('drop')
    del payload  # remove sole local reference for `args` content
    print('start')
    child.start()
    child.join()

With the spawn method, the original is collected, but the child has its own copy to finalise:

### test drop_early in 15333 method: spawn
drop
start
[15333] finalising <Finalisable object early at 0x102347390>
[15336] child sees <Finalisable object early at 0x109bd8128>
[15336] finalising <Finalisable object early at 0x109bd8128>
### done

With the fork method, the original is finalised and the child receives this already finalised object:

### test drop_early in 15329 method: fork
drop
start
[15329] finalising <Finalisable object early at 0x108b453c8>
[15331] child sees <Finalisable object early at 0x108b453c8>
### done

This shows that the payload of the main process is finalised before the child process runs and completes! Bottom line, args is not a guard against early collection!
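
One way to guard against this is for the parent to hold its own reference until the child has been joined. The sketch below is not part of the original gist: `Resource`, `child_task` and `run_and_check` are hypothetical names, and it assumes a Unix platform where the fork start method is available. It uses a weak reference to observe whether the parent's object survives `start()`.

```python
import multiprocessing
import weakref


class Resource:
    """Hypothetical stand-in for an object that frees something when finalised."""


def child_task(resource):
    pass  # the child would use the resource here


def run_and_check(keep_reference):
    """Return True if the parent's object is still alive right after start()."""
    ctx = multiprocessing.get_context('fork')  # assumption: fork is available
    resource = Resource()
    alive = weakref.ref(resource)  # lets us observe the object without owning it
    child = ctx.Process(target=child_task, args=(resource,))
    if not keep_reference:
        del resource  # from here on, only the Process object refers to it
    child.start()
    still_alive = alive() is not None
    child.join()
    return still_alive
```

On CPython 3.7 and later, `run_and_check(keep_reference=False)` is expected to return False, because `Process.start()` drops its own references to `target` and `args` once the child has been launched. With `keep_reference=True` the local name keeps the object alive until the function returns, which is after `join()`.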

Early collection of shared objects

Python has some types meant for safe sharing between processes. We can use one of these as our marker as well:

def drop_early_shared():
    payload = Finalisable(multiprocessing.Value('i', 65))
    child = multiprocessing.Process(target=print_pid, args=('child sees', payload,))
    print('drop')
    del payload
    print('start')
    child.start()
    child.join()

With the fork method, the Value is collected early but still functional:

### test drop_early_shared in 15516 method: fork
drop
start
[15516] finalising <Finalisable object <Synchronized wrapper for c_int(65)> at 0x1071a3e10>
[15519] child sees <Finalisable object <Synchronized wrapper for c_int(65)> at 0x1071a3e10>
### done

With the spawn method, the Value is collected early and entirely broken for the child:

### test drop_early_shared in 15520 method: spawn
drop
start
[15520] finalising <Finalisable object <Synchronized wrapper for c_int(65)> at 0x103a16c18>
[15524] finalising <Finalisable object unknown at 0x101aa0128>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/synchronize.py", line 111, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
### done

This shows that finalisation behaviour depends on your object and your environment. Bottom line, do not assume that your object is well-behaved!
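
The "### test ... method: ..." and "### done" lines in the outputs above come from a driver that is not shown here. A minimal sketch of what such a driver might look like follows; `run_test` and `say_hello` are hypothetical names and not taken from the gist. It relies on `multiprocessing.set_start_method` with `force=True`, which permits switching the global start method more than once in the same interpreter.

```python
import multiprocessing
import os


def run_test(test, method):
    """Run `test` with `method` as the global start method (hypothetical driver)."""
    print('### test %s in %s method: %s' % (test.__name__, os.getpid(), method), flush=True)
    # force=True allows the start method to be changed repeatedly
    multiprocessing.set_start_method(method, force=True)
    test()
    print('### done', flush=True)


def say_hello():
    # hypothetical stand-in for drop_early / drop_early_shared
    child = multiprocessing.Process(target=print, args=('hello from child',))
    child.start()
    child.join()
```

A call such as `run_test(say_hello, 'fork')` would then produce one framed block of output. When trying spawn or forkserver, the calls should sit under an `if __name__ == '__main__':` guard, since those methods import the main module again in the child.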


While it is good practice to pass data via args, this does not free the main process from handling it! Objects might respond badly to early finalisation when the main process drops references.
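
A minimal sketch of that division of labour, assuming a Unix platform with the fork start method: the shared value is passed via args, and the parent additionally keeps its own reference until after `join()`. The names `increment` and `run_counter` are illustrative only.

```python
import multiprocessing


def increment(counter):
    # get_lock() guards the read-modify-write on the shared value
    with counter.get_lock():
        counter.value += 1


def run_counter():
    ctx = multiprocessing.get_context('fork')  # assumption: fork is available
    counter = ctx.Value('i', 65)
    child = ctx.Process(target=increment, args=(counter,))
    child.start()
    child.join()
    # `counter` is still referenced here, so the parent never finalised it
    # while the child was running
    return counter.value
```

Because the parent's `counter` name stays in scope across `start()` and `join()`, the underlying shared memory and lock are not released early, and the parent can read the child's update afterwards.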

As CPython uses fast-acting reference counting, you will see ill effects practically immediately. However, other implementations, e.g. PyPy, may hide such side-effects for an arbitrary time.
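
The timing difference can be observed without any child process at all. The sketch below (hypothetical names `Tracked` and `drop_and_observe`) records when `__del__` runs relative to `del` and an explicit `gc.collect()`. On CPython the finaliser is expected to run at the `del` statement itself; on implementations without reference counting it may only run at or after the collection.

```python
import gc

events = []


class Tracked:
    def __del__(self):
        events.append('finalised')


def drop_and_observe():
    """Return the order in which finalisation and the surrounding steps happened."""
    events.clear()
    obj = Tracked()
    del obj  # CPython: refcount hits zero, __del__ runs right here
    events.append('after del')
    gc.collect()  # other implementations may only finalise around this point
    events.append('after collect')
    return list(events)
```

If 'finalised' shows up before 'after del', the interpreter finalises eagerly and problems such as the ones above surface immediately; otherwise they can stay hidden until some later collection.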



Source: https://stackoverflow.com/questions/52869833/garbage-collection-of-shared-data-in-multiprocessing-via-fork
