Unable to pass an lxml etree object to a separate process

≡放荡痞女 提交于 2019-11-29 21:18:50

问题


I'm working on a project to parse multiple xml files concurrently in python using lxml. When I initialize the process I want my main class to do some work on the XML before it passes the etree object to the process, but I am finding that when the etree object arrives in the new process the class survives but the XML is gone from within the object and getroot() returns None.

I know that I can only pass pickable data using the queue, but is this also the case with what I pass to the process inside the 'args' field?

Here's my code:

import multiprocessing, multiprocessing.pool, time
from lxml import etree

def compute(tree):
    print("Start Process")
    print(type(tree)) # Returns <class 'lxml.etree._ElementTree'>
    print(id(tree)) # Returns new ID 44637320 as expected
    print(tree.getroot()) # Returns None

def pool_init(queue):
    # see http://stackoverflow.com/a/3843313/852994
    compute.queue = queue

class Main():
    def __init__(self):
        pass

    def main(self):
        tree = etree.parse('test.xml')
        print(id(tree)) # Returns object ID 43998536
        print(tree.getroot()) #Returns <Element SymCLI_ML at 0x29f5dc8>

        self.queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(processes=1, initializer=pool_init, initargs=(self.queue,))
        self.pool.apply_async(func=compute, args=(tree,))
        time.sleep(10)

if __name__ == '__main__':
    Main().main()

Any and all help much appreciated.

UPDATE/ANSWER

Based on the answer in the next post down I've modified it a bit and managed to get it working with a much lower memory footprint without using String IO. The etree.tostring method returns a byte array, which can be pickled, then to unpickle it the byte array can be parsed by etree.

import multiprocessing, multiprocessing.pool, time, copyreg
from lxml import etree

def compute(tree):
    print("Start Process")
    print(type(tree)) # Returns <class 'lxml.etree._ElementTree'>
    print(tree.getroot()) # Returns <Element SymCLI_ML at 0x29f5dc8>. Success!

def pool_init(queue):
    # see http://stackoverflow.com/a/3843313/852994
    compute.queue = queue

def elementtree_unpickler(data):
    return etree.parse(BytesIO(data))

def elementtree_pickler(tree):
    return elementtree_unpickler, (etree.tostring(tree),)

copyreg.pickle(etree._ElementTree, elementtree_pickler, elementtree_unpickler)

class Main():
    def __init__(self):
        pass

    def main(self):
        tree = etree.parse('test.xml')
        print(tree.getroot()) #Returns <Element SymCLI_ML at 0x29f5dc8>

        self.queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(processes=1, initializer=pool_init, initargs=(self.queue,))
        self.pool.apply_async(func=compute, args=(tree,))
        time.sleep(10)

if __name__ == '__main__':
    Main().main()

UPDATE 2

After doing some bench-marking with memory I found that passing large objects causes the objects to not be able to be cleared up by garbage collection on the main process. This probably isn't an issue at small scale, but by etree objects were in the order of multiple hundreds of MB in memory. As soon as an async task has been called with an XML object in the statement, that object cannot be cleared from memory if it is deleted from the main process, even my manually invoking garbage collection. So as a result I've reverted to closing the XML in the main process and passing the file name to the sub-process.


回答1:


Use the following code to register simple picklers/unpicklers for lxml Element/ElementTree objects. I used that in the past with lxml and zmq.

import copy_reg
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO
from lxml import etree

def element_unpickler(data):
    return etree.fromstring(data)

def element_pickler(element):
    data = etree.tostring(element)
    return element_unpickler, (data,)

copy_reg.pickle(etree._Element, element_pickler, element_unpickler)

def elementtree_unpickler(data):
    data = StringIO(data)
    return etree.parse(data)

def elementtree_pickler(tree):
    data = StringIO()
    tree.write(data)
    return elementtree_unpickler, (data.getvalue(),)

copy_reg.pickle(etree._ElementTree, elementtree_pickler, elementtree_unpickler)


来源:https://stackoverflow.com/questions/25991860/unable-to-pass-an-lxml-etree-object-to-a-separate-process

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!