How to reduce wand memory usage?

南方客 2021-01-13 09:36

I am using wand and pytesseract to get the text of pdfs uploaded to a django website like so:

image_pdf = Image(blob=read_pdf_file, resolution=300)
image_png         


        
4 answers
  •  [愿得一人]
    2021-01-13 09:56

    Remember that the wand library integrates with the MagickWand API, which in turn delegates PDF encoding/decoding work to ghostscript. Both MagickWand & ghostscript allocate additional memory resources, and do their best to deallocate at the end of each task. However, if routines are initialized by python, and held by a variable, it's more than possible to introduce memory leaks.

    Here are some tips to ensure memory is managed correctly.

    1. Use with context management for all Wand assignments. This will ensure all resources pass through __enter__ & __exit__ management handlers.

    2. Avoid blob creation for passing data. When creating a file-format blob, MagickWand will allocate additional memory to copy & encode the image, and python will hold the resulting data in addition to the originating wand instance. This is usually fine in a dev environment, but can get out of hand quickly in a production setting.

    3. Avoid Image.sequence. This is another copy-heavy routine, and results in python holding a bunch of memory resources. Remember ImageMagick manages the image stacks very well, so if you're not reordering / manipulating individual frames, it's best to use MagickWand methods & not involve python.

    4. Each task should be an isolated process that can shut down cleanly on completion. This shouldn't be an issue for you w/ celery as a queue worker, but it's worth double-checking the thread/worker configuration + docs.

    5. Watch out for resolution. A pdf resolution of 300 on a Q16 build would result in a massive raster image. With many OCR (tesseract/opencv) techniques, the first step is to pre-process the inbound data to remove extra/unneeded colors / channels / data / etc.
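
    To put rough numbers on tip 5, here is a back-of-the-envelope sketch. It assumes a US Letter page (8.5 x 11 in) and 4 channels at 2 bytes each, which is roughly what a Q16 build without HDRI keeps per pixel; the function name raster_bytes and its defaults are my own, and real usage depends on your ImageMagick build and page size.

```python
def raster_bytes(dpi, width_in=8.5, height_in=11.0,
                 channels=4, bytes_per_channel=2):
    """Estimate the in-memory size of one rasterized page.

    Assumptions (not taken from wand itself): US Letter page size,
    4 channels, 2 bytes per channel (a Q16 build without HDRI).
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels * bytes_per_channel


if __name__ == "__main__":
    for dpi in (300, 100):
        mib = raster_bytes(dpi) / 2**20
        print(f"{dpi} dpi: about {mib:.1f} MiB per page")
```

    Under those assumptions a single page is roughly 64 MiB at 300 dpi versus roughly 7 MiB at 100 dpi, since the pixel count grows with the square of the resolution. Multiply by the page count to see why a long pdf can exhaust a worker.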

    Here's an example of how I would approach this. Note, I'll leverage ctypes to directly manage the image stack w/o additional python resources.

    import ctypes
    import pytesseract
    from wand.image import Image
    from wand.api import library
    
    # Tell wand about C-API method
    library.MagickNextImage.argtypes = [ctypes.c_void_p]
    library.MagickNextImage.restype = ctypes.c_int
    
    # ... Skip to calling method ...
    
    final_text = []
    with Image(blob=read_pdf_file, resolution=100) as context:
        context.depth = 8
        library.MagickResetIterator(context.wand)
        while(library.MagickNextImage(context.wand) != 0):
            data = context.make_blob("RGB")
            text = pytesseract.image_to_string(data)
            final_text.append(text)
    return " ".join(final_text)
    

    Of course your mileage may vary. If you're comfortable with subprocess, you may be able to execute gs & tesseract directly, and eliminate all the python wrappers.
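
    A minimal sketch of that subprocess route, assuming the gs and tesseract binaries are on PATH and that the pdf has been written to disk first (the question passes a blob instead). The helper names gs_command, tesseract_command and pdf_to_text are hypothetical, and the pnggray device and 100 dpi default are just one reasonable choice for OCR.

```python
import subprocess
import tempfile
from pathlib import Path


def gs_command(pdf_path, output_pattern, dpi=100):
    """Build a Ghostscript command that writes one grayscale PNG per page."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pnggray",                # grayscale keeps the rasters small
        f"-r{dpi}",                        # resolution in dpi
        f"-sOutputFile={output_pattern}",  # %03d expands to the page number
        str(pdf_path),
    ]


def tesseract_command(image_path):
    """Build a tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout"]


def pdf_to_text(pdf_path, dpi=100):
    """Rasterize with gs, OCR each page with tesseract, and join the text."""
    with tempfile.TemporaryDirectory() as tmp:
        pattern = str(Path(tmp) / "page-%03d.png")
        subprocess.run(gs_command(pdf_path, pattern, dpi), check=True)
        pages = []
        # Zero-padded names sort in page order.
        for png in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                tesseract_command(png),
                check=True, capture_output=True, text=True,
            )
            pages.append(result.stdout)
    return " ".join(pages)
```

    Each gs and tesseract call is its own short-lived process, so whatever memory it allocates goes back to the OS when it exits, and the python worker only ever holds the extracted text.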
