How to reduce wand memory usage?

前端未结

关注

 4  1663

I am using wand and pytesseract to get the text of pdfs uploaded to a django website like so:

image_pdf = Image(blob=read_pdf_file, resolution=300)
image_png


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  情深已故        
                
              
                            
                2021-01-13 09:38
              
            
            
                                                                       
The code from @emcconville works, and my temp folder is not filling up with magick-* files anymore

I needed to Import ctypes and not cstyles

I also got the error mentioned by @kerthik

solved it by saving the image and loading it again, it is properly also possible to save it to memory

from PIL import Image as PILImage

...
context.save(filename="temp.jpg")
text = pytesseract.image_to_string(PILImage.open("temp.jpg"))`


EDIT
I found the in memory conversion on How to convert wand.image.Image to PIL.Image?

img_buffer = np.asarray(bytearray(context.make_blob(format='png')),dtype='uint8')
bytesio = io.BytesIO(img_buffer)
text = ytesseract.image_to_string(PILImage.open(bytesio),lang="dan")

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  [愿得一人]        
                
              
                            
                2021-01-13 09:56
              
            
            
                                                                       
Remember that the wand library integrates with MagickWand API, and in turn, delegates PDF encoding/decoding work to ghostscript. Both MagickWand & ghostscript allocated additional memory resources, and do there best to deallocate at the end of each task. However, if routines are initialized by python, and held by a variable, it's more than possible to introduce memory-leaks.

Here's some tips to ensure memory is managed correctly.


Use with context management for all Wand assignments. This will ensure all resources pass through __enter__ & __exit__ management handlers.
Avoid blob creation for passing data. When creating a file-format blob, MagickWand will allocated additional memory to copy & encode the image, and python will hold resulting data in addition to the originating wand instance. Usually fine on the dev environment, but can grow out of hand quickly in a production setting.
Avoid Image.sequence. This is another copy-heavy routine, and results in python holding a bunch of memory resources. Remember ImageMagick manages the image stacks very well, so if you're not reordering / manipulating individual frames, it's best to use MagickWand methods & not involve python.
Each task should be an isolated process, and can cleanly shut-down on completion. This shouldn't be an issue for you w/ celery as a queue worker, but worth double checking the thread/worker configuration + docs.
Watch out for resolution. A pdf resolution of 300 @ 16Q would result in a massive raster image. With many OCR (tesseract/opencv) techniques, the first step is to pre-process the inbound data to remove extra/unneeded colors / channels / data / &tc.


Here's an example of how I would approach this. Note, I'll leverage ctypes to directly manage the image stack w/o additional python resources. 

import ctyles
from wand.image import Image
from wand.api import library

# Tell wand about C-API method
library.MagickNextImage.argtypes = [ctypes.c_void_p]
library.MagickNextImage.restype = ctypes.c_int

# ... Skip to calling method ...

final_text = []
with Image(blob=read_pdf_file, resolution=100) as context:
    context.depth = 8
    library.MagickResetIterator(context.wand)
    while(library.MagickNextImage(context.wand) != 0):
        data = context.make_blob("RGB")
        text = pytesseract.image_to_string(data)
        final_text.append(text)
return " ".join(final_text)


Of course your milage may vary. If your comfortable with subprocess, you may be able to execute gs & tesseract directly, and eliminate all the python wrappers.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  谎友^        
                
              
                            
                2021-01-13 09:57
              
            
            
                                                                       
I run into a similar issue.
Found this page interesting: http://www.imagemagick.org/script/architecture.php#tera-pixel
And how to limit the amount of memory used by ImageMagick through wand:
http://docs.wand-py.org/en/latest/wand/resource.html
Just adding something like:
from wand.resource import limits

# Use 100MB of ram before writing temp data to disk.
limits['memory'] = 1024 * 1024 * 100

It may increase the computation time (but like you, I don't mind too much) and I actually did not notice so much difference.
I confirmed using Python's memory-profiler that it is working as expected.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情话喂你        
                
              
                            
                2021-01-13 10:02
              
            
            
                                                                       
I was also suffering from memory leaks issues. After some research and tweaking the code implementation, my issues were resolved.
I basically worked correctly using with and destroy() function.

In some cases I could use  with  to open and read the files, as in the example below:

with Image(filename = pdf_file, resolution = 300) as pdf:


This case, using with, the memory and tmp files are correctly managed.

And in another case I had to use the destroy() function, preferably inside a try / finally block, as below:

try:
    for img in pdfImg.sequence:
    # your code
finally:
    pdfImg.destroy()


The second case, is an example where I cann't use with because I had to iterate the pages through the sequence, so, I already had the file open and was iterating your pages.

This conbination of solution resolved my problems with memory leaks.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复