Whitespace gone from PDF extraction, and strange word interpretation


爱一瞬间的悲伤 2020-12-01 11:26

Using the snippet below, I've attempted to extract the text data from this PDF file.

import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = p


      
      
        
6 answers

时光取名叫无心 (original poster) 2020-12-01 12:03
              

            
            
                        
PDFBox is a pretty good tool for extracting text from PDF files using Java. Text extraction is its strength; if you want to modify, annotate, or view PDF files, another tool might serve you better. It includes code for working out where the spaces between words belong, which is the part your extraction is losing.

It also has code for handling ligatures, but that only works if a particular internationalization library, ICU4J, is on the classpath.

You could call the PDFBox text extractor from Python as a command-line program, without writing any Java code.
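
As a rough sketch of that last suggestion, something like the following could work. It assumes the PDFBox 2.x standalone "app" jar and its `ExtractText` command; the jar file name `pdfbox-app.jar` and the helper names `build_extract_command` and `extract_text` are placeholders of mine, not part of PDFBox. PDFBox 3.x changed its command-line syntax, so check the documentation for the version you download.

```python
import subprocess
import tempfile
from pathlib import Path

# Assumption: placeholder path to the PDFBox 2.x standalone app jar.
PDFBOX_JAR = "pdfbox-app.jar"


def build_extract_command(pdf_path, txt_path, jar_path=PDFBOX_JAR):
    """Build the PDFBox 2.x command line: ExtractText <input.pdf> <output.txt>."""
    return [
        "java", "-jar", str(jar_path),
        "ExtractText",
        "-encoding", "UTF-8",  # write the output text file as UTF-8
        str(pdf_path),
        str(txt_path),
    ]


def extract_text(pdf_path, jar_path=PDFBOX_JAR):
    """Run PDFBox on pdf_path and return the extracted text as a string."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        txt_path = Path(tmp_dir) / "out.txt"
        # check=True raises CalledProcessError if java exits with an error.
        subprocess.run(build_extract_command(pdf_path, txt_path, jar_path), check=True)
        return txt_path.read_text(encoding="utf-8")
```

With Java installed and the jar in the working directory, `text = extract_text("document.pdf")` would return the extracted text. Writing to a temporary file rather than parsing console output keeps the encoding handling predictable.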
    
             
                                                        
            
            
              
                