My sample code for reading a text file is:

val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
PDFs can be parsed in PySpark as follows:

If the PDFs are stored in HDFS, use sc.binaryFiles(), since a PDF is stored in binary format. The binary content can then be passed to pdfminer for parsing.
import io
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def return_device_content(cont):
    # Wrap the raw PDF bytes in an in-memory file object for pdfminer
    fp = io.BytesIO(cont)
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    return document

filesPath = "/user/root/*.pdf"
fileData = sc.binaryFiles(filesPath)
# Each record is a (file path, file content) pair; keep only the binary content
file_content = fileData.map(lambda content: content[1])
file_content1 = file_content.map(return_device_content)
Further parsing can be done using the functionality provided by pdfminer.
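For example, the text of every page can be extracted with pdfminer's standard interpreter/converter pipeline. Below is a minimal sketch assuming pdfminer.six is installed on the executors; the extract_pdf_text helper and the texts variable are just illustrative names, not part of the original code:

import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def extract_pdf_text(cont):
    # Parse the raw PDF bytes and return the text of all pages as one string
    output = io.StringIO()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with io.BytesIO(cont) as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()
    return output.getvalue()

# Keep the file path so you know which PDF each text came from
texts = fileData.map(lambda kv: (kv[0], extract_pdf_text(kv[1])))

Each element of texts is then a (file path, extracted text) pair that can be processed with the usual RDD operations.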