How to read PDF files and XML files in Apache Spark with Scala?

有刺的猬 2020-12-19 21:35

My sample code for reading a text file is:

val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitio         


        
3 Answers
  •  感动是毒
    2020-12-19 21:50

    PDFs can be parsed in PySpark as follows:

    If the PDFs are stored in HDFS, read them with sc.binaryFiles(), since PDF is a binary format. The binary content can then be passed to pdfminer for parsing.

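    import io

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument

    def return_device_content(cont):
        # Wrap the raw bytes in a file-like object, as PDFParser expects
        fp = io.BytesIO(cont)
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        # Return the parsed document so the mapped RDD is not full of None
        return document

    filesPath = "/user/root/*.pdf"
    # binaryFiles yields (path, content) pairs; keep only the content
    fileData = sc.binaryFiles(filesPath)
    file_content = fileData.map(lambda content: content[1])
    file_content1 = file_content.map(return_device_content)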
    

    Further parsing can be done using the functionality provided by pdfminer.
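
    As one possible next step, here is a minimal sketch of turning each PDF's bytes into plain text. It assumes the pdfminer.six fork is installed on every executor, since that is where pdfminer.high_level.extract_text comes from. The names extract_text_from_pdf_bytes and parse_pair, and the extractor parameter, are hypothetical helpers introduced for illustration and are not part of pdfminer or Spark.

```python
import io


def extract_text_from_pdf_bytes(content, extractor=None):
    """Return the text of one PDF given its raw bytes.

    `extractor` is a hypothetical hook: any callable that takes a binary
    file-like object and returns a string. When omitted, pdfminer.six's
    extract_text is imported lazily, so the import runs on the executor.
    """
    if extractor is None:
        # Assumption: pdfminer.six is installed on every executor.
        from pdfminer.high_level import extract_text as extractor
    return extractor(io.BytesIO(content))


def parse_pair(pair, extractor=None):
    """Map one (path, bytes) element from sc.binaryFiles to (path, text, error)."""
    path, content = pair
    try:
        return (path, extract_text_from_pdf_bytes(content, extractor), None)
    except Exception as exc:
        # A single corrupt PDF should not fail the whole job;
        # record the error next to the path instead.
        return (path, None, repr(exc))


# Usage inside a Spark job (assumes an existing SparkContext `sc`):
# texts = sc.binaryFiles("/user/root/*.pdf").map(parse_pair)
```

    Keeping the path alongside the text makes it possible to trace each result back to its file, and returning plain strings avoids having to serialize pdfminer objects between Spark stages.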
