My sample code for reading a text file is:

val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
PDFs can be parsed in PySpark as follows:

If the PDFs are stored in HDFS, use sc.binaryFiles(), since a PDF is stored in binary format. The binary content can then be passed to pdfminer for parsing.
import io
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def return_device_content(cont):
    # Wrap the raw PDF bytes in an in-memory file object for pdfminer
    fp = io.BytesIO(cont)
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    return document

filesPath = "/user/root/*.pdf"
fileData = sc.binaryFiles(filesPath)
# Each record is a (file path, file content) pair; keep only the binary content
file_content = fileData.map(lambda content: content[1])
file_content1 = file_content.map(return_device_content)
Further parsing can be done using the functionality provided by pdfminer.
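For example, the text of every page can be extracted with pdfminer's standard interpreter/converter pipeline. Below is a minimal sketch assuming pdfminer.six is installed on the executors; the extract_pdf_text helper and the texts variable are just illustrative names, not part of the original code:

import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def extract_pdf_text(cont):
    # Parse the raw PDF bytes and return the text of all pages as one string
    output = io.StringIO()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with io.BytesIO(cont) as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()
    return output.getvalue()

# Keep the file path so you know which PDF each text came from
texts = fileData.map(lambda kv: (kv[0], extract_pdf_text(kv[1])))

Each element of texts is then a (file path, extracted text) pair that can be processed with the usual RDD operations.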