Text From PDF in Spark

Question

I'm trying to extract text from PDF files stored in HDFS using PDFBox.

However, it throws an error:

"Exception in thread "main" org.apache.spark.SparkException: ...
java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf 
(No such file or directory)"

What am I missing? Should I be working with the PortableDataStream instead of the String part of each pair in:

val files: RDD[(String, PortableDataStream)]?

def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
  val file: File = new File(fileNameFromRDD._1.drop(5))
  val document = PDDocument.load(file) // It throws an error here.

  if (!document.isEncrypted()) {
    val stripper = new PDFTextStripper()
    val text = stripper.getText(document)
    println("Text:" + text)
  }
  document.close()
}

// This is where I call the above PDF-to-text converter method.
val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
files.foreach(println)

files.foreach(f => println(f._1))

files.foreach(fileStream => pdfRead(fileStream, sparkSession))
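The error is consistent with java.io.File only resolving paths on the local filesystem: dropping the "hdfs:" prefix leaves a string that File treats as a local path, which does not exist on the executor. A minimal sketch of the PortableDataStream route asked about above follows. It assumes PDFBox 2.x, where PDDocument.load accepts an InputStream (PDFBox 3.x moved loading to a separate Loader class), and the names PdfTextFromHdfs and extractText are hypothetical, introduced only for illustration.

```scala
import java.io.InputStream

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

// Hypothetical object name; assumes PDFBox 2.x on the executor classpath.
object PdfTextFromHdfs {

  // Reads a PDF from any InputStream and returns its text,
  // or None when the document is encrypted.
  def extractText(in: InputStream): Option[String] = {
    val document = PDDocument.load(in)
    try {
      if (!document.isEncrypted) Some(new PDFTextStripper().getText(document))
      else None
    } finally {
      document.close()
    }
  }

  // Uses the stream half of the (path, stream) pair instead of the path string,
  // so the bytes come from HDFS rather than the local filesystem.
  def pdfRead(entry: (String, PortableDataStream)): (String, Option[String]) = {
    val (path, stream) = entry
    val in = stream.open() // DataInputStream over the HDFS file
    try {
      (path, extractText(in))
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PdfTextFromHdfs").getOrCreate()

    val files = spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

    // map + collect brings the text back to the driver; println inside
    // foreach would write to executor logs instead.
    files.map(pdfRead).collect().foreach {
      case (path, Some(text)) => println(s"$path -> Text: $text")
      case (path, None)       => println(s"$path -> encrypted, skipped")
    }

    spark.stop()
  }
}
```

In this sketch the SparkSession is not passed into pdfRead, since the function runs on executors and only needs the stream. Collecting to the driver is reasonable for a handful of small files; for many or large PDFs, writing the results out from the RDD would likely be a better fit.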

Source: https://stackoverflow.com/questions/52546241/text-from-pdf-in-spark

Tags

apache-spark

pdf

pdfbox
