Classpath issues running Tika on Spark


Question


I'm trying to process a bunch of files with Tika. The number of files is in the thousands, so I decided to build an RDD of the files and let Spark distribute the workload. Unfortunately I get multiple NoClassDefFoundError exceptions.

This is my sbt file:

name := "TikaFileParser"
version := "0.1"
scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
libraryDependencies += "org.apache.tika" % "tika-core" % "1.11"
libraryDependencies += "org.apache.tika" % "tika-parsers" % "1.11"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.1" % "provided"

This is my assembly.sbt

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

And this is the source file:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.input.PortableDataStream
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.apache.tika.sax.WriteOutContentHandler
import java.io._

object TikaFileParser {

  def tikaFunc (a: (String, PortableDataStream)) = {
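    // a._1 is the file URI from binaryFiles (e.g. "file:/home/user/doc.pdf"); drop(5) below strips the "file:" prefix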

    val file : File = new File(a._1.drop(5))
    val myparser : AutoDetectParser = new AutoDetectParser()
    val stream : InputStream = new FileInputStream(file)
    val handler : WriteOutContentHandler = new WriteOutContentHandler(-1)
    val metadata : Metadata = new Metadata()
    val context : ParseContext = new ParseContext()

    myparser.parse(stream, handler, metadata, context)

    stream.close

    println(handler.toString())
    println("------------------------------------------------")
  }


  def main(args: Array[String]) {

    val filesPath = "/home/user/documents/*"
    val conf = new SparkConf().setAppName("TikaFileParser")
    val sc = new SparkContext(conf)
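    // binaryFiles returns an RDD of (path, PortableDataStream) pairs, one entry per file matching filesPath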
    val fileData = sc.binaryFiles(filesPath)
    fileData.foreach( x => tikaFunc(x))
  }
}

I am running this with

spark-submit --driver-memory 2g --class TikaFileParser --master local[4]
             /path/to/TikaFileParser-assembly-0.1.jar

This gives me java.lang.NoClassDefFoundError: org/apache/cxf/jaxrs/ext/multipart/ContentDisposition, which is a dependency of one of the parsers. Out of curiosity I added the jar containing this class to Spark's --jars option and ran it again. This time I got a different NoClassDefFoundError (I can't remember which one, but it was also a Tika dependency).

I already found a similar problem here (Apache Tika 1.11 on Spark NoClassDeftFoundError) where the solution was to build a fat jar. But I would like to know if there is any other way to solve the dependency issues?

Btw: I tried this snippet without Spark (I just used an Array with the file names and a foreach loop, and changed the tikaFunc signature accordingly). I ran it without any arguments and it worked perfectly.
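
For reference, the Spark-free test looked roughly like this (the object name, the directory path, and the use of listFiles are illustrative, not the exact code I ran):

import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext}
import org.apache.tika.sax.WriteOutContentHandler
import java.io.{File, FileInputStream, InputStream}

// Spark-free variant: the same Tika parsing logic, driven by a plain Array of file paths.
object TikaLocalTest {

  def tikaFunc(path: String): Unit = {
    val stream: InputStream = new FileInputStream(new File(path))
    val handler = new WriteOutContentHandler(-1)   // -1 = no write limit
    val metadata = new Metadata()
    val context = new ParseContext()

    try new AutoDetectParser().parse(stream, handler, metadata, context)
    finally stream.close()

    println(handler.toString)
    println("------------------------------------------------")
  }

  def main(args: Array[String]): Unit = {
    // Illustrative: collect the file paths into a plain Array and loop over them.
    val files = new File("/home/user/documents").listFiles().map(_.getAbsolutePath)
    files.foreach(tikaFunc)
  }
}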

Edit: Updated the snippets for use with sbt-assembly.


Answer 1:


I already found a similar problem here (Apache Tika 1.11 on Spark NoClassDeftFoundError) where the solution was to build a fat jar. But I would like to know if there is any other way to solve the dependency issues?

Find all the dependencies and add them to --jars. You can do it with https://github.com/jrudolph/sbt-dependency-graph. But I don't see why you'd prefer this to building one jar combining them all.
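
For completeness, a sketch of that route (the plugin version, jar names, and paths below are assumptions, adjust them to your environment). Add the plugin to project/plugins.sbt:

// project/plugins.sbt -- the version is an example, check the plugin's README for the current one
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2")

Then sbt dependencyTree lists the transitive dependencies of tika-core and tika-parsers, and every jar it reports has to be passed to spark-submit, e.g.:

spark-submit --driver-memory 2g --class TikaFileParser --master local[4] \
             --jars /path/to/tika-core-1.11.jar,/path/to/tika-parsers-1.11.jar,/path/to/cxf-rt-frontend-jaxrs.jar,... \
             /path/to/TikaFileParser-0.1.jar

With tika-parsers that list is very long, which is exactly why one combined jar is more practical.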

I ran it without any arguments and it worked perfectly.

SBT already ensures you have all the dependencies on the classpath, but Spark doesn't use SBT to run your program.




Answer 2:


The issues came from version mismatches in the jars. I decided on the following sbt file which solves my problem:

name := "TikaFileParser"
version := "0.1"
scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
libraryDependencies += "org.apache.tika" % "tika-core" % "1.11"
libraryDependencies += "org.apache.tika" % "tika-parsers" % "1.11"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.1" % "provided"

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case _     => MergeStrategy.first
  }
}
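
With this build file the workflow is simply to build the fat jar and submit it; by default sbt-assembly writes it to target/scala-2.11/ (the exact path and jar name may differ in your setup):

sbt assembly
spark-submit --driver-memory 2g --class TikaFileParser --master local[4] \
             target/scala-2.11/TikaFileParser-assembly-0.1.jar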



Answer 3:


I want to correct @flowit's answer, as it cost me a long day of investigation.

The problem with that answer is the merge strategy, which discards everything under META-INF. However, this also throws away the META-INF/services directory, which is where Tika registers, among other things, its parsers.

With the merge strategy from the accepted answer (or from other Stack Overflow answers floating around), Tika cannot resolve any parsers and falls back to the EmptyParser, so every parse produces empty content. See https://tika.apache.org/1.21/configuring.html#Static.
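
Concretely, Tika's parsers are registered via the Java ServiceLoader mechanism: tika-parsers ships a file along the lines of META-INF/services/org.apache.tika.parser.Parser whose lines name the parser implementations (an abbreviated, illustrative excerpt):

org.apache.tika.parser.html.HtmlParser
org.apache.tika.parser.microsoft.OfficeParser
org.apache.tika.parser.pdf.PDFParser
...

If assembly discards that file, AutoDetectParser has nothing to delegate to.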

The solution for me was (using the newer sbt-assembly syntax, I believe):

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) =>
    (xs map {_.toLowerCase}) match {
      case "services" :: xs => MergeStrategy.concat // Tika uses the META-INF/services to register its parsers statically, don't discard it
      case _ => MergeStrategy.discard
    }
  case x => MergeStrategy.first
}
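
A quick sanity check after assembling (the jar path is the sbt-assembly default and may differ for you) is to confirm that the merged service file actually made it into the fat jar:

unzip -p target/scala-2.11/TikaFileParser-assembly-0.1.jar META-INF/services/org.apache.tika.parser.Parser

If this prints a list of parser class names, the concat strategy worked; if the entry is missing or empty, Tika will silently fall back to the EmptyParser.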


Source: https://stackoverflow.com/questions/34293027/classpath-issues-running-tika-on-spark
