Classpath issues running Tika on Spark

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-06 02:50:00

I already found a similar problem here (Apache Tika 1.11 on Spark NoClassDeftFoundError), where the solution was to build a fat jar. But I would like to know whether there is any other way to solve the dependency issues.

Find all the dependencies and pass them to spark-submit with --jars. You can list them with https://github.com/jrudolph/sbt-dependency-graph. But I don't see why you'd prefer this to building one jar that combines them all.
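
As a rough sketch of the first step, sbt-dependency-graph is enabled from project/plugins.sbt. The version number below is an assumption, not something stated in this thread, so check the plugin's README for a release that matches your sbt version.

```scala
// project/plugins.sbt
// Assumed version; choose the release that fits your sbt installation.
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.9.2")
```

With the plugin loaded, running sbt dependencyTree (or sbt dependencyList) prints the resolved dependencies, including the transitive ones. Those are the jars that would have to be listed after --jars.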

I ran it without any arguments and it worked perfectly.

SBT already ensures you have all the dependencies on the classpath, but Spark doesn't use SBT to run your program.

The issues came from version mismatches in the jars. I decided on the following sbt file which solves my problem:

name := "TikaFileParser"
version := "0.1"
scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"
libraryDependencies += "org.apache.tika" % "tika-core" % "1.11"
libraryDependencies += "org.apache.tika" % "tika-parsers" % "1.11"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.1" % "provided"

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case _     => MergeStrategy.first
  }
}

I want to correct the answer of @flowit, as it threw me into a long day of investigations.

The problem with that answer is the merge strategy, which discards everything under META-INF. That also removes the META-INF/services directory, which is where Tika registers, among other things, its parsers.

With the merge strategy from the accepted answer, or the similar ones found in other Stack Overflow answers, you end up with empty content: Tika cannot resolve any parsers and falls back to the EmptyParser, so whatever you try to parse yields nothing. See https://tika.apache.org/1.21/configuring.html#Static.
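
One way to check whether a given fat jar is affected is to look inside it for service files. The following is a minimal sketch that uses only the JDK's JarFile API; the object name ServiceEntryCheck and its method are hypothetical helpers made up for illustration, not part of Tika or sbt-assembly.

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

// Hypothetical helper: lists the service registration files kept in a jar.
object ServiceEntryCheck {

  /** Names of all non-directory entries under META-INF/services/ in the jar. */
  def serviceEntries(jarPath: String): List[String] = {
    val jar = new JarFile(jarPath)
    try {
      jar.entries().asScala
        .filter(e => !e.isDirectory && e.getName.startsWith("META-INF/services/"))
        .map(_.getName)
        .toList // materialize before the jar is closed
    } finally jar.close()
  }

  def main(args: Array[String]): Unit = {
    val jarPath = args.headOption.getOrElse(
      sys.error("usage: ServiceEntryCheck <path-to-assembly.jar>"))
    val entries = serviceEntries(jarPath)
    if (entries.isEmpty)
      println("No META-INF/services entries found; service registrations were dropped.")
    else
      entries.sorted.foreach(println)
  }
}
```

For a jar that bundles tika-parsers correctly, the list should include META-INF/services/org.apache.tika.parser.Parser. If that entry is missing, the merge strategy has most likely discarded it.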

The solution for me was the following (using a newer sbt syntax, I guess):

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) =>
    (xs map {_.toLowerCase}) match {
      // Tika uses META-INF/services to register its parsers statically, so don't discard it
      case "services" :: _ => MergeStrategy.concat
      case _ => MergeStrategy.discard
    }
  case _ => MergeStrategy.first
}
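
For newer builds, the same idea can be written in the slash syntax. This is only a sketch and assumes sbt 1.x with a 1.x release of the sbt-assembly plugin, neither of which is confirmed by this thread. Unlike the version above, it matches the directory name case-sensitively.

```scala
// build.sbt (assumes sbt 1.x and sbt-assembly 1.x)
assembly / assemblyMergeStrategy := {
  // Keep the service files and concatenate them, so Tika can still find its parsers
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  // Drop the rest of META-INF (manifests, signatures, licenses)
  case PathList("META-INF", _*)             => MergeStrategy.discard
  // For any other conflict, take the first file found
  case _                                    => MergeStrategy.first
}
```

The order of the cases matters: the services pattern has to come before the general META-INF pattern, otherwise the service files are discarded again.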