How to read PDF files and XML files in Apache Spark with Scala?

有刺的猬 2020-12-19 21:35

My sample code for reading a text file is:

val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitio         


        
3 Answers
  •  盖世英雄少女心
    2020-12-19 21:45

    You can use spark-shell with Apache Tika and run the code below, either sequentially or in a distributed manner, depending on your use case:

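    # Start spark-shell with the Tika application jar on the classpath
    spark-shell --jars tika-app-1.8.jar

    // Read every file under /data/ as a (path, PortableDataStream) pair
    val binRDD = sc.binaryFiles("/data/")
    // Let Tika detect each file's type (PDF, XML, ...) and extract its text
    val textRDD = binRDD.map { case (_, stream) => new org.apache.tika.Tika().parseToString(stream.open()) }
    textRDD.saveAsTextFile("/output/")
    System.exit(0)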
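
    A possible variation on the snippet above, as a sketch rather than a definitive implementation: it keeps each file's path next to its extracted text, reuses one Tika instance per partition, and records parse failures instead of failing the whole job. It assumes the same spark-shell session started with the Tika jar; the output path "/output-with-paths/" and the "ERROR:" marker are hypothetical choices made for illustration.

    ```scala
    import scala.util.{Try, Success, Failure}
    import org.apache.tika.Tika

    // Run inside spark-shell, where sc is predefined.
    // "/data/" matches the snippet above; the output path is a placeholder.
    val files = sc.binaryFiles("/data/")

    val extracted = files.mapPartitions { iter =>
      // Create Tika inside the partition so it never has to be serialized
      val tika = new Tika()
      iter.map { case (path, stream) =>
        Try(tika.parseToString(stream.open())) match {
          case Success(text) => (path, text)
          case Failure(e)    => (path, "ERROR: " + e.getMessage) // hypothetical marker
        }
      }
    }

    // Flatten whitespace so each file becomes one tab-separated output line
    extracted.map { case (path, text) =>
      val flat = text.replaceAll("\\s+", " ")
      path + "\t" + flat
    }.saveAsTextFile("/output-with-paths/")
    ```

    By default the Tika facade truncates the extracted string at a maximum length; if long documents come out cut off, `tika.setMaxStringLength(-1)` is the setting to look at.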
    
