Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark? For example, I have thousands of pdf invoices and I want to read data from those and perform some analytics on that. What steps must I do to process unstructured data?
可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
回答1:
Yes, it is. Use sparkContext.binaryFiles
to load files in binary format and then use map
to map value to some other format - for example, parse binary with Apache Tika or Apache POI.
Pseudocode:
val rawFile = sparkContext.binaryFiles(... val ready = rawFile.map ( here parsing with other framework
What is important, parsing must be done with other framework like mentioned previously in my answer. Map will get InputStream as an argument