Is it possible to read PDF/audio/video files (unstructured data) using Apache Spark?
Question:

Is it possible to read PDF, audio, or video files (unstructured data) using Apache Spark? For example, I have thousands of PDF invoices, and I want to read data from them and run some analytics on it. What steps do I need to take to process unstructured data?

Answer 1:

Yes, it is. Use sparkContext.binaryFiles to load the files in binary format, then use map to convert each value into another format, for example by parsing the binary content with Apache Tika or Apache POI.

Pseudocode: val rawFile = sparkContext.binaryFiles(...
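The approach in the answer (load with binaryFiles, then map each file through a parser) might look roughly like the sketch below. This is not the answerer's original code, which is cut off above. It assumes that spark-sql, tika-core, and the Tika parser modules are on the classpath. The input path, the object name PdfTextExtraction, and the column names are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.tika.Tika
import scala.util.Try

object PdfTextExtraction {
  def main(args: Array[String]): Unit = {
    // Hypothetical input location; pass your own path or glob as the first argument.
    val inputPath = args.headOption.getOrElse("hdfs:///data/invoices/*.pdf")

    val spark = SparkSession.builder().appName("pdf-text-extraction").getOrCreate()

    // Each element is (file path, PortableDataStream); bytes are read lazily on executors.
    val rawFiles = spark.sparkContext.binaryFiles(inputPath)

    val extracted = rawFiles.mapPartitions { files =>
      // Tika is not Serializable, so create one instance per partition on the executor.
      val tika = new Tika()
      tika.setMaxStringLength(-1) // remove the default cap on extracted text length
      files.map { case (path, stream) =>
        val in = stream.open()
        // A file that fails to parse becomes None instead of failing the whole job.
        val text = try Try(tika.parseToString(in)).toOption finally in.close()
        (path, text)
      }
    }

    import spark.implicits._
    val df = extracted.toDF("path", "text")
    df.show(5, 80) // print 5 rows, truncating long cells to 80 characters
    spark.stop()
  }
}
```

From here, the analytics step would work on the text column, for instance by pulling out invoice numbers or totals with regular expressions. binaryFiles reads each file whole, so this pattern fits many small-to-medium files better than a handful of very large ones. Audio and video would need a different parser in place of the text extraction step.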