Using POI or Tika to extract text, stream-to-stream without loading the entire file in memory
Question: I'm trying to use Apache POI and PDFBox, either by themselves or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e. hundreds of megabytes in some cases). Also, my application is multi-threaded, so I will be parsing many of these large files concurrently. At that scale, I MUST work with the files in a streaming manner. It's not an option to hold an entire file in main memory at any step along the way. I have seen many source code