I have a 150 MB single-sheet Excel file that takes about 7 minutes to open on a very powerful machine using the following:
# using python
import xlrd
wb = xlrd.open_
I have created a sample Java program that loads the file in ~40 seconds on my laptop (Intel i7, 4 cores, 16 GB RAM).
https://github.com/skadyan/largefile
This program uses the Apache POI library to load the .xlsx file using the XSSF SAX API.
An implementation of the callback interface com.stackoverlfow.largefile.RecordHandler can be used to process the data loaded from the Excel file. This interface defines only one method, which takes three arguments, including:
data map: Map: Excel cell reference and Excel-formatted cell value
The class com.stackoverlfow.largefile.Main demonstrates one basic implementation of this interface, which just prints the row number to the console.
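To make the callback idea concrete, here is a minimal sketch that uses only the JDK's built-in SAX parser instead of POI. The RowCallback interface, its two-argument method and the class name RowCallbackSketch are hypothetical and are not the repository's RecordHandler signature. The sketch also assumes that every row element carries an r attribute and that cell values appear inline in v elements.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class RowCallbackSketch {

    /** Hypothetical callback, NOT the repository's actual RecordHandler. */
    interface RowCallback {
        void onRow(int rowNum, Map<String, String> cells);
    }

    /** Streams worksheet XML and invokes the callback once per row element. */
    static void parse(InputStream sheetXml, RowCallback callback) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private final Map<String, String> cells = new LinkedHashMap<>();
            private final StringBuilder text = new StringBuilder();
            private String cellRef;
            private int rowNum;
            private boolean inValue;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                switch (qName) {
                    case "row": // assumption: the r attribute is always present
                        rowNum = Integer.parseInt(atts.getValue("r"));
                        cells.clear();
                        break;
                    case "c":   // remember the cell reference, e.g. "B7"
                        cellRef = atts.getValue("r");
                        break;
                    case "v":   // start collecting the raw cell value
                        inValue = true;
                        text.setLength(0);
                        break;
                    default:
                        break;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inValue) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("v".equals(qName)) {
                    inValue = false;
                    cells.put(cellRef, text.toString());
                } else if ("row".equals(qName)) {
                    // Hand over a copy so the callback may keep the map.
                    callback.onRow(rowNum, new LinkedHashMap<>(cells));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(sheetXml, handler);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<worksheet><sheetData>"
                + "<row r=\"1\"><c r=\"A1\"><v>10</v></c><c r=\"B1\"><v>20</v></c></row>"
                + "</sheetData></worksheet>";
        // Like the Main class described above, this just reports each row.
        parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
              (rowNum, cells) -> System.out.println("row " + rowNum + " -> " + cells));
    }
}
```

Only one row is held in memory at a time, which is why the streaming approach copes with large sheets. A real .xlsx file would also need the worksheet part extracted from the zip container and shared strings resolved, both of which POI handles.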
Update
The Woodstox parser seems to perform better than the standard SAXReader (code updated in the repo).
Also, to meet the desired performance requirement, you may consider re-implementing org.apache.poi...XSSFSheetXMLHandler. A custom implementation can handle string/text values more efficiently and skip unnecessary text-formatting operations.
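As a rough illustration of what such a leaner handler might look like, the sketch below keeps raw cell values and applies no number or date formatting. It is not a replacement for POI's handler. The class name RawValueHandler is hypothetical, the shared strings are assumed to be already loaded into an array, and the JDK SAX parser stands in for whichever parser is actually used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical lean handler: raw values only, no text formatting. */
class RawValueHandler extends DefaultHandler {
    // Assumption: shared strings were loaded beforehand from sharedStrings.xml.
    private final String[] sharedStrings;
    // One buffer reused for every cell instead of allocating a new one per cell.
    private final StringBuilder buf = new StringBuilder(64);
    private boolean inValue;
    private boolean isSharedString;

    // Collected here only to keep the sketch testable; for a large file the
    // values would be handed to a callback rather than accumulated.
    final List<String> values = new ArrayList<>();

    RawValueHandler(String[] sharedStrings) {
        this.sharedStrings = sharedStrings;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("c".equals(qName)) {
            // t="s" marks a cell whose value is an index into the shared strings.
            isSharedString = "s".equals(atts.getValue("t"));
        } else if ("v".equals(qName)) {
            inValue = true;
            buf.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inValue) {
            buf.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (!"v".equals(qName)) {
            return;
        }
        inValue = false;
        String raw = buf.toString();
        // Numbers stay as the raw text stored in the file; nothing is formatted.
        values.add(isSharedString ? sharedStrings[Integer.parseInt(raw)] : raw);
    }

    /** Convenience entry point for trying the handler on an XML string. */
    static RawValueHandler parse(String sheetXml, String[] sharedStrings) throws Exception {
        RawValueHandler handler = new RawValueHandler(sharedStrings);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(sheetXml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler;
    }
}
```

The trade-off is that dates and formatted numbers come back as the raw values stored in the file, so any formatting the caller needs has to be applied selectively afterwards.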