问题
I need to query a spreadsheet file in Java. I'm using Apache MetaModel.
I imported it with maven using
<dependency>
<groupId>org.apache.metamodel</groupId>
<artifactId>MetaModel-excel</artifactId>
<version>4.5.2</version>
</dependency>
Everything works fine but the next()
instruction when it's supposed to return false takes a few seconds, why?
import org.apache.metamodel.DataContext;
import org.apache.metamodel.excel.ExcelDataContext;
import org.apache.metamodel.schema.Schema;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Table;
import org.apache.metamodel.query.Query;
import org.apache.metamodel.query.OperatorType;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.data.Row;
import org.apache.metamodel.MetaModelException;
public class SpreadsheetReader {
private File spreadsheet;
public SpreadsheetReader(String spreadsheetLocation, String spreadsheetName){
this.spreadsheet = new File( spreadsheetLocation + spreadsheetName );
if( !"OK".equals( checkSpreadSheet() ) ){
throw new IllegalStateException("Error in spreadsheet. Cause: "+spreadsheetStatus);
}
}
/** query the excel spreadsheet for the given ID
*/
public List<String> query( String givenProgId ){
List<String> linksArray = new ArrayList<String>();
int rowCount = 0;
ExcelConfiguration conf = new ExcelConfiguration( 1, true, true ); // columnNameLineNumber, skipEmptyLines, sEColumns
DataContext dataContext = new ExcelDataContext( this.spreadsheet, conf );
System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" START-1" ); // ##
Schema schema = dataContext.getDefaultSchema();
System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" STOP-1" ); // ##
// Takes 2 seconds. Will be moved into constructor.
Table table = schema.getTables()[0];
Column idsColumn = table.getColumnByName("ProgID");
Column titlesColumn = table.getColumnByName("Titles");
Query query = new Query().select(titlesColumn)
.from(table)
.where(idsColumn, OperatorType.EQUALS_TO, givenProgId);
try( DataSet dataSet = dataContext.executeQuery(query) ){ // try-with-resource, no need to close dataset
while (dataSet.next()) {
// the rows are read quite quickly, problem will be when next() is false
++rowCount;
Row currentRow = dataSet.getRow();
String currentTitle = (String)currentRow.getValue(0);
linksArray.add( "my-service/titles/"+currentTitle );
System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" START-2" ); // @@@@@@@
}
System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" STOP-2" ); // @@@@@@@
// TAKES ABOUT 6 SECONDS - (Excel file has just 14.779 rows and 114 columns)
}catch(MetaModelException xx){
//logger
throw xx;
}
return linksArray;
}
}}
UPDATE: Some more profiling with a spreadsheet document with just 3 entries:
Code now is:
try( DataSet dataSet = this.dataContext.executeQuery(query) ){
// FIRST NEXT() with result => quite fast
System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" START a\n" );
System.out.println( "\n 88888 NEXT >>> "+(dataSet.next())+" <<<< \n" );
Row currentRow = dataSet.getRow();
String currentTitle = (String)currentRow.getValue(0);
System.out.println( "\n READ: "+(new org.joda.time.DateTime())+" >>> "+currentTitle+" \n" );
System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" STOP a\n" );
// SECOND AND LAST NEXT() => very SLOW
System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" START b\n" );
System.out.println( "\n 88888 NEXT >>> "+(dataSet.next())+" <<<< \n" );
System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" STOP b\n" );
}
And the spreadsheet is pre-loaded in the class constructor.
Logs (with times) for the last of a series of subsequent identical queries are:
Jun 30, 2016 10:59:38 AM log my-project.logging.ITVLogger
INFO: CODE00012 - Query on spreadsheet started for ID 123456
PROFILING >>> 2016-06-30T10:59:38.651+01:00 START a
10:59:38.652 [main] INFO o.a.m.d.RowPublisherDataSet -
Starting separate thread for publishing action: org.apache.metamodel.excel.XlsxRowPublisherAction@4977e527
88888 NEXT >>> true <<<<
READ: 2016-06-30T10:59:39.756+01:00 >>> A_TITLE
PROFILING >>> 2016-06-30T10:59:39.756+01:00 STOP a
PROFILING >>> 2016-06-30T10:59:39.756+01:00 START b
88888 NEXT >>> false <<<<
PROFILING >>> 2016-06-30T10:59:44.735+01:00 STOP b
So to recap it takes about one second to retrieve the result and 4~6 seconds for the last execution of next()
.
回答1:
The excel DataContext implementation does it's unwrapping and parsing of the zipped .xlsx file in the background. This means that the DataContext.executeQuery(...)
method returns quickly but the DataSet.next()
call that comes right after it has to wait for the data to be available in memory. I don't see any way that you can avoid that, it's just a consequence of Excel files being rather complex things to deal with.
来源:https://stackoverflow.com/questions/38105932/apache-metamodel-bad-performance-querying-spreadsheet