Apache MetaModel - bad performance querying spreadsheet

不羁的心 提交于 2019-12-18 09:43:12

问题


I need to query a spreadsheet file in Java. I'm using Apache MetaModel.

I imported it with maven using

<dependency>
    <groupId>org.apache.metamodel</groupId>
    <artifactId>MetaModel-excel</artifactId>
    <version>4.5.2</version>
</dependency>

Everything works fine but the next() instruction when it's supposed to return false takes a few seconds, why?

import org.apache.metamodel.DataContext;
import org.apache.metamodel.excel.ExcelDataContext;
import org.apache.metamodel.schema.Schema;
import org.apache.metamodel.schema.Column; 
import org.apache.metamodel.schema.Table; 
import org.apache.metamodel.query.Query;
import org.apache.metamodel.query.OperatorType; 
import org.apache.metamodel.data.DataSet; 
import org.apache.metamodel.data.Row; 
import org.apache.metamodel.MetaModelException;


public class SpreadsheetReader {

    private File spreadsheet;


    public SpreadsheetReader(String spreadsheetLocation, String spreadsheetName){

        this.spreadsheet = new File( spreadsheetLocation + spreadsheetName );

        if( !"OK".equals( checkSpreadSheet() ) ){
            throw new IllegalStateException("Error in spreadsheet. Cause: "+spreadsheetStatus);
        }

    }


    /** query the excel spreadsheet for the given  ID
    */
    public List<String> query( String givenProgId ){

        List<String> linksArray = new ArrayList<String>();
        int rowCount = 0;

        ExcelConfiguration conf = new ExcelConfiguration( 1, true, true ); // columnNameLineNumber, skipEmptyLines, sEColumns
        DataContext dataContext = new ExcelDataContext( this.spreadsheet, conf );

        System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" START-1" ); // ## 

        Schema schema = dataContext.getDefaultSchema();

        System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" STOP-1" ); // ## 
       // Takes 2 seconds. Will be moved into constructor.

        Table table = schema.getTables()[0];

        Column idsColumn = table.getColumnByName("ProgID");
        Column titlesColumn = table.getColumnByName("Titles");

        Query query = new Query().select(titlesColumn)
                                 .from(table)
                                 .where(idsColumn, OperatorType.EQUALS_TO, givenProgId);

        try( DataSet dataSet = dataContext.executeQuery(query) ){ // try-with-resource, no need to close dataset

            while (dataSet.next()) {

                // the rows are read quite quickly, problem will be when next() is false

                ++rowCount;

                Row currentRow = dataSet.getRow();
                String currentTitle = (String)currentRow.getValue(0);

                linksArray.add( "my-service/titles/"+currentTitle );

                System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" START-2" ); // @@@@@@@
            }
            System.out.println( "PROFILING >>> "+(new org.joda.time.DateTime())+" STOP-2" ); // @@@@@@@ 
            // TAKES ABOUT 6 SECONDS - (Excel file has just 14.779 rows and 114 columns)

        }catch(MetaModelException xx){
            //logger
            throw xx;
        }

        return linksArray;
    }
}}

UPDATE: Some more profiling with a spreadsheet document with just 3 entries:

Code now is:

try( DataSet dataSet = this.dataContext.executeQuery(query) ){


    // FIRST NEXT() with result => quite fast

    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" START a\n" );
    System.out.println( "\n 88888 NEXT >>> "+(dataSet.next())+" <<<< \n" ); 
    Row currentRow = dataSet.getRow();
    String currentTitle = (String)currentRow.getValue(0);
    System.out.println( "\n READ: "+(new org.joda.time.DateTime())+" >>> "+currentTitle+" \n" );

    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" STOP a\n" );


    // SECOND AND LAST NEXT() => very SLOW

    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" START b\n" );
    System.out.println( "\n 88888 NEXT >>> "+(dataSet.next())+" <<<< \n" );
    System.out.println( "\n PROFILING >>> "+(new org.joda.time.DateTime())+" STOP b\n" );

}

And the spreadsheet is pre-loaded in the class constructor.

Logs (with times) for the last of a series of subsequent identical queries are:

Jun 30, 2016 10:59:38 AM log my-project.logging.ITVLogger
INFO: CODE00012 - Query on spreadsheet started for ID 123456



PROFILING >>> 2016-06-30T10:59:38.651+01:00 START a

10:59:38.652 [main] INFO  o.a.m.d.RowPublisherDataSet - 
Starting separate thread for publishing action: org.apache.metamodel.excel.XlsxRowPublisherAction@4977e527

88888 NEXT >>> true <<<< 

READ: 2016-06-30T10:59:39.756+01:00 >>> A_TITLE

PROFILING >>> 2016-06-30T10:59:39.756+01:00 STOP a



PROFILING >>> 2016-06-30T10:59:39.756+01:00 START b

88888 NEXT >>> false <<<< 

PROFILING >>> 2016-06-30T10:59:44.735+01:00 STOP b

So to recap it takes about one second to retrieve the result and 4~6 seconds for the last execution of next().


回答1:


The excel DataContext implementation does it's unwrapping and parsing of the zipped .xlsx file in the background. This means that the DataContext.executeQuery(...) method returns quickly but the DataSet.next() call that comes right after it has to wait for the data to be available in memory. I don't see any way that you can avoid that, it's just a consequence of Excel files being rather complex things to deal with.



来源:https://stackoverflow.com/questions/38105932/apache-metamodel-bad-performance-querying-spreadsheet

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!