Question
My boss has a habit of performing queries on our databases that return tens of thousands of rows and saving them into excel files. I, being the intern, constantly have to write scripts that work with the information from these files. Thus far I've tried VBScript and Powershell for my scripting needs. Both of these can take several minutes to perform even the simplest of tasks, which would mean that the script when finished would take most of an 8 hour day.
My workaround right now is simply to write a PowerShell script that removes all of the commas and newline characters from the .xlsx file, saves it as .csv, and then have a Java program handle the data gathering and output, with the script cleaning up the .csv files when finished. This runs in a matter of seconds for my current project, but I can't help wondering whether there's a more elegant alternative for my next one. Any suggestions?
Answer 1:
I kept getting all kinds of weird errors when working with .xlsx files.
Here's a simple example of using Apache POI to traverse an .xlsx
file. See also Upgrading to POI 3.5, including converting existing HSSF Usermodel code to SS Usermodel (for XSSF and HSSF).
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.usermodel.FormulaEvaluator;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XlsxReader {

    public static void main(String[] args) throws IOException {
        InputStream myxls = new FileInputStream("test.xlsx");
        Workbook book = new XSSFWorkbook(myxls);
        FormulaEvaluator eval = book.getCreationHelper().createFormulaEvaluator();
        // Walk every row and cell of the first sheet.
        Sheet sheet = book.getSheetAt(0);
        for (Row row : sheet) {
            for (Cell cell : row) {
                printCell(cell, eval);
                System.out.print("; ");
            }
            System.out.println();
        }
        myxls.close();
    }

    // Print a cell's value according to its declared type.
    private static void printCell(Cell cell, FormulaEvaluator eval) {
        switch (cell.getCellType()) {
            case Cell.CELL_TYPE_BLANK:
                System.out.print("EMPTY");
                break;
            case Cell.CELL_TYPE_STRING:
                System.out.print(cell.getStringCellValue());
                break;
            case Cell.CELL_TYPE_NUMERIC:
                if (DateUtil.isCellDateFormatted(cell)) {
                    System.out.print(cell.getDateCellValue());
                } else {
                    System.out.print(cell.getNumericCellValue());
                }
                break;
            case Cell.CELL_TYPE_BOOLEAN:
                System.out.print(cell.getBooleanCellValue());
                break;
            case Cell.CELL_TYPE_FORMULA:
                System.out.print(cell.getCellFormula());
                break;
            default:
                System.out.print("DEFAULT");
        }
    }
}
Answer 2:
Your goal is to do "data transformation" on your Excel files.
To solve this, I would use a dedicated ETL (Extract, Transform, Load) tool, such as Talend Open Studio.
You just have to drop in an "Excel Input" component, a "data transform" component, and a "CSV output" component. Talend will convert this functional description of your problem into Java code; then you just execute that program.
Answer 3:
I personally would use Python for this. I have found that it runs fast enough to not be a noticeable problem.
If you don't want to worry about a new language, why not just use Java for the entire thing? Removing commas and newlines is pretty trivial in Java and it would save you a step.
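As a rough sketch of that suggestion: cleaning the characters that break a naive CSV format is a couple of calls on Java's String class (the sample values below are made up for illustration):

```java
public class CleanFields {

    // Replace commas and newlines so the value can sit in a naive CSV column.
    static String clean(String field) {
        return field.replace(",", " ").replace("\r", " ").replace("\n", " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("Smith, John\nACME Corp")); // prints "Smith  John ACME Corp"
    }
}
```

Doing this inside the same Java program that gathers the data removes the separate PowerShell pre-processing step entirely.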
Answer 4:
You should always think about the future of your code...
Who will maintain your script in the future? Does your company have any other developers that are familiar with PowerShell/VBScript?
I would have to say that you should stick to one language that fits your (and your company's) needs. As Nathan suggested, Python would be a great choice for creating fast scripts.
And one more thing: if you have any control over the SQL statements your boss runs, you can have him produce output that eases the development of your parsers and makes them much simpler.
Good luck!
Tal.
Answer 5:
In addition to trashgod's answer, for large files I'd suggest POI's SXSSF API, available since POI 3.8 beta 3 (http://poi.apache.org/spreadsheet/). SXSSF writes large files in a streaming fashion, keeping only a sliding window of rows in memory, which helps avoid out-of-memory errors.
Details on SXSSF: http://poi.apache.org/spreadsheet/how-to.html#sxssf
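A minimal sketch of the streaming-write side (assumes the POI 3.8+ jars are on the classpath; the file name and row counts are arbitrary):

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class SxssfDemo {

    public static void main(String[] args) throws Exception {
        // Keep at most 100 rows in memory; older rows are flushed to temp files on disk.
        SXSSFWorkbook wb = new SXSSFWorkbook(100);
        Sheet sheet = wb.createSheet("data");
        for (int r = 0; r < 100_000; r++) {
            Row row = sheet.createRow(r);
            for (int c = 0; c < 10; c++) {
                Cell cell = row.createCell(c);
                cell.setCellValue("cell " + r + ":" + c);
            }
        }
        FileOutputStream out = new FileOutputStream("big.xlsx");
        wb.write(out);
        out.close();
        wb.dispose(); // delete the temporary files backing the stream
    }
}
```

Note that SXSSF streams on the write side; for streaming reads of huge .xlsx files, POI's XSSF event (SAX) API is the counterpart.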
Answer 6:
You could use Apache POI (Java) to access the .xlsx files directly.
Answer 7:
If you save the file as a CSV, you can use any language you want to parse it.
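For instance, in Java a simple CSV line splits in one call. This sketch assumes fields contain no quoted commas or embedded newlines (which the question's pre-processing already strips); fully general CSV needs a real parser:

```java
import java.util.Arrays;
import java.util.List;

public class CsvSplit {

    // Adequate only when fields contain no quoted commas;
    // the -1 limit keeps trailing empty fields.
    static List<String> parseLine(String line) {
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) {
        System.out.println(parseLine("id,name,amount")); // prints [id, name, amount]
    }
}
```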
Answer 8:
You can import the data into an embedded database, e.g. Apache Derby or HSQLDB (http://hsqldb.org/). Depending on the nature of your queries, it can be a little bit faster. More importantly, it will save you a lot of time if your boss requests new features often: you will simply write most of your new functionality in SQL.
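A rough sketch of that idea with an in-memory HSQLDB database (assumes the HSQLDB jar is on the classpath; the table, column names, and sample values are made up for illustration):

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedDbDemo {

    public static void main(String[] args) throws Exception {
        // jdbc:hsqldb:mem: keeps everything in memory; a file: URL persists it.
        Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:report", "SA", "");
        Statement st = conn.createStatement();
        st.execute("CREATE TABLE orders (id INT, customer VARCHAR(64), total DECIMAL(10,2))");

        // In practice, each row parsed from the spreadsheet/CSV would be bound here.
        PreparedStatement ins = conn.prepareStatement("INSERT INTO orders VALUES (?, ?, ?)");
        ins.setInt(1, 1);
        ins.setString(2, "ACME");
        ins.setBigDecimal(3, new BigDecimal("19.99"));
        ins.executeUpdate();

        System.out.println(totalFor("ACME"));
    }

    // A new reporting request becomes a plain SQL query instead of new parsing code.
    static BigDecimal totalFor(String customer) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:report", "SA", "");
        PreparedStatement q = conn.prepareStatement(
                "SELECT SUM(total) FROM orders WHERE customer = ?");
        q.setString(1, customer);
        ResultSet rs = q.executeQuery();
        rs.next();
        return rs.getBigDecimal(1);
    }
}
```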
Answer 9:
If you need advanced analysis, beyond grouping, joining, and filtering, just go for free data-mining tools such as Weka, RapidMiner (based on Weka but with a nicer GUI), or KNIME. These tools have very nice interfaces and provide operators for reading CSV files. You can also run the RapidMiner and Weka libraries inside your Java program. If not, go for an embedded database as proposed above.
Using Apache POI is not a bad idea, but personally I prefer to use it only to read the Excel file before loading it into, e.g., a database.
Regarding the language: the best language I have found for ad-hoc tasks is Groovy. It is a scripting language on top of Java, so you can use all of the Java libraries (POI, JDBC drivers, ... a very long list) and mix Groovy classes with Java classes.
Answer 10:
There are two options for parsing Excel (.xlsx or .xls) files:
1. Use the Apache POI API to extract the data; recent versions of POI have improved noticeably in speed.
2. Convert the Excel file to Office Open XML and then write an XSLT stylesheet to transform it. This should work even for very long Excel files.
Source: https://stackoverflow.com/questions/3560950/best-language-to-parse-extremely-large-excel-2007-files