How to extract a list of external references from an Excel file

Posted by 故事扮演 on 2020-03-20 06:43:46

Question


I'm setting up an automated processing system for an ever-growing, unstructured collection of Excel documents. The collection consists of both old-style .xls files and newer .xlsx files. My Java-based solution already uses the Apache POI toolkit to analyse the documents.

One challenge I have not been able to tackle yet is how to identify links between documents so that I can chart their dependencies. I have not found a convenient way to extract a list of external references. For .xlsx files I have a workaround in place that unzips the file and opens the XML file holding the references. This works, but it is inefficient for large document collections and does not provide a solution for .xls files.
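To illustrate the .xlsx workaround described above, here is a minimal sketch that uses only the JDK. It assumes the links are stored as relationship parts under xl/externalLinks/_rels/ and are marked with TargetMode="External". The class and method names are made up for this example, and it is not necessarily the exact code used in the setup described here.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XlsxExternalLinks {

    // Assumed location of the relationship parts that point at linked workbooks.
    private static final String RELS_PREFIX = "xl/externalLinks/_rels/";

    /** Returns the Target of every external relationship found under RELS_PREFIX. */
    public static List<String> listExternalTargets(String xlsxPath) throws Exception {
        List<String> targets = new ArrayList<>();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try (ZipFile zip = new ZipFile(xlsxPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                // Skip everything except the external-link relationship parts.
                if (!name.startsWith(RELS_PREFIX) || !name.endsWith(".rels")) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    NodeList rels = factory.newDocumentBuilder().parse(in)
                            .getElementsByTagName("Relationship");
                    for (int i = 0; i < rels.getLength(); i++) {
                        Element rel = (Element) rels.item(i);
                        // Only relationships flagged as external point at other files.
                        if ("External".equals(rel.getAttribute("TargetMode"))) {
                            targets.add(rel.getAttribute("Target"));
                        }
                    }
                }
            }
        }
        return targets;
    }
}
```

Reading the entries straight from the archive avoids unpacking each file to disk, but every workbook still has to be opened and scanned, and the approach does nothing for the binary .xls format.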

I would prefer a solution that does not depend on Microsoft Office or associated libraries, as it needs to run in a Linux environment.

Is POI capable of doing this somehow? If not, which libraries, tools, or areas would you suggest I investigate further?


Answer 1:


Ultimately I worked my way through the POI source code and used reflection to get a list of referenced external workbooks. The following code was tested to work on POI version 3.11 beta.

Note for people looking to use this method in their code: because it relies on non-public methods and classes, it is subject to change and may break in future POI versions.

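import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.LinkedList;
import org.apache.poi.hssf.model.InternalWorkbook;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

// Place this method inside a class of your own.
private LinkedList<String> getWorkbookReferences(HSSFWorkbook wb) {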
    LinkedList<String> references = new LinkedList<>();

    try {
        // 1. Get InternalWorkbook
        Field internalWorkbookField = HSSFWorkbook.class.getDeclaredField("workbook");
        internalWorkbookField.setAccessible(true);
        InternalWorkbook internalWorkbook = (InternalWorkbook) internalWorkbookField.get(wb);

        // 2. Get LinkTable (hidden class)
        // The method takes no parameters, so no argument types or values are passed.
        Method getLinkTableMethod = InternalWorkbook.class.getDeclaredMethod("getOrCreateLinkTable");
        getLinkTableMethod.setAccessible(true);
        Object linkTable = getLinkTableMethod.invoke(internalWorkbook);

        // 3. Get external books method
        Method externalBooksMethod = linkTable.getClass().getDeclaredMethod("getExternalBookAndSheetName", int.class);
        externalBooksMethod.setAccessible(true);

        // 4. Loop over all possible workbooks
        int i = 0;
        String[] names;
        try {
            while (true) {
                names = (String[]) externalBooksMethod.invoke(linkTable, i++);
                if (names != null) {
                    references.add(names[0]);
                }
            }
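        } catch (InvocationTargetException e) {
            // The loop is expected to end with an IndexOutOfBoundsException once
            // the index runs past the last reference; rethrow anything else.
            if (!(e.getCause() instanceof IndexOutOfBoundsException)) {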
                throw e;
            }
        }
    } catch (NoSuchFieldException | NoSuchMethodException | SecurityException | InvocationTargetException | IllegalAccessException e) {
        e.printStackTrace();
    }

    return references;
}


Source: https://stackoverflow.com/questions/26758099/how-to-extract-a-list-of-external-references-from-a-excel-file
