How to retrieve tables that exist in a PDF using AWS Textract in Java

Posted by £可爱£侵袭症+ on 2020-06-25 19:00:39

Question


I found the article below, which shows how to do this in Python:

https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html

I also used the article below to extract text:

https://docs.aws.amazon.com/textract/latest/dg/detecting-document-text.html

However, that article only helped me get plain text. I also checked "block.getBlockType()" on every Block, but no block reported its type as "CELL", even though the image/PDF contains tables.

Please help me find a Java library similar to "boto3" that can extract all tables.


Answer 1:


What I did: I created a model class for each data set in the JSON response, and I use these models to build a table view in JSF.
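The model classes themselves are not shown in the answer. The sketch below is one guess at what they might look like: the class names match those used in the answer's code, but every field and accessor is an assumption derived from the fields of a Textract Block in the JSON response (Id, BlockType, Text, SelectionStatus, RowIndex, ColumnIndex, Relationships). It also assumes the JSON came from the AnalyzeDocument operation with the TABLES feature type, since DetectDocumentText returns only PAGE, LINE and WORD blocks, which would explain why no "CELL" block showed up in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical POJO for one entry of a block's "Relationships" array.
class RelationshipModel {
    private String type;                          // e.g. "CHILD"
    private List<String> ids = new ArrayList<>(); // ids of the related blocks

    public String getType() { return type; }
    public void setType(String type) { this.type = type; }
    public List<String> getIds() { return ids; }
    public void setIds(List<String> ids) { this.ids = ids; }
}

// Hypothetical POJO for one Textract block (TABLE, CELL, WORD, ...).
class BlockModel {
    private String id;
    private String blockType;
    private String text;            // set on WORD and LINE blocks
    private String selectionStatus; // set on SELECTION_ELEMENT blocks
    private Long rowIndex;          // set on CELL blocks, starts at 1
    private Long columnIndex;       // set on CELL blocks, starts at 1
    private List<RelationshipModel> relationships = new ArrayList<>();

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getBlockType() { return blockType; }
    public void setBlockType(String blockType) { this.blockType = blockType; }
    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
    public String getSelectionStatus() { return selectionStatus; }
    public void setSelectionStatus(String s) { this.selectionStatus = s; }
    public Long getRowIndex() { return rowIndex; }
    public void setRowIndex(Long rowIndex) { this.rowIndex = rowIndex; }
    public Long getColumnIndex() { return columnIndex; }
    public void setColumnIndex(Long columnIndex) { this.columnIndex = columnIndex; }
    public List<RelationshipModel> getRelationships() { return relationships; }
    public void setRelationships(List<RelationshipModel> r) { this.relationships = r; }
}

// Hypothetical root object holding the "Blocks" array of the response.
class TextractModel {
    private List<BlockModel> blocks = new ArrayList<>();

    public List<BlockModel> getBlocks() { return blocks; }
    public void setBlocks(List<BlockModel> blocks) { this.blocks = blocks; }
}
```

How the JSON is bound to these classes (Jackson, Gson, or something else) is left open here, because the answer does not say which mapper it uses.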

// Needs java.util.ArrayList, HashMap, List and Map.
// LOG is a logger field of the enclosing class (not shown here).
public static List<TableModel> getTablesFromTextract(TextractModel textractModel) {
    List<TableModel> tables = null;

    try {

        if (textractModel != null) {
            tables = new ArrayList<>();
            List<BlockModel> tableBlocks = new ArrayList<>();
            Map<String, BlockModel> blockMap = new HashMap<>();

            // Index every block by id and collect the TABLE blocks.
            for (BlockModel block : textractModel.getBlocks()) {

                if ("TABLE".equals(block.getBlockType())) {
                    tableBlocks.add(block);
                }
                blockMap.put(block.getId(), block);
            }

            for (BlockModel blockModel : tableBlocks) {

                // rowIndex -> (columnIndex -> cell text)
                Map<Long, Map<Long, String>> rowMap = new HashMap<>();

                // A table without any cells may carry no relationships.
                if (blockModel.getRelationships() != null) {

                    for (RelationshipModel relationship : blockModel.getRelationships()) {

                        if (!"CHILD".equals(relationship.getType())) {
                            continue;
                        }

                        // The CHILD ids of a TABLE block point at its CELL blocks.
                        for (String id : relationship.getIds()) {

                            BlockModel cell = blockMap.get(id);

                            if (cell != null && "CELL".equals(cell.getBlockType())) {

                                long rowIndex = cell.getRowIndex();
                                long columnIndex = cell.getColumnIndex();

                                rowMap.computeIfAbsent(rowIndex, key -> new HashMap<>())
                                        .put(columnIndex, getCellText(cell, blockMap));
                            }
                        }
                    }
                }
                tables.add(new TableModel(blockModel, rowMap));
            }
            LOG.debug("row Map " + tables);
        }
    } catch (Exception e) {
        LOG.error("Could not get table from textract model", e);
    }
    return tables;
}

// CollectionUtils is presumably the Apache Commons Collections class.
private static String getCellText(BlockModel cell, Map<String, BlockModel> blockMap) {
    StringBuilder text = new StringBuilder();

    try {

        if (cell != null
                && CollectionUtils.isNotEmpty(cell.getRelationships())) {

            for (RelationshipModel relationship : cell.getRelationships()) {

                if (!"CHILD".equals(relationship.getType())) {
                    continue;
                }

                // The CHILD ids of a CELL block point at its WORD and
                // SELECTION_ELEMENT blocks.
                for (String id : relationship.getIds()) {

                    BlockModel word = blockMap.get(id);

                    if (word == null) {
                        continue;
                    }

                    if ("WORD".equals(word.getBlockType())) {
                        text.append(word.getText()).append(' ');
                    } else if ("SELECTION_ELEMENT".equals(word.getBlockType())
                            && "SELECTED".equals(word.getSelectionStatus())) {
                        // Render a ticked checkbox as "X".
                        text.append("X ");
                    }
                }
            }
        }

    } catch (Exception e) {
        LOG.error("Could not get cell text of table", e);
    }
    return text.toString();
}

The TableModel used to build the view:

public class TableModel {

    // The TABLE block this model was built from.
    private BlockModel table;
    // rowIndex -> (columnIndex -> cell text)
    private Map<Long, Map<Long, String>> rowMap;

    public TableModel(BlockModel table, Map<Long, Map<Long, String>> rowMap) {
        this.table = table;
        this.rowMap = rowMap;
    }

    public BlockModel getTable() {
        return table;
    }

    public void setTable(BlockModel table) {
        this.table = table;
    }

    public Map<Long, Map<Long, String>> getRowMap() {
        return rowMap;
    }

    public void setRowMap(Map<Long, Map<Long, String>> rowMap) {
        this.rowMap = rowMap;
    }

    @Override
    public String toString() {
        return table.getId() + " - " + rowMap.toString();
    }
}
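Because rowMap is built from HashMaps, iterating over it does not guarantee that rows and columns come back in table order. As a rough sketch of how the map could be turned into CSV (the goal of the Python article linked in the question), the hypothetical helper below walks the rows in sorted order and fills in any missing cells. The names TableCsv, toCsv and quote are not part of the answer, and the code assumes column indexes start at 1, as they do in Textract responses.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical helper, not part of the original answer.
class TableCsv {

    // Renders rowIndex -> (columnIndex -> text) as CSV, one line per row.
    static String toCsv(Map<Long, Map<Long, String>> rowMap) {
        // Find the widest row so that every line gets the same column count.
        long maxColumn = 0;
        for (Map<Long, String> columns : rowMap.values()) {
            for (Long columnIndex : columns.keySet()) {
                maxColumn = Math.max(maxColumn, columnIndex);
            }
        }

        StringBuilder csv = new StringBuilder();
        // A TreeMap copy iterates the rows in ascending row index order.
        for (Map<Long, String> columns
                : new TreeMap<Long, Map<Long, String>>(rowMap).values()) {
            StringJoiner line = new StringJoiner(",");
            for (long column = 1; column <= maxColumn; column++) {
                String cell = columns.get(column);
                // Missing cells become empty fields; trim the trailing space
                // that the word concatenation leaves behind.
                line.add(quote(cell == null ? "" : cell.trim()));
            }
            csv.append(line).append('\n');
        }
        return csv.toString();
    }

    // Quotes a field only when it contains a comma, a quote or a line break.
    static String quote(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }
}
```

With the TableModel from the answer, this could be called as TableCsv.toCsv(tableModel.getRowMap()). Cells that span several rows or columns are not treated specially here and would need extra handling.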


Source: https://stackoverflow.com/questions/61086945/how-to-retrieve-tables-which-exists-in-a-pdf-using-aws-textract-in-java
