How do I skip whitespace-only lines and lines with a variable number of columns using Super CSV?

一向 2021-01-25 00:56

I am working on a CSV parsing requirement and I am using the Super CSV parser library. My CSV file can have 25 columns (separated by tab(|)) and up to 100k rows with additional header ro

2 answers
  •  忘了有多久
    2021-01-25 01:24

    You can easily do this by writing your own Tokenizer.

    For example, the following Tokenizer will have the same behaviour as the default one, but will skip over any lines that don't have the correct number of columns.

    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.List;

    import org.supercsv.io.Tokenizer;
    import org.supercsv.prefs.CsvPreference;

    public class SkipBadColumnCountTokenizer extends Tokenizer {

        private final int expectedColumns;

        // Line numbers of every row skipped because of a wrong column count.
        private final List<Integer> ignoredLines = new ArrayList<>();

        public SkipBadColumnCountTokenizer(Reader reader,
                CsvPreference preferences, int expectedColumns) {
            super(reader, preferences);
            this.expectedColumns = expectedColumns;
        }

        @Override
        public boolean readColumns(List<String> columns) throws IOException {
            boolean moreInputExists;
            // Keep reading until a row has the expected column count or the input ends.
            while ((moreInputExists = super.readColumns(columns)) &&
                columns.size() != this.expectedColumns){
                System.out.println(String.format("Ignoring line %s with %d columns: %s", getLineNumber(), columns.size(), getUntokenizedRow()));
                ignoredLines.add(getLineNumber());
            }

            return moreInputExists;
        }

        public List<Integer> getIgnoredLines(){
            return this.ignoredLines;
        }
    }
    

    And a simple test using this Tokenizer...

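    import java.io.IOException;
    import java.io.StringReader;

    import org.junit.Test; // assumption: JUnit 4; the original does not say which version
    import org.supercsv.io.CsvBeanReader;
    import org.supercsv.io.ICsvBeanReader;
    import org.supercsv.prefs.CsvPreference;

    public class SkipBadColumnCountTokenizerTest {

        // TestBean is not shown in the answer; judging by the output it is a
        // simple bean with String properties column1, column2 and column3.
        @Test
        public void testInvalidRows() throws IOException {

            String input = "column1,column2,column3\n" +
                    "has,three,columns\n" +
                    "only,two\n" +
                    "one\n" +
                    "three,columns,again\n" +
                    "one,too,many,columns";

            CsvPreference preference = CsvPreference.EXCEL_PREFERENCE;
            int expectedColumns = 3;
            SkipBadColumnCountTokenizer tokenizer = new SkipBadColumnCountTokenizer(
                new StringReader(input), preference, expectedColumns);

            // Pass the custom tokenizer to the reader in place of the default one.
            try (ICsvBeanReader beanReader = new CsvBeanReader(tokenizer, preference)) {
                String[] header = beanReader.getHeader(true);
                TestBean bean;
                while ((bean = beanReader.read(TestBean.class, header)) != null){
                    System.out.println(bean);
                }
                System.out.println(String.format("Ignored lines: %s", tokenizer.getIgnoredLines()));
            }

        }
    }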
    

    This prints the following output (notice how all of the invalid rows are skipped):

    TestBean{column1='has', column2='three', column3='columns'}
    Ignoring line 3 with 2 columns: only,two
    Ignoring line 4 with 1 columns: one
    TestBean{column1='three', column2='columns', column3='again'}
    Ignoring line 6 with 4 columns: one,too,many,columns
    Ignored lines: [3, 4, 6]
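
The same skip logic can also be sketched in plain Java, independent of Super CSV, which may help when reasoning about whitespace-only lines as well as wrong column counts. This is only a rough illustration: `LineFilterSketch` and `keepValidRows` are hypothetical names, not part of any library, and the sketch assumes fields never contain the delimiter or quotes, so it uses a naive split where a real tokenizer would handle quoting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Super CSV: applies the skip rules to raw lines.
final class LineFilterSketch {

    private LineFilterSketch() {}

    /**
     * Returns the rows that have exactly expectedColumns fields.
     * Whitespace-only lines and rows with any other column count are skipped,
     * and their 1-based line numbers are appended to ignoredLines.
     * Assumption: no quoted fields, so a plain split on the delimiter suffices.
     */
    static List<String[]> keepValidRows(List<String> lines, char delimiter,
                                        int expectedColumns, List<Integer> ignoredLines) {
        List<String[]> rows = new ArrayList<>();
        // Quote the delimiter so characters such as '|' are not treated as regex syntax.
        String regex = Pattern.quote(String.valueOf(delimiter));
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            int lineNumber = i + 1;
            if (line.trim().isEmpty()) {
                // Empty or whitespace-only line.
                ignoredLines.add(lineNumber);
                continue;
            }
            // Limit -1 keeps trailing empty fields, so "a|b|" counts as three columns.
            String[] fields = line.split(regex, -1);
            if (fields.length != expectedColumns) {
                ignoredLines.add(lineNumber);
                continue;
            }
            rows.add(fields);
        }
        return rows;
    }
}
```

For example, calling `keepValidRows` with the lines `"a|b|c"`, `"   "`, `"only|two"` and `"d|e|f"`, a `'|'` delimiter and an expected count of 3 keeps the first and last rows and records lines 2 and 3 as ignored.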
    
