How do I skip whitespace-only lines and lines with a variable number of columns using Super CSV?

牧云@^-^@ submitted on 2019-12-02 04:10:42

You can easily do this by writing your own Tokenizer.

For example, the following Tokenizer will have the same behaviour as the default one, but will skip over any lines that don't have the correct number of columns.

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import org.supercsv.io.Tokenizer;
import org.supercsv.prefs.CsvPreference;

public class SkipBadColumnCountTokenizer extends Tokenizer {

    private final int expectedColumns;

    // Line numbers of the rows that were skipped because of a bad column count
    private final List<Integer> ignoredLines = new ArrayList<>();

    public SkipBadColumnCountTokenizer(Reader reader,
            CsvPreference preferences, int expectedColumns) {
        super(reader, preferences);
        this.expectedColumns = expectedColumns;
    }

    @Override
    public boolean readColumns(List<String> columns) throws IOException {
        boolean moreInputExists;
        // Keep reading until a row has the expected column count or the input ends
        while ((moreInputExists = super.readColumns(columns)) &&
            columns.size() != this.expectedColumns) {
            System.out.printf("Ignoring line %s with %d columns: %s%n",
                getLineNumber(), columns.size(), getUntokenizedRow());
            ignoredLines.add(getLineNumber());
        }

        return moreInputExists;
    }

    public List<Integer> getIgnoredLines() {
        return this.ignoredLines;
    }
}

And a simple test using this Tokenizer...

@Test
public void testInvalidRows() throws IOException {

    String input = "column1,column2,column3\n" +
            "has,three,columns\n" +
            "only,two\n" +
            "one\n" +
            "three,columns,again\n" +
            "one,too,many,columns";

    CsvPreference preference = CsvPreference.EXCEL_PREFERENCE;
    int expectedColumns = 3;
    SkipBadColumnCountTokenizer tokenizer = new SkipBadColumnCountTokenizer(
        new StringReader(input), preference, expectedColumns);

    try (ICsvBeanReader beanReader = new CsvBeanReader(tokenizer, preference)) {
        String[] header = beanReader.getHeader(true);
        TestBean bean;
        while ((bean = beanReader.read(TestBean.class, header)) != null){
            System.out.println(bean);
        }
        System.out.println(String.format("Ignored lines: %s", tokenizer.getIgnoredLines()));
    }

}

This prints the following output (note that all of the invalid rows have been skipped):

TestBean{column1='has', column2='three', column3='columns'}
Ignoring line 3 with 2 columns: only,two
Ignoring line 4 with 1 columns: one
TestBean{column1='three', column2='columns', column3='again'}
Ignoring line 6 with 4 columns: one,too,many,columns
Ignored lines: [3, 4, 6]

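(1) If the selection must be done by your Java program using Super CSV, then (and I quote) "you'll have to use CsvListReader". In particular, listReader.length() returns the number of columns in the row that was just read, so rows with an unexpected column count can be skipped.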

See this Super CSV page for details.
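
As a rough illustration of option (1), the sketch below reads rows with CsvListReader and keeps only those whose length() matches the expected column count. The class name ColumnCountFilter, the method readValidRows and the sample input are made up for this example, and STANDARD_PREFERENCE is an assumption; use whatever CsvPreference matches your file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.supercsv.io.CsvListReader;
import org.supercsv.io.ICsvListReader;
import org.supercsv.prefs.CsvPreference;

// Hypothetical helper class, not part of Super CSV
public class ColumnCountFilter {

    // Returns only the data rows that have exactly expectedColumns columns
    public static List<List<String>> readValidRows(Reader reader, int expectedColumns)
            throws IOException {
        List<List<String>> validRows = new ArrayList<>();
        try (ICsvListReader listReader =
                new CsvListReader(reader, CsvPreference.STANDARD_PREFERENCE)) {
            listReader.getHeader(true); // consume the header line
            List<String> row;
            while ((row = listReader.read()) != null) {
                // length() reports the column count of the row just read
                if (listReader.length() == expectedColumns) {
                    validRows.add(new ArrayList<>(row));
                }
            }
        }
        return validRows;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input: one header, two valid rows, one short row
        String input = "column1,column2,column3\n"
                + "has,three,columns\n"
                + "only,two\n"
                + "three,columns,again\n";
        System.out.println(readValidRows(new StringReader(input), 3));
    }
}
```

A line containing only white space should come back as a single column, so the same column-count check would drop it whenever more than one column is expected. The valid rows can then be mapped to beans by hand, since this approach bypasses CsvBeanReader.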

(2) If you can perform the selection by preprocessing the CSV file, then you might wish to consider a suitable command-line tool (or tools, depending on how complicated the CSV format is). If the delimiter of the CSV file does not occur within any field, then awk would suffice. For example, if that assumption is satisfied, the delimiter is |, and valid rows have 25 fields, then an awk filter that keeps only those rows could be as simple as:

awk -F'|' 'NF == 25 {print}'

If the CSV file format is too complex for a naive application of awk, then you may wish to convert the complex format to a simpler one; often TSV has much to recommend it.
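
If you would rather do this preprocessing in Java than in awk, the awk filter above can be approximated with the standard library alone. This is a minimal sketch under the same assumption (the delimiter never occurs inside a field); the class name FieldCountFilter and the method keepLinesWithFieldCount are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class: a plain-Java stand-in for awk -F'|' 'NF == n'
public class FieldCountFilter {

    // Keeps only the lines that split into exactly expectedFields fields.
    // Assumes the delimiter never appears inside a field (no quoting support).
    public static List<String> keepLinesWithFieldCount(
            List<String> lines, String delimiter, int expectedFields) {
        // Quote the delimiter so that characters like | are not treated as regex syntax
        String regex = Pattern.quote(delimiter);
        return lines.stream()
                // limit -1 keeps trailing empty fields, as awk does.
                // Note: an empty line counts as one empty field here,
                // whereas awk reports NF == 0 for it.
                .filter(line -> line.split(regex, -1).length == expectedFields)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample lines: only the first has three fields
        List<String> lines = Arrays.asList("a|b|c", "only|two", "   ", "one|too|many|fields");
        System.out.println(keepLinesWithFieldCount(lines, "|", 3));
    }
}
```

Because a whitespace-only line splits into a single field, it is dropped by the same check whenever more than one field is expected.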
