Error Parsing due to CSV Differences Before/After Saving (Java w/ Apache Commons CSV)

Question

I have a 37-column CSV file that I am parsing in Java with Apache Commons CSV 1.2. My setup code is as follows:

//initialize FileReader object
FileReader fileReader = new FileReader(file);

//initialize CSVFormat object
CSVFormat csvFileFormat = CSVFormat.DEFAULT.withHeader(FILE_HEADER_MAPPING);

//initialize CSVParser object
CSVParser csvFileParser = new CSVParser(fileReader, csvFileFormat);

//Get a list of CSV file records
List<CSVRecord> csvRecords = csvFileParser.getRecords();

// process accordingly

My problem is that when I copy the CSV to be processed to my target directory and run my parsing program, I get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: Index for header 'Title' is 7 but CSVRecord only has 6 values!
        at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:110)
        at launcher.QualysImport.createQualysRecords(Unknown Source)
        at launcher.QualysImport.importQualysRecords(Unknown Source)
        at launcher.Main.main(Unknown Source)

However, if I copy the file to my target directory, open and save it, then try the program again, it works. Opening and saving the CSV adds back the trailing commas, so my program no longer complains that a record has fewer values than there are headers.

For context, here is a sample line of before/after saving:

Before (failing): "data","data","data","data"

After (working): "data","data",,,,"data",,,"data",,,,,,

So my question: why does the CSV format change when I open and save it? I'm not changing any values or encoding, and the behavior is the same for MS-DOS or regular .csv format when saving. Also, I'm using Excel to copy/open/save in my testing.

Is there some encoding or format setting I need to be using? Can I solve this programmatically?

Thanks in advance!

EDIT #1:

For additional context, when I first view an empty line in the original file, it just has the new line ^M character like this:

^M

After opening in Excel and saving, it looks like this with all 37 of my empty fields:

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,^M

Is this a Windows encoding discrepancy?

Answer 1:

Maybe that's a compatibility issue with whatever generated the file in the first place. It seems that Excel accepts a blank line as a valid row with an empty string in each column, padding it to the column count of the other rows. When saving, it writes the row out following CSV conventions, with a delimiter between every column. (The ^M is the carriage return character; on Microsoft systems it precedes the line feed character at the end of each line in text files.)

Perhaps you can deal with it by creating your own Reader subclass to sit between the FileReader and the CSVParser. Your reader will read a line, and if it is blank then return a line with the correct number of commas. Otherwise just return the line as-is.

For example:

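import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

class MyCSVCompatibilityReader extends BufferedReader
    {
    // The row Excel writes for an empty record in this file.
    private static final String EMPTY_ROW = ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,";

    public MyCSVCompatibilityReader(final Reader reader)
        {
        // BufferedReader has no no-arg constructor, so wrap the reader via super.
        super(reader);
        }

    @Override
    public String readLine() throws IOException
        {
        final String line = super.readLine();
        // null marks end of stream and must be passed through unchanged.
        if (line != null && line.trim().isEmpty())
            { return EMPTY_ROW; }
        else
            { return line; }
        }
    }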

There are a lot of other details to get right. Overriding readLine() only helps callers that actually use readLine(); a consumer that pulls characters through the various read() methods will bypass it, so those must return the padded data too, and the remaining methods (close, ready, reset, skip, etc.) must behave consistently. If the file fits in memory easily, it might be simpler to read the file, write the fixed version to a new StringWriter, and then pass a StringReader over the result to the CSVParser.
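The in-memory alternative could be sketched roughly as follows. This is only a sketch under some assumptions: BlankLinePadder, emptyRow and padBlankLines are hypothetical names, not part of Apache Commons CSV, and the column count is passed in by the caller (37 for the file in the question). It works line by line, so it would mangle quoted fields that contain embedded line breaks.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of Apache Commons CSV.
class BlankLinePadder
    {
    // Builds the row for an empty record: columnCount fields need columnCount - 1 commas.
    static String emptyRow(final int columnCount)
        {
        final StringBuilder row = new StringBuilder();
        for (int i = 1; i < columnCount; i++)
            { row.append(','); }
        return row.toString();
        }

    // Reads everything from source and returns the text with blank lines padded.
    // Every output line is terminated with CRLF, whatever the input used.
    // Assumption: no quoted field spans more than one line.
    static String padBlankLines(final Reader source, final int columnCount)
        {
        final String emptyRow = emptyRow(columnCount);
        final StringWriter fixed = new StringWriter();
        try (BufferedReader in = new BufferedReader(source))
            {
            String line;
            while ((line = in.readLine()) != null)
                {
                fixed.write(line.trim().isEmpty() ? emptyRow : line);
                fixed.write("\r\n");
                }
            }
        catch (IOException e)
            { throw new UncheckedIOException(e); }
        return fixed.toString();
        }
    }
```

With the question's setup code, the result could then be handed to the parser as new CSVParser(new StringReader(BlankLinePadder.padBlankLines(fileReader, 37)), csvFileFormat). Note that this only pads completely blank lines; rows that are merely short, like the "Before" sample in the question, would still need separate handling.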

Answer 2:

Maybe try CSVParser.parse(File file, Charset charset, CSVFormat format), which creates a parser for the given File and lets you name the charset explicitly.

import java.nio.charset.StandardCharsets; // e.g. StandardCharsets.UTF_8

Note: a FileReader created with FileReader(java.io.File), as in the question's setup code, relies on the default encoding of the JVM that is executing the code.
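If encoding does turn out to be the culprit, the charset can also be pinned down when opening the reader, using only the JDK. The following is a minimal sketch: ExplicitCharsetReader and open are hypothetical names, and UTF-8 is an assumption about the file, since the question does not say which encoding it uses.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
class ExplicitCharsetReader
    {
    // Opens the file as UTF-8 (an assumption) regardless of the JVM default charset.
    static BufferedReader open(final Path csvPath)
        {
        try
            { return Files.newBufferedReader(csvPath, StandardCharsets.UTF_8); }
        catch (IOException e)
            { throw new UncheckedIOException(e); }
        }
    }
```

The returned reader could replace the FileReader in the question's setup code, for example new CSVParser(ExplicitCharsetReader.open(file.toPath()), csvFileFormat). This only changes how bytes are decoded; it does not add missing fields to short or blank rows.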

Answer 3:

Or maybe try withAllowMissingColumnNames?

//initialize CSVFormat object
CSVFormat csvFileFormat = CSVFormat.DEFAULT.withHeader(FILE_HEADER_MAPPING).withAllowMissingColumnNames();

Source: https://stackoverflow.com/questions/36653173/error-parsing-due-to-csv-differences-before-after-saving-java-w-apache-commons

Tags

java

csv

encoding

apache-commons-csv