What is the fastest way to get the dimensions of a CSV file in Java?

Submitted by 不羁岁月 on 2019-12-22 17:53:11

Question


My usual procedure for getting the dimensions of a CSV file is as follows:

  1. Get the number of rows: I use a while loop to read every line and count each successful read. The drawback is that it takes time to read the whole file just to count its rows.

  2. Get the number of columns: I use String[] temp = lineOfText.split(","); and then take the length of temp.

Is there a smarter method? Something like:
file1 = read.csv;
xDimention = file1.xDimention;
yDimention = file1.yDimention;


Answer 1:


Your approach won't work with multi-line values (you'll get an incorrect number of rows) or with quoted values that happen to contain the delimiter (you'll get an incorrect number of columns).

You should use a CSV parser such as the one provided by univocity-parsers.

Using the uniVocity CSV parser, the fastest way to determine the dimensions would be with the following code. It parses a 150MB file to give its dimensions in 1.2 seconds:

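// Imports needed by this snippet (univocity-parsers must be on the classpath)
import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.AbstractRowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;

// Let's create our own RowProcessor to analyze the rows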
static class CsvDimension extends AbstractRowProcessor {

    int lastColumn = -1;
    long rowCount = 0;

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        rowCount++;
        if (lastColumn < row.length) {
            lastColumn = row.length;
        }
    }
}

public static void main(String... args) throws FileNotFoundException {
    // let's measure the time roughly
    long start = System.currentTimeMillis();

    //Creates an instance of our own custom RowProcessor, defined above.
    CsvDimension myDimensionProcessor = new CsvDimension();

    CsvParserSettings settings = new CsvParserSettings();

    //This tells the parser that no row should have more than 2,000,000 columns
    settings.setMaxColumns(2000000);

    //Here you can select the column indexes you are interested in reading.
    //The parser will return values for the columns you selected, in the order you defined
    //By selecting no indexes here, no String objects will be created
    settings.selectIndexes(/*nothing here*/);

    //When you select indexes, the columns are reordered so they come in the order you defined.
    //By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
    settings.setColumnReorderingEnabled(false);

    //We instruct the parser to send all rows parsed to your custom RowProcessor. 
    settings.setRowProcessor(myDimensionProcessor);

    //Finally, we create a parser
    CsvParser parser = new CsvParser(settings);

    //And parse! All rows are sent to your custom RowProcessor (CsvDimension)
    //I'm using a 150MB CSV file with about 3.2 million rows.
    parser.parse(new FileReader(new File("c:/tmp/worldcitiespop.txt")));

    //Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
    System.out.println("Columns: " + myDimensionProcessor.lastColumn);
    System.out.println("Rows: " + myDimensionProcessor.rowCount);
    System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");

}

The output will be:

Columns: 7
Rows: 3173959
Time taken: 1279 ms

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).




Answer 2:


I guess it depends on how regular the structure is, and whether you need an exact answer or not.

I could imagine looking at the first few rows (or randomly skipping through the file), and then dividing the file size by average row size to determine a rough row count.
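The sampling idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (RowCountEstimator, estimateRows) are hypothetical, the sample is taken from the start of the data only, and the result is an estimate that is only as good as the sample is representative.

```java
import java.nio.charset.StandardCharsets;

class RowCountEstimator {

    /**
     * Estimates the total row count from the first bytes of the data ("head"),
     * the total size in bytes, and the maximum number of rows to sample.
     * Hypothetical helper, not part of any library.
     */
    static long estimateRows(byte[] head, long totalBytes, int sampleRows) {
        int rows = 0;
        int bytesSeen = 0;
        for (int i = 0; i < head.length && rows < sampleRows; i++) {
            if (head[i] == '\n') {
                rows++;
                bytesSeen = i + 1; // bytes used by the complete rows seen so far
            }
        }
        if (rows == 0) {
            // No complete row in the sample: treat non-empty data as a single row
            return totalBytes > 0 ? 1 : 0;
        }
        double averageRowBytes = (double) bytesSeen / rows;
        return Math.round(totalBytes / averageRowBytes);
    }

    public static void main(String[] args) {
        byte[] head = "id,name\n1,aa\n2,bb\n".getBytes(StandardCharsets.US_ASCII);
        // Pretend the full file is 6,000 bytes long
        System.out.println("Estimated rows: " + estimateRows(head, 6_000, 100));
    }
}
```

With a real file, head would be the first few kilobytes read from it and totalBytes would come from something like Files.size(path).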

If you control how these files get written, you could potentially tag them or add a metadata file next to them containing row counts.

Strictly speaking, the way you're splitting the line doesn't cover all possible cases. "hello, world", 4, 5 should read as having 3 columns, not 4.
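To illustrate the difference, here is a minimal sketch of a quote-aware column count for a single record, compared with a plain split. The class and method names are made up for this example, and it assumes double quotes as the quote character and no line breaks inside the record.

```java
class QuotedColumnCounter {

    /** Counts the fields in one CSV record, ignoring commas inside double quotes. */
    static int countColumns(String record) {
        int columns = 1; // a record with no commas still has one field
        boolean quoted = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged
                quoted = !quoted;
            } else if (c == ',' && !quoted) {
                columns++;
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        String record = "\"hello, world\", 4, 5";
        System.out.println("Quote-aware: " + countColumns(record));
        System.out.println("Plain split: " + record.split(",").length);
    }
}
```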




Answer 3:


IMO, what you are doing is an acceptable way to do it. But here are some ways you could make it faster:

  1. Rather than reading lines, which creates a new String object for each line, just use String.indexOf to find the bounds of your lines
  2. Rather than using line.split, again use indexOf to count the number of commas
  3. Use multithreading
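The first two suggestions might look something like the sketch below. It assumes the content (or a chunk of it) is already in memory as a String, does no quote handling, and uses hypothetical names (IndexOfScanner, countLines, countColumns).

```java
class IndexOfScanner {

    /** Counts lines by locating '\n' with indexOf; a last line without '\n' still counts. */
    static int countLines(String content) {
        int lines = 0;
        int from = 0;
        int newline;
        while ((newline = content.indexOf('\n', from)) != -1) {
            lines++;
            from = newline + 1;
        }
        // Anything left after the last newline is one more line
        return from < content.length() ? lines + 1 : lines;
    }

    /** Counts the fields in content[start, end) by locating commas with indexOf. */
    static int countColumns(String content, int start, int end) {
        int columns = 1;
        int comma = content.indexOf(',', start);
        while (comma != -1 && comma < end) {
            columns++;
            comma = content.indexOf(',', comma + 1);
        }
        return columns;
    }

    public static void main(String[] args) {
        String content = "a,b,c\n1,2,3\n4,5,6";
        int firstLineEnd = content.indexOf('\n');
        System.out.println("Rows: " + countLines(content));
        System.out.println("Columns: " + countColumns(content, 0, firstLineEnd));
    }
}
```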



Answer 4:


I guess there are a few options, depending on how you use the data:

  1. Store the dimensions of your CSV file when writing the file (in the first row or in an additional file)
  2. Use a more efficient way to count lines - maybe http://docs.oracle.com/javase/6/docs/api/java/io/LineNumberReader.html
  3. Instead of creating arrays of a fixed size (assuming that's what you need the line count for), use array lists - this may or may not be more efficient depending on the size of the file.
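As a rough sketch of the first option, the writer could store the dimensions in a small file next to the CSV. Everything here is an assumption made for illustration: the ".dims" suffix, the "rows,columns" format, and the DimensionsSidecar class are invented, and the stored values become stale if the CSV is changed without updating them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class DimensionsSidecar {

    /** Hypothetical convention: data.csv gets a companion file data.csv.dims. */
    static Path sidecarFor(Path csv) {
        return csv.resolveSibling(csv.getFileName() + ".dims");
    }

    /** Stores "rows,columns" next to the CSV file. */
    static void write(Path csv, long rows, int columns) {
        try {
            byte[] content = (rows + "," + columns).getBytes(StandardCharsets.UTF_8);
            Files.write(sidecarFor(csv), content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the stored dimensions back as {rows, columns}. */
    static long[] read(Path csv) {
        try {
            byte[] content = Files.readAllBytes(sidecarFor(csv));
            String[] parts = new String(content, StandardCharsets.UTF_8).trim().split(",");
            return new long[] {Long.parseLong(parts[0]), Long.parseLong(parts[1])};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempDirectory("csv-dims").resolve("data.csv");
        Files.write(csv, "a,b,c\n1,2,3\n".getBytes(StandardCharsets.UTF_8));
        write(csv, 2, 3); // the writer knows the dimensions at this point
        long[] dims = read(csv);
        System.out.println("Rows: " + dims[0] + ", Columns: " + dims[1]);
    }
}
```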



Answer 5:


To find the number of rows you have to read the whole file; there is nothing you can do about that. However, your method of finding the number of columns is a bit inefficient. Instead of splitting, just count how many times "," appears in the line. You might also add a special case for fields enclosed in quotes, as mentioned by @Vlad.

The String.split method creates an array of strings as its result and splits using a regexp, which is not very efficient.
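Since the whole file has to be read anyway, one way to keep that pass cheap is to scan raw bytes and count separators without creating any String objects. The sketch below is an illustration only: the class name is hypothetical, it assumes an ASCII-compatible encoding such as UTF-8, it takes the column count from the first line, and it does not handle quoted fields.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ByteLevelDimensions {

    /** Returns {rows, columns}; columns is derived from the commas on the first line. */
    static long[] dimensions(InputStream in) {
        byte[] buffer = new byte[64 * 1024];
        long rows = 0;
        long firstLineCommas = 0;
        int last = '\n'; // treat the start of input as the start of a line
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    byte b = buffer[i];
                    if (b == '\n') {
                        rows++;
                    } else if (b == ',' && rows == 0) {
                        firstLineCommas++;
                    }
                    last = b;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (last != '\n') {
            rows++; // count a final line that has no trailing newline
        }
        long columns = rows == 0 ? 0 : firstLineCommas + 1;
        return new long[] {rows, columns};
    }

    public static void main(String[] args) {
        byte[] data = "a,b\n1,2\n3,4\n".getBytes(StandardCharsets.UTF_8);
        long[] dims = dimensions(new ByteArrayInputStream(data));
        System.out.println("Rows: " + dims[0] + ", Columns: " + dims[1]);
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a FileInputStream or the stream returned by Files.newInputStream.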




Answer 6:


I found this short but interesting solution here: https://stackoverflow.com/a/5342096/4082824

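// Needs java.io.File, java.io.FileReader and java.io.LineNumberReader
try (LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
    lnr.skip(Long.MAX_VALUE);
    // getLineNumber() returns the number of line terminators read, so add 1 for a
    // last line without a trailing newline (this overcounts by one if the file ends with one)
    System.out.println(lnr.getLineNumber() + 1);
}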



Answer 7:


My solution is simple and correctly processes CSV with multi-line cells and quoted values.

For example, given this CSV file:

1,"""2""","""111,222""","""234;222""","""""","1
2
3"
2,"""2""","""111,222""","""234;222""","""""","2
3"
3,"""5""","""1112""","""10;2""","""""","1
2"

And my solution is:

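import java.io.*;

public class CsvDimension {

    public void parse(Reader reader) throws IOException {
        long cells = 0;
        long lines = 0;
        int c;
        int last = '\n';        // last character read; '\n' means we are at the start of a record
        boolean quoted = false; // true while inside a quoted field
        while ((c = reader.read()) != -1) {
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged
                quoted = !quoted;
            }
            if (!quoted) {
                if (c == '\n') {
                    lines++;
                    cells++;
                }
                if (c == ',') {
                    cells++;
                }
            }
            last = c;
        }
        // Count a final record that is not terminated by a newline
        if (last != '\n') {
            lines++;
            cells++;
        }
        long cols = lines == 0 ? 0 : cells / lines; // avoid dividing by zero on empty input
        System.out.printf("lines: %d%ncells: %d%ncols: %d%n", lines, cells, cols);
        reader.close();
    }

    public static void main(String[] args) throws IOException {
        new CsvDimension().parse(new BufferedReader(new FileReader(new File("test.csv"))));
    }
}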


Source: https://stackoverflow.com/questions/30624727/what-is-the-fastest-way-to-get-dimensions-of-a-csv-file-in-java
