Lazy CSV Filtering / Parsing - Increasing Performance


Question


Lazy Filtering CSV Files

I needed to filter through millions of log records stored across numerous CSV files. The combined size of the records greatly exceeded my available memory, so I wanted to take a lazy approach.

Java 8 Streams API

With JDK 8 we have the Streams API, which, paired with Apache Commons CSV, lets us accomplish this easily.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.stream.StreamSupport;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class LazyFilterer {

    // Returns a lazy Iterable over the records of one CSV file; commons-csv
    // only reads records from the underlying reader as they are consumed.
    private static Iterable<CSVRecord> getIterable(String fileName) throws IOException {
        return CSVFormat
                .DEFAULT
                .withFirstRecordAsHeader()
                .parse(new BufferedReader(new FileReader(fileName)));
    }

    public static void main(String[] args) throws Exception {
        File dir = new File("csv");

        for (File file : dir.listFiles()) {
            Iterable<CSVRecord> iterable = getIterable(file.getAbsolutePath());

            // Stream the records (the second argument requests a parallel stream),
            // filter them lazily, and print the matches.
            StreamSupport.stream(iterable.spliterator(), true)
                    .filter(c -> c.get("API_Call").equals("Updates"))
                    .filter(c -> c.get("Remove").isEmpty())
                    .forEach(System.out::println);
        }
    }
}

Performance

This graph from VisualVM shows the memory usage during the parsing of 2.3 GB of CSV files using a more complex filtration pipeline [1] than the one shown above.

As you can see, the memory usage remains essentially constant [2] while the filtering takes place.
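
If you want to sanity-check that behaviour without attaching VisualVM, a minimal in-process heap logger like the sketch below can be started before the filtering and shut down afterwards; the one-second interval and the MB formatting are just illustrative choices.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeapMonitor {

    // Prints the currently used heap once per second until shutdown() is
    // called on the returned executor.
    public static ScheduledExecutorService start() {
        Runtime rt = Runtime.getRuntime();
        ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
        monitor.scheduleAtFixedRate(
                () -> System.err.printf("heap used: %d MB%n",
                        (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)),
                0, 1, TimeUnit.SECONDS);
        return monitor;
    }
}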

Can you find another method that accomplishes the same task more quickly, without increasing code complexity?

Any language is welcome; Java is not necessarily preferred!

Footnotes

[1] - E.g., for each CSVRecord that matches on "API_Call" I might need to do some JSON deserialization and apply additional filtering after that, or even create an object for certain records to support further computations (see the sketch after these footnotes).

[2] - The idle time at the beginning of the graph was a System.in.read() used to ensure that VisualVM was fully loaded before computation began.
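
To make footnote [1] concrete, here is a minimal sketch of the kind of per-record step it describes, assuming Jackson for the JSON deserialization; the Payload class, the "Details" column name, and the threshold are hypothetical placeholders rather than part of the real pipeline.

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.commons.csv.CSVRecord;

public class RecordEnricher {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Hypothetical shape of the JSON carried inside one of the CSV columns.
    public static class Payload {
        public String user;
        public int itemCount;
    }

    // Deserializes the JSON column of a matching record and applies further filtering.
    public static boolean keep(CSVRecord record) {
        try {
            Payload payload = MAPPER.readValue(record.get("Details"), Payload.class);
            return payload.itemCount > 10; // additional filtering on the deserialized object
        } catch (Exception e) {
            return false; // skip records whose JSON column cannot be parsed
        }
    }
}

In the stream above this would simply be one more .filter(RecordEnricher::keep) stage, so the overall processing stays lazy.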


Answer 1:


That's horrible for just 2.3 GB of data; may I suggest trying uniVocity-parsers for better performance? Try this:

CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true); // grabs headers from input

//select the fields you are interested in. The selected fields come back in the given order, which makes indexing easier
settings.selectFields("API_Call", "Remove"/*, ... and everything else you are interested in*/);

//defines a processor to filter the rows you want
settings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        if (row[0].equals("Updates") && row[1].isEmpty()) {
            System.out.println(Arrays.toString(row));
        }
    }
});

// create the parser
CsvParser parser = new CsvParser(settings);

//parses everything. All rows will be sent to the processor defined above
parser.parse(file, "UTF-8"); 

I know it's not functional style, but it took 20 seconds to process a 4 GB file I created to test this, while consuming less than 75 MB of memory the whole time. From your graph it seems your current approach takes 1 minute for a smaller file and needs 10 times as much memory.

Give this example a try, I believe it will help considerably.

Disclaimer: I'm the author of this library; it's open source and free (Apache 2.0 license).
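
A minimal variant of the processor above, in case you want to collect the matching rows into a list instead of printing them; since field selection may hand back empty cells as null, the check below covers both cases.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.AbstractRowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class CollectingFilter {

    // Parses one file and returns the rows matching the same conditions as above.
    public static List<String[]> filter(File file) {
        List<String[]> matches = new ArrayList<>();

        CsvParserSettings settings = new CsvParserSettings();
        settings.setHeaderExtractionEnabled(true);
        settings.selectFields("API_Call", "Remove");

        settings.setProcessor(new AbstractRowProcessor() {
            @Override
            public void rowProcessed(String[] row, ParsingContext context) {
                if ("Updates".equals(row[0]) && (row[1] == null || row[1].isEmpty())) {
                    matches.add(row); // keep the row instead of printing it
                }
            }
        });

        new CsvParser(settings).parse(file, "UTF-8");
        return matches;
    }
}

The parsing itself still streams through the file; only the matching rows are retained in memory.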



Source: https://stackoverflow.com/questions/39594923/lazy-csv-filtering-parsing-increasing-performance
