Lazy CSV Filtering / Parsing - Increasing Performance


Question


Lazy Filtering CSV Files

I needed to filter through millions of log records stored across numerous CSV files. The combined size of the records greatly exceeded my available memory, so I wanted to take a lazy approach.

Java 8 Streams API

With JDK 8 we have the Streams API, which, paired with Apache Commons CSV, lets us accomplish this easily.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.stream.StreamSupport;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class LazyFilterer {

    // Returns a lazy Iterable over the records of one CSV file; commons-csv
    // only reads records from the underlying reader as they are consumed.
    private static Iterable<CSVRecord> getIterable(String fileName) throws IOException {
        return CSVFormat
                .DEFAULT
                .withFirstRecordAsHeader()
                .parse(new BufferedReader(new FileReader(fileName)));
    }

    public static void main(String[] args) throws Exception {
        File dir = new File("csv");

        for (File file : dir.listFiles()) {
            Iterable<CSVRecord> iterable = getIterable(file.getAbsolutePath());

            // Stream the records (the second argument requests a parallel stream),
            // filter them lazily, and print the matches.
            StreamSupport.stream(iterable.spliterator(), true)
                    .filter(c -> c.get("API_Call").equals("Updates"))
                    .filter(c -> c.get("Remove").isEmpty())
                    .forEach(System.out::println);
        }
    }
}

Performance

This graph from VisualVM shows the memory usage during the parsing of 2.3 GB of CSV files using a more complex filtration pipeline [1] than the one shown above.

As you can see, the memory usage remains essentially constant [2] while the filtering takes place.
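
If you want to sanity-check that behaviour without attaching VisualVM, a minimal in-process heap logger like the sketch below can be started before the filtering and shut down afterwards; the one-second interval and the MB formatting are just illustrative choices.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeapMonitor {

    // Prints the currently used heap once per second until shutdown() is
    // called on the returned executor.
    public static ScheduledExecutorService start() {
        Runtime rt = Runtime.getRuntime();
        ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
        monitor.scheduleAtFixedRate(
                () -> System.err.printf("heap used: %d MB%n",
                        (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)),
                0, 1, TimeUnit.SECONDS);
        return monitor;
    }
}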

Can you find another method that accomplishes the same task more quickly, without increasing code complexity?

Any language is welcome; Java is not necessarily preferred!

Footnotes

[1] - E.g., for each CSVRecord that matches on "API_Call" I might need to do some JSON deserialization and apply additional filtering after that, or even create an object for certain records to support further computations (see the sketch after these footnotes).

[2] - The idle time at the beginning of the graph was a System.in.read() used to ensure that VisualVM was fully loaded before computation began.
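
To make footnote [1] concrete, here is a minimal sketch of the kind of per-record step it describes, assuming Jackson for the JSON deserialization; the Payload class, the "Details" column name, and the threshold are hypothetical placeholders rather than part of the real pipeline.

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.commons.csv.CSVRecord;

public class RecordEnricher {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Hypothetical shape of the JSON carried inside one of the CSV columns.
    public static class Payload {
        public String user;
        public int itemCount;
    }

    // Deserializes the JSON column of a matching record and applies further filtering.
    public static boolean keep(CSVRecord record) {
        try {
            Payload payload = MAPPER.readValue(record.get("Details"), Payload.class);
            return payload.itemCount > 10; // additional filtering on the deserialized object
        } catch (Exception e) {
            return false; // skip records whose JSON column cannot be parsed
        }
    }
}

In the stream above this would simply be one more .filter(RecordEnricher::keep) stage, so the overall processing stays lazy.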


Answer 1:


That's horrible for just 2.3 GB of data; may I suggest trying uniVocity-parsers for better performance? Try this:

CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true); // grabs headers from input

//select the fields you are interested in. The selected fields come back in the given order, which makes indexing easier
settings.selectFields("API_Call", "Remove"/*, ... and everything else you are interested in*/);

//defines a processor to filter the rows you want
settings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        if (row[0].equals("Updates") && row[1].isEmpty()) {
            System.out.println(Arrays.toString(row));
        }
    }
});

// create the parser
CsvParser parser = new CsvParser(settings);

//parses everything. All rows will be sent to the processor defined above
parser.parse(file, "UTF-8"); 

I know it's not functional style, but it took 20 seconds to process a 4 GB file I created to test this, while consuming less than 75 MB of memory the whole time. From your graph it seems your current approach takes 1 minute for a smaller file and needs 10 times as much memory.

Give this example a try, I believe it will help considerably.

Disclaimer: I'm the author of this library; it's open source and free (Apache 2.0 license).
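
A minimal variant of the processor above, in case you want to collect the matching rows into a list instead of printing them; since field selection may hand back empty cells as null, the check below covers both cases.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.AbstractRowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class CollectingFilter {

    // Parses one file and returns the rows matching the same conditions as above.
    public static List<String[]> filter(File file) {
        List<String[]> matches = new ArrayList<>();

        CsvParserSettings settings = new CsvParserSettings();
        settings.setHeaderExtractionEnabled(true);
        settings.selectFields("API_Call", "Remove");

        settings.setProcessor(new AbstractRowProcessor() {
            @Override
            public void rowProcessed(String[] row, ParsingContext context) {
                if ("Updates".equals(row[0]) && (row[1] == null || row[1].isEmpty())) {
                    matches.add(row); // keep the row instead of printing it
                }
            }
        });

        new CsvParser(settings).parse(file, "UTF-8");
        return matches;
    }
}

The parsing itself still streams through the file; only the matching rows are retained in memory.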



Source: https://stackoverflow.com/questions/39594923/lazy-csv-filtering-parsing-increasing-performance
