Question
I have a large collection of tab-separated text data in the form of DATE NAME MESSAGE. By large I mean a collection of 1.76 GB divided into 1075 actual files. I have to get the NAME data from all the files. So far I have this:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

File f = new File(directory);
File[] files = f.listFiles();
// HashSet<String> all = new HashSet<String>();
ArrayList<String> userCount = new ArrayList<String>();
for (File file : files) {
    if (file.getName().endsWith(".txt")) {
        System.out.println(file.getName());
        BufferedReader in;
        try {
            in = new BufferedReader(new FileReader(file));
            String str;
            while ((str = in.readLine()) != null) {
                // if (all.add(str)) {
                userCount.add(str.split("\t")[1]);
                // }
                // if (all.size() > 500)
                //     all.clear();
            }
            in.close();
        } catch (IOException e) {
            System.err.println("Something went wrong: " + e.getMessage());
        }
    }
}
My program always throws an OutOfMemoryError, even with -Xmx1700m, and I cannot go beyond that. Is there any way I can optimize the code so that it can handle the ArrayList<String> of NAMEs?
Answer 1:
Since you seem to be open to solutions other than Java, here's an awk one-liner that should handle it.
cat *.txt | awk -F'\t' '{sum[$2] += 1} END {for (name in sum) print name "," sum[name]}'
Explanation:
-F'\t' - separate fields on tabs
sum[$2] += 1 - increment the count for the second field (the name)
END { ... } - once all input has been read, print each name with its count
Associative arrays make this extremely succinct. I ran it on a test file created with the following script:
import random

def main():
    names = ['Nick', 'Frances', 'Carl']
    for i in range(10000):
        date = '2012-03-24'
        name = random.choice(names)
        message = 'asdf'
        print('%s\t%s\t%s' % (date, name, message))

if __name__ == '__main__':
    main()
I get the results:
Carl,3388
Frances,3277
Nick,3335
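For comparison, here is a minimal Java sketch of the same idea: counting occurrences per name in a HashMap instead of storing every line's NAME, so memory grows with the number of distinct names rather than the number of lines. The NameCount class name and the single-file argument are illustrative assumptions, not part of the original answer:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class NameCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String name = line.split("\t")[1];       // second column is NAME
                Integer c = counts.get(name);
                counts.put(name, c == null ? 1 : c + 1); // like awk's sum[$2] += 1
            }
        } finally {
            in.close();
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "," + e.getValue());
        }
    }
}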
Answer 2:
There are a few things you can do to improve the memory footprint and general performance of your code:

- Close your FileReader objects before moving on to the next one. FileReader is an InputStreamReader, and close() must be called on it to free up resources. Because your current code only calls close() inside the try block, any file whose read throws an exception leaves its stream open. Use a finally block instead:

    for (File file : files) {
        BufferedReader in = null;
        try {
            in = new BufferedReader(new FileReader(file));
            // TODO do whatever you want here.
        } finally {
            if (in != null) {
                in.close();
            }
        }
    }

- If possible, eliminate storing all of your NAME values in the userCount ArrayList. Like A. R. S. suggested, you can write this information to another file first, and then just read that file back when you need the data again. If that's not an attractive option, you could still write your information to an OutputStream which is then piped to an InputStream elsewhere in your app. This would keep your data in memory, but whatever consumes the list of NAME values could begin processing/displaying it concurrently while you continue to read through these 1,000+ files searching for more NAME values.
- Use the listFiles(FileFilter) method so Java can filter out non-text files for you. This saves a few CPU cycles, since you no longer iterate over files with the wrong extension only to discard them. (A sketch combining these suggestions follows this list.)
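Putting these suggestions together, here is a sketch, assuming the same DATE\tNAME\tMESSAGE layout; the ExtractNames class, the names.txt output path, and the directory argument are illustrative, not part of the original answer:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class ExtractNames {
    public static void main(String[] args) throws IOException {
        File dir = new File(args[0]);
        // Let Java filter out non-text files (third suggestion).
        File[] txtFiles = dir.listFiles(new FileFilter() {
            public boolean accept(File f) {
                return f.isFile() && f.getName().endsWith(".txt");
            }
        });
        // Write names to a file instead of holding them all in memory (second suggestion).
        PrintWriter out = new PrintWriter("names.txt");
        try {
            for (File file : txtFiles) {
                BufferedReader in = null;
                try {
                    in = new BufferedReader(new FileReader(file));
                    String line;
                    while ((line = in.readLine()) != null) {
                        out.println(line.split("\t")[1]);
                    }
                } finally {
                    if (in != null) {
                        in.close(); // always release the stream (first suggestion)
                    }
                }
            }
        } finally {
            out.close();
        }
    }
}

Writing the names out as you go keeps the heap flat regardless of input size; a second pass (or the HashMap sketch above) can then aggregate them.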
Answer 3:
String.split returns Strings that internally share the same char array as the original String, so the unused characters cannot be garbage collected while the substrings are still referenced. (This applies to Java 6 and early Java 7; since JDK 7u6, substring copies the characters.)
Try using new String(str.split("\t")[1]) to force the allocation of a new, right-sized array.
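Applied to the loop from the question, the change is a one-line swap (a minimal sketch; on JDK 7u6 and later the plain split result is already safe):

    while ((str = in.readLine()) != null) {
        // Copy the token so the whole line's char[] can be garbage collected (pre-7u6 JVMs).
        userCount.add(new String(str.split("\t")[1]));
    }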
Source: https://stackoverflow.com/questions/10360046/string-split-out-of-memory