String split out of memory

Submitted by 雨燕双飞 on 2019-12-14 03:59:00

Question


I have a large collection of tab-separated text data in the form DATE NAME MESSAGE. By large I mean 1.76 GB spread across 1075 files. I need to extract the NAME field from every file. So far I have this:

        File f = new File(directory);
        File files[] = f.listFiles();
        // HashSet<String> all = new HashSet<String>();
        ArrayList<String> userCount = new ArrayList<String>();
        for (File file : files) {
            if (file.getName().endsWith(".txt")) {
                System.out.println(file.getName());
                BufferedReader in;
                try {
                    in = new BufferedReader(new FileReader(file));
                    String str;
                    while ((str = in.readLine()) != null) {
                        // if (all.add(str)) {
                        userCount.add(str.split("\t")[1]);
                        // }

                        // if (all.size() > 500)
                        // all.clear();
                    }
                    in.close();
                } catch (IOException e) {
                    System.err.println("Something went wrong: "
                            + e.getMessage());
                }

            }
        }

My program always fails with an OutOfMemoryError, even with -Xmx1700m, and I cannot raise the heap limit any further. Is there any way to optimize the code so that it can handle the ArrayList<String> of NAMEs?


Answer 1:


Since you seem to be open to alternatives to Java, here's an awk solution that should handle it.

cat *.txt | awk -F'\t' '{sum[$2] += 1} END {for (name in sum) print name "," sum[name]}'

Explanation:

-F'\t' - separate on tabs
sum[$2] += 1 - increment the value for the second element (name)

Associative arrays make this extremely succinct. Running it on a test file I created as follows:

import random

def main():
    names = ['Nick', 'Frances', 'Carl']
    for i in range(10000):
        date = '2012-03-24'
        name = random.choice(names)
        message = 'asdf'
        # Parenthesized form works under both Python 2 and Python 3
        print('%s\t%s\t%s' % (date, name, message))

if __name__ == '__main__':
    main()

I get the results:

Carl,3388
Frances,3277
Nick,3335
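
For readers who want to stay in Java, the same associative-array idea can be sketched with a HashMap. This is only an illustration, not the asker's code: the class name NameCounter and the helper tally are hypothetical, the input is read from standard input the way the awk command reads it, and lines without a NAME column are silently skipped (an assumption; the original code would throw on them). Memory use then grows with the number of distinct names rather than the number of lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class: counts how often each NAME occurs.
class NameCounter {

    // Adds one occurrence of the line's NAME field (second tab-separated
    // column) to the running totals.
    static void tally(String line, Map<String, Integer> counts) {
        // The limit of 3 keeps tabs inside MESSAGE from producing extra fields.
        String[] fields = line.split("\t", 3);
        if (fields.length < 2) {
            return; // assumption: ignore lines that have no NAME column
        }
        counts.merge(fields[1], 1, Integer::sum); // Map.merge needs Java 8+
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // Read from standard input, e.g. fed by: cat *.txt
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                tally(line, counts);
            }
        }
        // Same "name,count" output format as the awk command.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + "," + entry.getValue());
        }
    }
}
```

Because a HashMap is used, the output order is unspecified, just as with awk's for (name in sum).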



Answer 2:


There are a few things you can do to improve the memory footprint and general performance of your code:

  1. Make sure every FileReader is closed before moving on to the next file. FileReader is an InputStreamReader, and its close() must be called to free the underlying file handle. Your current code calls in.close() only when no exception is thrown, so a failed read leaves that stream open. Guarantee the close regardless of how the loop body exits:

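    for (File file : files) {
        // try-with-resources (Java 7+) closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            // TODO do whatever you want here.
        } catch (IOException e) {
            System.err.println("Something went wrong: " + e.getMessage());
        }
    }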
    
  2. If possible, avoid storing all of your NAME values in the userCount ArrayList. As A. R. S. suggested, you can write this information to another file first, and then read that file when you need the data again. If that's not an attractive option, you could write the names to an OutputStream that is piped to an InputStream elsewhere in your app. The data would still pass through memory, but whatever consumes the NAME values could start processing or displaying them concurrently while you continue reading through the 1,000+ files.

  3. Use the listFiles(FileFilter) method so Java can filter out non-text files for you. This should save a few CPU cycles, as you no longer have to iterate over files with the wrong extension before discarding them.
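
Points 2 and 3 can be combined in a short sketch. It is an illustration under stated assumptions rather than a definitive implementation: the class name NameExporter, the method exportNames, and the two command-line paths are hypothetical, and lines without a NAME column are skipped. Each name is written to the Writer as soon as it is read, so no list of names is ever held in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper class: streams the NAME column to a Writer.
class NameExporter {

    // Accepts regular files whose name ends in ".txt", like the question's loop.
    static final FileFilter TXT_FILES =
            file -> file.isFile() && file.getName().endsWith(".txt");

    // Writes the NAME field of every line in every matching file to 'out',
    // one name per line, and returns how many names were written.
    static long exportNames(File directory, Writer out) throws IOException {
        File[] files = directory.listFiles(TXT_FILES);
        if (files == null) { // listFiles returns null for non-directories
            throw new IOException("Not a readable directory: " + directory);
        }
        long written = 0;
        for (File file : files) {
            // The reader is closed before the next file is opened.
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.split("\t", 3);
                    if (fields.length < 2) {
                        continue; // assumption: skip lines without a NAME column
                    }
                    out.write(fields[1]);
                    out.write('\n');
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: input directory, args[1]: output file (hypothetical paths)
        try (Writer out = new BufferedWriter(new FileWriter(args[1]))) {
            System.out.println(exportNames(new File(args[0]), out) + " names written");
        }
    }
}
```

The output file should live outside the input directory (or not end in .txt); otherwise the filter would pick it up as an input file.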



Answer 3:


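On JVMs older than Java 7 update 6, the Strings returned by String.split share the internal char array of the original String. As long as a NAME substring is alive, the whole line stays reachable and its unused chars cannot be garbage collected. (Later JVMs copy the characters when creating a substring, so the problem does not arise there.)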

Try using new String( str.split("\t")[1]) to force the allocation of a new array.
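
A minimal sketch of that idea follows. The class NameField and the method extractName are hypothetical names, and returning null for a line without a tab is an assumption. It locates the field with indexOf instead of split, so only one substring is created per line, and wraps it in new String(...) so that on older JVMs the result owns a char array of its own. On newer JVMs the wrapper is redundant but harmless.

```java
// Hypothetical helper class: extracts NAME without retaining the whole line.
class NameField {

    // Returns the second tab-separated field of 'line' as an independent
    // String, or null if the line contains no tab (assumption for this sketch).
    static String extractName(String line) {
        int first = line.indexOf('\t');
        if (first < 0) {
            return null;
        }
        int second = line.indexOf('\t', first + 1);
        if (second < 0) {
            second = line.length(); // NAME is the last field on this line
        }
        // new String(...) forces a copy that does not share the line's
        // backing array on JVMs where substring shares it.
        return new String(line.substring(first + 1, second));
    }
}
```

In the question's loop this would replace str.split("\t")[1] with NameField.extractName(str), plus a null check before adding the result to the list.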



Source: https://stackoverflow.com/questions/10360046/string-split-out-of-memory
