sorting lines of an enormous file.txt in java

大城市里の小女人 提交于 2019-11-28 11:19:41

I think the solution here is to do a merge sort using temporary files:

  1. Read the first n lines of the first file, (n being the number of lines you can afford to store and sort in memory), sort them, and write them to file 1.tmp (or however you call it). Do the same with the next n lines and store it in 2.tmp. Repeat until all lines of the original file has been processed.

  2. Read the first line of each temporary file. Determine the smallest one (according to your sort order), write it to the destination file, and read the next line from the corresponding temporary file. Repeat until all lines have been processed.

  3. Delete all the temporary files.

This works with arbitrary large files, as long as you have enough disk space.

You can run the following with

-mx1g -XX:+UseCompressedStrings  # on Java 6 update 29
-mx1800m -XX:-UseCompressedStrings  # on Java 6 update 29
-mx2g  # on Java 7 update 2.

import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        generateFile("lines.txt", 755 * 1024 * 1024, 189000);

        List<String> lines = loadLines("lines.txt");

        System.out.println("Sorting file");
        Collections.sort(lines);
        System.out.println("... Sorted file");
        // save lines.
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f second to read, sort and write to a file%n", time / 1e9);
    }

    private static void generateFile(String fileName, int size, int lines) throws FileNotFoundException {
        System.out.println("Creating file to load");
        int lineSize = size / lines;
        StringBuilder sb = new StringBuilder();
        while (sb.length() < lineSize) sb.append('-');
        String padding = sb.toString();

        PrintWriter pw = new PrintWriter(fileName);
        for (int i = 0; i < lines; i++) {
            String text = (i + padding).substring(0, lineSize);
            pw.println(text);
        }
        pw.close();
        System.out.println("... Created file to load");
    }

    private static List<String> loadLines(String fileName) throws IOException {
        System.out.println("Reading file");
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        List<String> ret = new ArrayList<String>();
        String line;
        while ((line = br.readLine()) != null)
            ret.add(line);
        System.out.println("... Read file.");
        return ret;
    }
}

prints

Creating file to load
... Created file to load
Reading file
... Read file.
Sorting file
... Sorted file
Took 4.886 second to read, sort and write to a file
e-zinc

Algorithm:

How much memory do we have available? Let’s assume we have X MB of memory available.

  1. Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm. Save the lines back to the file.

  2. Now bring the next chunk into memory and sort.

  3. Once we’re done, merge them one by one.

The above algorithm is also known as external sort. Step 3 is known as N-way merge

Adel Boutros

divide and conquer is the best solution :)

divide your file to smaller ones, sort each file seperately then regroup.

Links:

Sort a file with huge volume of data given memory constraint

http://hackerne.ws/item?id=1603381

Why don't you try multithreading and increasing heap size of the program you are running? (this also requires you to use merge sort kind of thing provided you have more memory than 755mb in your system.)

Maybe u can use perl to format the file .and load into the database like mysql. it's so fast. and use the index to query the data. and write to another file.

u can set jvm heap size like '-Xms256m -Xmx1024m' .i hope to help u .thanks

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!