Read the 30Million user id's one by one from the big file

I am trying to read a very big file using Java. That big file will have data like this, meaning each line will have an user id.

And in that big file there will be around 30Million user id's. Now I am trying to read all the user id's one by one from that big file only once. Meaning each user id should be selected only once from that big file. For example, if I have 30Million user id's then it should print 30 Million user id only once with the use of Multithreading code.

Below is the code I have which is a multithreaded code running with 10 threads but with the below program, I am not able to make sure that each user id is selected only once.

public class ReadingFile {


    public static void main(String[] args) {

        // create thread pool with given size
        ExecutorService service = Executors.newFixedThreadPool(10);

        for (int i = 0; i < 10; i++) {
            service.submit(new FileTask());
        }
    }
}

class FileTask implements Runnable {

    @Override
    public void run() {

        BufferedReader br = null;
        try {
            br = new BufferedReader(new FileReader("D:/abc.txt"));
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
                //do things with line
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                br.close();
            } catch (IOException e) {

                e.printStackTrace();
            }
        }
    }
}

Can anybody help me with this? What wrong I am doing? And what is the fastest way to do this?

Zim-Zam O'Pootertoot

You really can't improve on having one thread reading the file sequentially, assuming that you haven't done anything like stripe the file across multiple disks. With one thread, you do one seek and then one long sequential read; with multiple threads you're going to have the threads causing multiple seeks as each gains control of the disk head.

Edit: This is a way to parallelize the line processing while still using serial I/O to read the lines. It uses a BlockingQueue to communicate between threads; the FileTask adds lines to the queue, and the CPUTask reads them and processes them. This is a thread-safe data structure, so no need to add any synchronization to it. You're using put(E e) to add strings to the queue, so if the queue is full (it can hold up to 200 strings, as defined in the declaration in ReadingFile) the FileTask blocks until space frees up; likewise you're using take() to remove items from the queue, so the CPUTask will block until an item is available.

public class ReadingFile {
    public static void main(String[] args) {

        final int threadCount = 10;

        // BlockingQueue with a capacity of 200
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(200);

        // create thread pool with given size
        ExecutorService service = Executors.newFixedThreadPool(threadCount);

        for (int i = 0; i < (threadCount - 1); i++) {
            service.submit(new CPUTask(queue));
        }

        // Wait til FileTask completes
        service.submit(new FileTask(queue)).get();

        service.shutdownNow();  // interrupt CPUTasks

        // Wait til CPUTasks terminate
        service.awaitTermination(365, TimeUnit.DAYS);

    }
}

class FileTask implements Runnable {

    private final BlockingQueue<String> queue;

    public FileTask(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        BufferedReader br = null;
        try {
            br = new BufferedReader(new FileReader("D:/abc.txt"));
            String line;
            while ((line = br.readLine()) != null) {
                // block if the queue is full
                queue.put(line);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                br.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

class CPUTask implements Runnable {

    private final BlockingQueue<String> queue;

    public CPUTask(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        String line;
        while(true) {
            try {
                // block if the queue is empty
                line = queue.take(); 
                // do things with line
            } catch (InterruptedException ex) {
                break; // FileTask has completed
            }
        }
        // poll() returns null if the queue is empty
        while((line = queue.poll()) != null) {
            // do things with line;
        }
    }
}

We are talking about an average of a 315 MB file with lines separated by new line. I presume this easily fits into memory. It is implied that there is no particular order in the user names that has to be conserved. So I would recommend the following algorithm:

Get the file length
Copy each a 10th of the file into a byte buffer (binary copy should be fast)
Start a thread for processing each of these buffers
Each thread processes all lines in his area except the first and last one.
Each thread must return the first and last partitial line in its data when done,
the “last” of each thread must be recombined with the “first” one of the one working on the next file block because you may have cut through a line. And these tokens must then be processed afterwards.

Fork Join API introduced in 1.7 is a great fit for this use case. Check out http://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html. If you search, you are going to find lots of examples out there.

来源：https://stackoverflow.com/questions/17220892/read-the-30million-user-ids-one-by-one-from-the-big-file

标签

java

multithreading

file

bufferedreader