I understand that on a normal spindle-drive system, reading a file with multiple threads is inefficient.
This is a different case: I have a high-throughput
You should first try the Java 7 Files.readAllLines:
List<String> lines = Files.readAllLines(Paths.get(path), encoding);
Using a multithreaded approach is probably not a good option, as it will force the file system to perform random reads (which is never a good thing on a file system).
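For reference, here is a self-contained version of the call above, as a minimal sketch. The class name ReadAllLinesDemo and the choice of UTF-8 are assumptions made for illustration. Note that readAllLines loads the whole file into memory, so it only suits files that fit in the heap.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical wrapper class, used only to make the one-line call runnable.
class ReadAllLinesDemo
{
    // Reads every line of the file into memory, decoding it as UTF-8 (assumed encoding).
    static List<String> readLines(String path) throws IOException
    {
        return Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
    }

    // args[0] is the path of the file to read
    public static void main(String[] args) throws IOException
    {
        List<String> lines = readLines(args[0]);
        System.out.println("Read " + lines.size() + " lines");
    }
}
```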
Here is the solution to read a single file with multiple threads.
Divide the file into N chunks, read each chunk in a thread, then merge them in order. Beware of lines that cross chunk boundaries. This is the basic idea suggested by user slaks.
Benchmark of the multithreaded implementation below on a single 20 GB file:
1 Thread : 50 seconds : 400 MB/s
2 Threads: 30 seconds : 666 MB/s
4 Threads: 20 seconds : 1 GB/s
8 Threads: 60 seconds : 333 MB/s
Equivalent Java7 readAllLines() : 400 seconds : 50 MB/s
Note: This may only work on systems designed to support high-throughput I/O, and not on typical personal computers.
package filereadtests;
import java.io.*;
import static java.lang.Math.toIntExact;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.Charset;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class FileRead implements Runnable
{
private FileChannel _channel;
private long _startLocation;
private int _size;
int _sequence_number;
public FileRead(long loc, int size, FileChannel chnl, int sequence)
{
_startLocation = loc;
_size = size;
_channel = chnl;
_sequence_number = sequence;
}
@Override
public void run()
{
try
{
System.out.println("Reading the channel: " + _startLocation + ":" + _size);
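//allocate memory
ByteBuffer buff = ByteBuffer.allocate(_size);
//Read file chunk to RAM; a single read() may return fewer bytes than requested,
//so keep reading until the buffer is full or the end of the file is reached
long position = _startLocation;
while (buff.hasRemaining())
{
int bytes_read = _channel.read(buff, position);
if (bytes_read < 0) break; //end of file
position += bytes_read;
}
//chunk to String (a multi-byte UTF-8 character split across two chunks will not decode correctly here)
String string_chunk = new String(buff.array(), Charset.forName("UTF-8"));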
System.out.println("Done Reading the channel: " + _startLocation + ":" + _size);
} catch (Exception e)
{
e.printStackTrace();
}
}
//args[0] is path to read file
//args[1] is the size of the thread pool; try different values to find the sweet spot
public static void main(String[] args) throws Exception
{
FileInputStream fileInputStream = new FileInputStream(args[0]);
FileChannel channel = fileInputStream.getChannel();
long remaining_size = channel.size(); //get the total number of bytes in the file
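//file_size/threads, but at least 1 byte so the loop below always advances (a zero chunk size would never terminate)
long chunk_size = Math.max(1, remaining_size / Integer.parseInt(args[1]));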
//Max allocation size allowed is ~2GB
if (chunk_size > (Integer.MAX_VALUE - 5))
{
chunk_size = (Integer.MAX_VALUE - 5);
}
//thread pool
ExecutorService executor = Executors.newFixedThreadPool(Integer.parseInt(args[1]));
long start_loc = 0;//file pointer
int i = 0; //loop counter
while (remaining_size >= chunk_size)
{
//launches a new thread
executor.execute(new FileRead(start_loc, toIntExact(chunk_size), channel, i));
remaining_size = remaining_size - chunk_size;
start_loc = start_loc + chunk_size;
i++;
}
//load the last remaining piece
executor.execute(new FileRead(start_loc, toIntExact(remaining_size), channel, i));
//Tear Down
executor.shutdown();
//Wait for all threads to finish; this blocks instead of spinning in a busy loop
executor.awaitTermination(Long.MAX_VALUE, java.util.concurrent.TimeUnit.NANOSECONDS);
System.out.println("Finished all threads");
fileInputStream.close();
}
}
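The program above reads the chunks but discards them, so the "merge them in order" step and the warning about lines crossing chunk boundaries are left to the reader. The following is a minimal sketch of one way to do both, not the benchmarked implementation. The class OrderedChunkReader, its methods and the chunkSize parameter are hypothetical names introduced for illustration, and the sketch assumes UTF-8 text with "\n" line endings. Chunks are read in parallel, consumed in file order, and the bytes of an unfinished line are carried over into the next chunk before decoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class, not part of the original program.
class OrderedChunkReader
{
    // Reads up to `size` bytes starting at `start`; loops because one read() may return fewer bytes.
    static byte[] readChunk(FileChannel channel, long start, int size) throws IOException
    {
        ByteBuffer buffer = ByteBuffer.allocate(size);
        long position = start;
        while (buffer.hasRemaining())
        {
            int n = channel.read(buffer, position);
            if (n < 0) break; // end of file
            position += n;
        }
        return Arrays.copyOf(buffer.array(), buffer.position());
    }

    // Reads the file in chunks of `chunkSize` bytes on `threads` threads and returns its lines in order.
    static List<String> readLines(Path path, int threads, int chunkSize) throws Exception
    {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ))
        {
            long fileSize = channel.size();
            List<Future<byte[]>> futures = new ArrayList<>();
            for (long start = 0; start < fileSize; start += chunkSize)
            {
                final long chunkStart = start;
                final int size = (int) Math.min(chunkSize, fileSize - start);
                // Positional reads on a FileChannel are safe to issue from several threads.
                futures.add(executor.submit(() -> readChunk(channel, chunkStart, size)));
            }

            List<String> lines = new ArrayList<>();
            ByteArrayOutputStream carry = new ByteArrayOutputStream(); // bytes of the unfinished line
            for (Future<byte[]> future : futures) // the futures list preserves file order
            {
                byte[] chunk = future.get();
                int lineStart = 0;
                for (int i = 0; i < chunk.length; i++)
                {
                    if (chunk[i] == '\n')
                    {
                        carry.write(chunk, lineStart, i - lineStart);
                        // Decode only complete lines, so a split multi-byte character is never decoded in halves.
                        lines.add(new String(carry.toByteArray(), StandardCharsets.UTF_8));
                        carry.reset();
                        lineStart = i + 1;
                    }
                }
                carry.write(chunk, lineStart, chunk.length - lineStart);
            }
            if (carry.size() > 0)
            {
                lines.add(new String(carry.toByteArray(), StandardCharsets.UTF_8)); // last line without "\n"
            }
            return lines;
        }
        finally
        {
            executor.shutdown();
        }
    }

    // args[0] is the path, args[1] the thread count, args[2] the chunk size in bytes
    public static void main(String[] args) throws Exception
    {
        List<String> lines = readLines(Paths.get(args[0]), Integer.parseInt(args[1]), Integer.parseInt(args[2]));
        System.out.println("Read " + lines.size() + " lines");
    }
}
```

Splitting on the newline byte is safe for UTF-8 because that byte never occurs inside a multi-byte character. This sketch submits every chunk up front, so all pending chunks may sit in memory at once; for a file as large as the 20 GB one benchmarked above, the number of outstanding chunks would need to be bounded.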