how to split a large text file into smaller chunks using java multithread


Question


I'm trying to develop a multithreaded Java program to split a large text file into smaller text files. Each of the smaller files must have a fixed number of lines. For example: if the input file has 100 lines and the input number is 10, my program should split the input file into 10 files. I've already developed a single-threaded version of my program:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class TextFileSingleThreaded {

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("Invalid Input!");
            return;
        }

        //first argument is the file path
        File file = new File(args[0]);

        //second argument is the number of lines per chunk
        //In particular the smaller files will have numLinesPerChunk lines
        int numLinesPerChunk = Integer.parseInt(args[1]);

        BufferedReader reader = null;
        PrintWriter writer = null;
        try {
            reader = new BufferedReader(new FileReader(file));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
            return;
        }

        String line;        

        long start = System.currentTimeMillis();

        try {
            line = reader.readLine();
            for (int i = 1; line != null; i++) {
                writer = new PrintWriter(new FileWriter(args[0] + "_part" + i + ".txt"));
                for (int j = 0; j < numLinesPerChunk && line != null; j++) {
                    writer.println(line);
                    line = reader.readLine();
                }
                writer.close(); // closing also flushes, and releases this chunk's file handle
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        long end = System.currentTimeMillis();

        System.out.println("Taken time[sec]:");
        System.out.println((end - start) / 1000);

    }

}
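For reference, the two arguments are the input file path and the number of lines per chunk, so a run like the following (with input.txt as a placeholder file name) produces input.txt_part1.txt, input.txt_part2.txt, and so on, each with 10 lines:

java TextFileSingleThreaded input.txt 10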

I want to write a multithreaded version of this program but I don't know how to read a file beginning from a specified line. Help me please. :(


Answer 1:


I want to write a multithreaded version of this program but I don't know how to read a file beginning from a specified line. Help me please. :(

I would not, as this question implies, have each thread read from the beginning of the file, ignoring lines until it comes to its portion of the input file. This is highly inefficient. As you imply, each reader would have to read all of the prior lines if the file is going to be divided into chunks by lines. That means a whole bunch of duplicate read IO, which will result in a much slower application.

You could instead have 1 reader and N writers. The reader would add the lines to be written to some sort of BlockingQueue per writer. The problem with this is that chances are you won't get any concurrency: only one writer will most likely be working at any one time while the rest of the writers wait for the reader to reach their part of the input file. Also, if the reader is faster than the writers (which is likely), then you could easily run out of memory queueing up all of the lines if the file to be divided is large. You could use a size-limited blocking queue, which means the reader may block waiting for the writers, but again, multiple writers will most likely not be running at the same time.
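For illustration only, here is a minimal sketch of that one-reader/N-writers layout. It assumes the number of output files is passed as a hypothetical third argument, and uses a bounded ArrayBlockingQueue per writer plus a sentinel object to tell each writer to stop; it is not the recommended approach, for the reasons above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class OneReaderManyWriters {
    // sentinel compared by reference, so it can never collide with a real line
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws Exception {
        String path = args[0];
        int numLinesPerChunk = Integer.parseInt(args[1]);
        int numWriters = Integer.parseInt(args[2]); // hypothetical extra argument

        List<BlockingQueue<String>> queues = new ArrayList<>();
        List<Thread> writers = new ArrayList<>();
        for (int i = 0; i < numWriters; i++) {
            // bounded queue: the reader blocks instead of buffering the whole file in memory
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
            queues.add(queue);
            final String outPath = path + "_part" + (i + 1) + ".txt";
            Thread writer = new Thread(() -> {
                try (PrintWriter out = new PrintWriter(new FileWriter(outPath))) {
                    String line;
                    while ((line = queue.take()) != POISON) {
                        out.println(line);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            writer.start();
            writers.add(writer);
        }

        // single reader: chunk k of numLinesPerChunk lines goes to writer k (extra lines go to the last writer)
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            long count = 0;
            while ((line = reader.readLine()) != null) {
                int target = (int) Math.min(count / numLinesPerChunk, numWriters - 1);
                queues.get(target).put(line);
                count++;
            }
        }
        for (BlockingQueue<String> queue : queues) {
            queue.put(POISON); // tell every writer there is nothing left
        }
        for (Thread writer : writers) {
            writer.join();
        }
    }
}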

As mentioned in the comments, the most efficient way of doing this is single-threaded because of these restrictions. If you are doing this as an exercise, then it sounds like you will need to read through the file once, note the start and end positions in the file for each of the output files, and then fork the threads with those locations so they can re-read the file and write their sections into separate output files in parallel, without a lot of line buffering.
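A rough sketch of that two-pass idea, under some assumptions of my own (the class name, the use of RandomAccessFile for the offset scan, and the fixed-size thread pool are all placeholders, not code from the question):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OffsetSplitSketch {

    public static void main(String[] args) throws Exception {
        final String path = args[0];
        int numLinesPerChunk = Integer.parseInt(args[1]);

        // pass 1: single scan to record the byte offset where each chunk starts
        // (RandomAccessFile.readLine is unbuffered and slow; good enough for a sketch)
        List<Long> chunkStarts = new ArrayList<>();
        long fileLength;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            fileLength = raf.length();
            chunkStarts.add(0L);
            long lineCount = 0;
            while (raf.readLine() != null) {
                lineCount++;
                if (lineCount % numLinesPerChunk == 0 && raf.getFilePointer() < fileLength) {
                    chunkStarts.add(raf.getFilePointer());
                }
            }
        }

        // pass 2: each task re-opens the file, seeks to its start offset and copies its byte range
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < chunkStarts.size(); i++) {
            final long start = chunkStarts.get(i);
            final long end = (i + 1 < chunkStarts.size()) ? chunkStarts.get(i + 1) : fileLength;
            final String outPath = path + "_part" + (i + 1) + ".txt";
            pool.submit(() -> {
                try (RandomAccessFile in = new RandomAccessFile(path, "r");
                     RandomAccessFile out = new RandomAccessFile(outPath, "rw")) {
                    out.setLength(0); // truncate any previous contents
                    in.seek(start);
                    byte[] buffer = new byte[8192];
                    long remaining = end - start;
                    while (remaining > 0) {
                        int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                        if (read == -1) {
                            break;
                        }
                        out.write(buffer, 0, read);
                        remaining -= read;
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}

Because the offsets are recorded at line boundaries, every output file still contains whole lines, while the copy itself is plain byte I/O that can run in parallel without any per-line buffering.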




Answer 2:


You only need to read your file once and store it in a List:

BufferedReader br = new BufferedReader(new FileReader(new File("yourfile")));
List<String> list = new ArrayList<String>();
String line;
//for each line of your file
while((line = br.readLine()) != null){
    list.add(line);
}
br.close();

//then you can split your list into different parts
List<List<String>> parts = new ArrayList<List<String>>();
for(int i = 0; i < 10; i++){
  parts.add(new ArrayList<String>());
  for(int j =0; j < 10; j++){
    parts.get(i).add(list.get(i*10+j));
  }
}
//now you have 10 lists which each contain 10 lines
//you still need to create a thread pool, where each thread writes one list to a file (a sketch follows below)

For more information about thread pools, read this.
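For example, a possible continuation of the snippet above (the pool size, the output file names, and the awaitTermination timeout are my own assumptions) could hand each part to a fixed thread pool; note that awaitTermination throws InterruptedException, so the enclosing method must declare it, just as the snippet above already lets IOException propagate:

// extra imports needed: java.io.FileWriter, java.io.IOException, java.io.PrintWriter,
// java.util.concurrent.ExecutorService, java.util.concurrent.Executors, java.util.concurrent.TimeUnit
ExecutorService pool = Executors.newFixedThreadPool(4); // pool size chosen arbitrarily
for (int i = 0; i < parts.size(); i++) {
    final List<String> part = parts.get(i);
    final String outName = "yourfile_part" + (i + 1) + ".txt";
    pool.submit(() -> {
        try (PrintWriter writer = new PrintWriter(new FileWriter(outName))) {
            for (String s : part) {
                writer.println(s);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.MINUTES); // wait for all the writer tasks to finish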



Source: https://stackoverflow.com/questions/17927398/how-to-split-a-large-text-file-into-smaller-chunks-using-java-multithread
