Java: split a large file into smaller files without breaking multiline records

Submitted by 青春壹個敷衍的年華 on 2021-01-05 07:18:26

Question


I have records that span multiple lines in a file. The only way to identify the end of a record is that a new record starts with ABC. Below is a sample. The file size could be 5-10 GB, and I am looking for efficient Java logic ONLY to split the file (no need to read every line), but the splitting logic should check that each new file starts with a new record, which in this case begins with "ABC".

To add a few more details: I am just looking to split the file, and when splitting, the last record in each file should end complete.

Can someone please suggest?

HDR
ABCline1goesonforrecord1   //first record 
line2goesonForRecord1      
line3goesonForRecord1          
line4goesonForRecord1
ABCline2goesOnForRecord2  //second record
line2goesonForRecord2
line3goesonForRecord2
line4goesonForRecord2
line5goesonForRecord2
ABCline2goesOnForRecord3     //third record
line2goesonForRecord3
line3goesonForRecord3
line4goesonForRecord3
TRL


Answer 1:


So, this is the code that you need. I tested it on a 10 GB file, and it takes 64 seconds to split it.

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.TimeUnit;

public class FileSplitter {

    private final Path filePath;
    private BufferedWriter writer;
    private int fileCounter = 1;

    public static void main(String[] args) throws Exception {
        long startTime = System.nanoTime();
        new FileSplitter(Path.of("/tmp/bigfile.txt")).split();
        System.out.println("Time to split " + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - startTime));
    }

    // Helper used to generate the ~10 GB test input; call it from main once if needed.
    private static void generateBigFile() throws Exception {
        var writer = Files.newBufferedWriter(Path.of("/tmp/bigfile.txt"), StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
        for (int i = 0; i < 100_000; i++) {
            writer.write(String.format("ABCline1goesonforrecord%d\n", i + 1));
            for (int j = 0; j < 10_000; j++) {
                writer.write(String.format("line%dgoesonForRecord%d\n", j + 2, i + 1));
            }
        }

        writer.flush();
        writer.close();
    }

    public FileSplitter(Path filePath) {
        this.filePath = filePath;
    }

    void split() throws IOException {
        try (var stream = Files.lines(filePath, StandardCharsets.UTF_8)) {
            stream.forEach(line -> {
                if (line.startsWith("ABC")) {
                    closeWriter();
                    openWriter();
                }
                writeLine(line);
            });
        }
        closeWriter();
    }

    private void writeLine(String line) {
        // writer is null until the first "ABC" line, so the leading HDR line is skipped.
        if (writer != null) {
            try {
                writer.write(line);
                writer.write("\n");
            } catch (IOException e) {
                throw new UncheckedIOException("Failed to write line to file part", e);
            }
        }
    }

    private void openWriter() {
        if (this.writer == null) {
            var filePartName = filePath.getFileName().toString().replace(".", "_part" + fileCounter + ".");
            try {
                // The output directory /tmp/split must already exist.
                writer = Files.newBufferedWriter(Path.of("/tmp/split", filePartName), StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
            } catch (IOException e) {
                throw new UncheckedIOException("Failed to open file part", e);
            }
            fileCounter++;
        }
    }

    private void closeWriter() {
        if (writer != null) {
            try {
                writer.flush();
                writer.close();
                writer = null;
            } catch (IOException e) {
                throw new UncheckedIOException("Failed to close writer", e);
            }
        }
    }
}

By the way, the solution with Scanner works too.
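For reference, a minimal single-pass sketch of such a Scanner-based variant might look like this (the part-file naming is illustrative, not from the answer):

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

public class ScannerSplitter {

    // Start a new part file each time a line begins with "ABC";
    // lines before the first "ABC" (e.g. the HDR) are skipped, and the
    // trailing TRL line simply ends up in the last part.
    public static void split(File input, File outputDir) throws IOException {
        int part = 0;
        PrintWriter out = null;
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                if (line.startsWith("ABC")) {
                    if (out != null) {
                        out.close(); // finish the previous record's file
                    }
                    out = new PrintWriter(new File(outputDir, "part" + (++part) + ".txt"));
                }
                if (out != null) {
                    out.println(line);
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
    }
}
```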

Regarding not reading all the lines: I don't see why you want to avoid this. If you choose not to read all the lines (it is possible), then first, you will overcomplicate the solution, and second, I'm pretty sure you will lose performance because of the extra logic you have to build into the splitting.
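For completeness (this is not from the answer), a rough sketch of that byte-oriented "don't read every line" approach might look like the following: read the file in large chunks, and only scan forward from each target cut point for the byte sequence `\nABC` to find a safe record boundary. The chunk size and part naming are illustrative assumptions:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkSplitter {

    // Split 'input' into parts of roughly 'targetBytes' each, cutting only
    // immediately before a line that starts with "ABC" so no record is broken.
    public static void split(Path input, Path outDir, long targetBytes) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(input.toFile(), "r")) {
            long length = raf.length();
            long start = 0;
            int part = 0;
            while (start < length) {
                long cut = Math.min(start + targetBytes, length);
                if (cut < length) {
                    long next = nextRecordStart(raf, cut);
                    cut = (next < 0) ? length : next; // no later record: rest goes in this part
                }
                copyRange(raf.getChannel(), start, cut, outDir.resolve("part" + (++part) + ".txt"));
                start = cut;
            }
        }
    }

    // Scan forward from 'from' for the byte sequence "\nABC" and return the
    // offset of the 'A' (the start of the next record), or -1 if none is found.
    private static long nextRecordStart(RandomAccessFile raf, long from) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long pos = Math.max(0, from - 1); // back up one byte in case '\n' sits right before 'from'
        byte[] window = new byte[4];      // rolling window of the last four bytes seen
        int filled = 0;
        raf.seek(pos);
        int n;
        while ((n = raf.read(buf)) > 0) {
            for (int i = 0; i < n; i++) {
                System.arraycopy(window, 1, window, 0, 3);
                window[3] = buf[i];
                if (filled < 4) filled++;
                if (filled == 4 && window[0] == '\n' && window[1] == 'A'
                        && window[2] == 'B' && window[3] == 'C') {
                    return pos + i - 2; // absolute offset of the 'A'
                }
            }
            pos += n;
        }
        return -1;
    }

    // Copy bytes [start, end) of 'in' to a new file at 'out'.
    private static void copyRange(FileChannel in, long start, long end, Path out) throws IOException {
        try (FileChannel dst = FileChannel.open(out, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = start;
            while (pos < end) {
                pos += in.transferTo(pos, end - pos, dst);
            }
        }
    }
}
```

As the answer notes, the boundary-search logic is exactly the complication you take on in exchange for skipping the per-line read.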




Answer 2:


I didn't test this, but something like this should work. You are not reading the whole file into memory, just one line at a time, so memory use should not be bad.

// Requires: java.io.*, java.nio.charset.StandardCharsets, java.util.LinkedList, java.util.Scanner
public void splitRecords(String filename) throws IOException {
        /*
            HDR
            ABCline1goesonforrecord1   //first record
            line2goesonForRecord1
            line3goesonForRecord1
            line4goesonForRecord1
            ABCline2goesOnForRecord2  //second record
            line2goesonForRecord2
            line3goesonForRecord2
            line4goesonForRecord2
            line5goesonForRecord2
            ABCline2goesOnForRecord3     //third record
            line2goesonForRecord3
            line3goesonForRecord3
            line4goesonForRecord3
            TRL
         */
        // First pass: you do not want to edit the existing file in case things go wrong,
        // so collect the line indexes where a new record starts.
        LinkedList<Long> startOfRecordIndexes = new LinkedList<>();
        try (Scanner scanFile = new Scanner(new File(filename))) {
            long index = 0;
            while (scanFile.hasNextLine()) {
                if (scanFile.nextLine().startsWith("ABC")) {
                    startOfRecordIndexes.add(index);
                }
                index++;
            }
        }

        // Second pass: iterate through the list and create the new record files.
        // Note that Scanner.reset() does NOT rewind the input (it only resets
        // delimiters and locale), so a fresh Scanner is needed here.
        try (Scanner scanFile = new Scanner(new File(filename))) {
            long index = 0;
            int part = 0;
            BufferedWriter writer = null;
            while (scanFile.hasNextLine()) {
                String line = scanFile.nextLine();
                if (!startOfRecordIndexes.isEmpty() && index == startOfRecordIndexes.peek()) {
                    if (writer != null) {
                        writer.write("TRL");
                        writer.newLine();
                        writer.close();
                    }
                    // The part-file naming is just an example; use whatever unique name you need.
                    writer = new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream(filename + ".part" + (++part)), StandardCharsets.UTF_8));
                    writer.write("HDR");
                    writer.newLine();
                    writer.write(line);
                    writer.newLine();
                    startOfRecordIndexes.remove();
                } else if (writer != null && !line.equals("TRL")) {
                    // Skip the original HDR (before the first record) and the original
                    // trailing TRL; each part file gets its own HDR/TRL instead.
                    writer.write(line);
                    writer.newLine();
                }
                index++;
            }
            // Close the last record
            if (writer != null) {
                writer.write("TRL");
                writer.newLine();
                writer.close();
            }
        }
    }


Source: https://stackoverflow.com/questions/65079254/java-split-large-files-into-smaller-files-while-splitting-the-multiline-record-w
