How to process chunks of a file with java.util.stream

Submitted by 心已入冬 on 2019-12-01 11:29:49

Here is a solution that converts a Stream<String>, each element representing a line, into a Stream<List<String>>, each element representing a chunk delimited by lines matching specified start and end predicates:

import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class ChunkSpliterator implements Spliterator<List<String>> {
    private final Spliterator<String> source;
    private final Predicate<String> start, end;
    private final Consumer<String> getChunk;
    private List<String> current;

    ChunkSpliterator(Spliterator<String> lineSpliterator,
        Predicate<String> chunkStart, Predicate<String> chunkEnd) {
        source=lineSpliterator;
        start=chunkStart;
        end=chunkEnd;
        // Accumulate lines into the current chunk; a line matching the
        // start predicate opens a new chunk but is not added itself.
        getChunk=s -> {
            if(current!=null) current.add(s);
            else if(start.test(s)) current=new ArrayList<>();
        };
    }
    public boolean tryAdvance(Consumer<? super List<String>> action) {
        // Pull lines from the source until the current chunk is complete,
        // i.e. its last added line matches the end predicate.
        while(current==null || current.isEmpty()
                            || !end.test(current.get(current.size()-1)))
            if(!source.tryAdvance(getChunk)) return false;
        current.remove(current.size()-1); // drop the end-delimiter line
        action.accept(current);
        current=null;
        return true;
    }
    public Spliterator<List<String>> trySplit() {
        return null; // chunk boundaries can only be found sequentially
    }
    public long estimateSize() {
        return Long.MAX_VALUE; // number of chunks is unknown
    }
    public int characteristics() {
        return ORDERED|NONNULL;
    }

    public static Stream<List<String>> toChunks(Stream<String> lines,
        Predicate<String> chunkStart, Predicate<String> chunkEnd,
        boolean parallel) {

        return StreamSupport.stream(
            new ChunkSpliterator(lines.spliterator(), chunkStart, chunkEnd),
            parallel);
    }
}

The delimiter lines matching the predicates are not included in the chunks; it would be easy to change this behavior, if desired.
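To illustrate that change, here is a minimal, self-contained sketch of the same spliterator-based technique, modified so the delimiter lines stay inside each chunk: the start line is added when the chunk is opened, and the end line is simply not removed. The class name `InclusiveChunks` and the in-memory sample data are assumptions for the demo, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class InclusiveChunks {
    // Variant of the answer's logic that keeps both delimiter lines
    // inside the emitted chunk.
    static Stream<List<String>> toChunks(Stream<String> lines,
            Predicate<String> start, Predicate<String> end) {
        Spliterator<String> source = lines.spliterator();
        return StreamSupport.stream(
            new Spliterators.AbstractSpliterator<List<String>>(
                    Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
                List<String> current;
                public boolean tryAdvance(Consumer<? super List<String>> action) {
                    while (current == null
                            || !end.test(current.get(current.size() - 1)))
                        if (!source.tryAdvance(s -> {
                            if (current != null) current.add(s);
                            else if (start.test(s)) {
                                current = new ArrayList<>();
                                current.add(s); // keep the start line
                            }
                        })) return false;
                    action.accept(current); // end line stays in the list
                    current = null;
                    return true;
                }
            }, false);
    }

    public static void main(String[] args) {
        List<List<String>> chunks = toChunks(
                Stream.of("junk", "<start>", "a", "b", "<stop>",
                          "<start>", "c", "<stop>", "trailing junk"),
                "<start>"::equals, "<stop>"::equals)
            .collect(Collectors.toList());
        System.out.println(chunks);
        // [[<start>, a, b, <stop>], [<start>, c, <stop>]]
    }
}
```

Lines outside any start/end pair are still discarded, exactly as in the original version.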

It can be used like this:

ChunkSpliterator.toChunks( Files.lines(Paths.get(myFile)),
    Pattern.compile("^<start>$").asPredicate(),
    Pattern.compile("^<stop>$").asPredicate(),
    true )
   .collect(new MyProcessOneBucketCollector<>());

The patterns are specified as ^word$ to require that the entire line consist of the word only; without these anchors, any line containing the pattern could start or end a chunk. The nature of the source stream does not allow parallelism when creating the chunks, so when chaining with an immediate collection operation, the parallelism of the entire operation is rather limited. It depends on the MyProcessOneBucketCollector whether there can be any parallelism at all.

If your final result does not depend on the order of occurrence of the buckets in the source file, it is strongly recommended that your collector either report itself as UNORDERED, or that you insert an unordered() into the stream's method chain before the collect.
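As a sketch of the first option, here is a collector that declares the UNORDERED characteristic. The reduction it performs (summing bucket sizes) is a hypothetical stand-in for whatever MyProcessOneBucketCollector does per bucket; only the characteristics declaration is the point.

```java
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class UnorderedBucketCollector {
    // A bucket collector that reports itself UNORDERED, so the stream
    // framework may combine partial results without preserving bucket order.
    static Collector<List<String>, ?, Long> totalLines() {
        return Collector.of(
            () -> new long[1],                        // mutable accumulator
            (acc, bucket) -> acc[0] += bucket.size(), // fold one bucket in
            (a, b) -> { a[0] += b[0]; return a; },    // merge partial results
            acc -> acc[0],                            // extract the total
            Collector.Characteristics.UNORDERED);     // order-insensitive
    }

    public static void main(String[] args) {
        long n = Stream.of(List.of("a", "b"), List.of("c"))
                       .collect(totalLines());
        System.out.println(n); // 3
    }
}
```

Such a collector can be passed to the collect call in the usage example above in place of MyProcessOneBucketCollector.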
