How to read all lines of a file in parallel in Java 8

喜夏-厌秋 submitted on 2019-12-28 05:58:14

Question


I want to read all lines of a 1 GB file into a Stream<String> as fast as possible. Currently I'm using Files.lines(path) for that. After parsing the file, I do some computations (map()/filter()). At first I thought this was already done in parallel, but it seems I was wrong: reading the file as it is takes about 50 seconds on my dual-CPU laptop. However, if I split the file using bash commands and then process the parts in parallel, it takes only about 30 seconds.

I tried the following combinations:

  1. single file, no parallel lines() stream ~ 50 seconds
  2. single file, Files.lines(..).parallel().[...] ~ 50 seconds
  3. two files, no parallel lines() stream ~ 30 seconds
  4. two files, Files.lines(..).parallel().[...] ~ 30 seconds

I ran these 4 multiple times with roughly the same results (within 1 or 2 seconds). The [...] is a chain of map and filter only, with a toArray(...) at the end to trigger the evaluation.
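For reference, the measurement described above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: the class name LinesBenchmark, the method process, and the trim/isEmpty steps are placeholders standing in for the unspecified map/filter chain.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LinesBenchmark {

    // Runs a map/filter chain over all lines and forces evaluation with toArray.
    // trim() and isEmpty() are placeholders for the real (unspecified) computation.
    static String[] process(Path path, boolean parallel) throws IOException {
        // Files.lines keeps the file open until the stream is closed.
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            Stream<String> s = parallel ? lines.parallel() : lines;
            return s.map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .toArray(String[]::new);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args[0]);
        for (boolean parallel : new boolean[] {false, true}) {
            long start = System.nanoTime();
            String[] result = process(path, parallel);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println((parallel ? "parallel" : "sequential")
                    + ": " + result.length + " lines in " + millis + " ms");
        }
    }
}
```

A single cold run like this is only a rough indication; JIT warm-up and the operating system's file cache can both shift the numbers between runs.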

The conclusion is that there is no difference in using lines().parallel(). As reading two files in parallel takes a shorter time, there is a performance gain from splitting the file. However it seems the whole file is read serially.

Edit: I want to point out that I use an SSD, so there is practically no seek time. The file has 1658652 (relatively short) lines in total. Splitting the file in bash takes about 1.5 seconds:

    time split -l 829326 file # 829326 = 1658652 / 2
    split -l 829326 file 0,14s user 1,41s system 16% cpu 9,560 total

So my question is, is there any class or function in the Java 8 JDK which can parallelize reading all lines without having to split it first? For example, if I have two CPU cores, the first line reader should start at the first line and a second one at line (totalLines/2)+1.


Answer 1:


You might find some help from this post. Trying to parallelize the actual reading of a file is probably barking up the wrong tree, as the biggest slowdown will be your file system (even on an SSD).

If you set up a file channel in memory, you should be able to process the data in parallel from there with great speed, but chances are you won't need it as you'll see a huge speed increase.
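One possible reading of "a file channel in memory" is a memory-mapped FileChannel. The sketch below follows that assumption: it maps the file, cuts the mapping into pieces that end on a newline, and decodes the pieces in parallel. The names MappedLines, readLines, decode, and the chunks parameter are made up for illustration, and the code assumes UTF-8 text in a file smaller than 2 GB (the limit of a single mapping). Whether this is faster than Files.lines depends on the workload and has to be measured.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MappedLines {

    // Maps the file, splits it into at most `chunks` pieces that each end
    // right after a '\n', and decodes the pieces in parallel.
    static List<String> readLines(Path path, int chunks) throws IOException {
        if (chunks < 1) {
            throw new IllegalArgumentException("chunks must be >= 1");
        }
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size > Integer.MAX_VALUE) {
                throw new IOException("file too large for a single mapping");
            }
            int n = (int) size;
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);

            // bounds[i] is the byte offset where piece i starts; bounds[0] is 0.
            int[] bounds = new int[chunks + 1];
            bounds[chunks] = n;
            for (int i = 1; i < chunks; i++) {
                int pos = Math.max(bounds[i - 1], (int) ((long) n * i / chunks));
                // Move forward to the next newline so no line is cut in half.
                // The byte 0x0A never occurs inside a multi-byte UTF-8 sequence.
                while (pos < n && buf.get(pos) != '\n') {
                    pos++;
                }
                bounds[i] = Math.min(n, pos + 1);
            }

            // Ordered parallel stream: the result keeps the file's line order.
            return IntStream.range(0, chunks).parallel()
                    .mapToObj(i -> decode(buf, bounds[i], bounds[i + 1]))
                    .flatMap(List::stream)
                    .collect(Collectors.toList());
        }
    }

    // Copies bytes [from, to) out of the mapping and splits them into lines.
    static List<String> decode(ByteBuffer buf, int from, int to) {
        if (from >= to) {
            return new ArrayList<>();
        }
        // duplicate() gives each task its own position and limit.
        ByteBuffer view = buf.duplicate();
        view.limit(to);
        view.position(from);
        byte[] bytes = new byte[to - from];
        view.get(bytes);
        String text = new String(bytes, StandardCharsets.UTF_8);
        return new BufferedReader(new StringReader(text)).lines()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        int cores = Runtime.getRuntime().availableProcessors();
        List<String> lines = readLines(Paths.get(args[0]), cores);
        System.out.println(lines.size() + " lines");
    }
}
```

The resulting list can then be processed with lines.parallelStream(), which splits evenly because the list knows its size. Note that this sketch holds all lines on the heap at once, so a 1 GB file needs a correspondingly large heap.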



Source: https://stackoverflow.com/questions/25711616/how-to-read-all-lines-of-a-file-in-parallel-in-java-8
