Iterate over lines in a file in parallel (Scala)?

遥遥无期 2020-12-12 22:33

I know about the parallel collections in Scala. They are handy! However, I would like to iterate over the lines of a file that is too large for memory in parallel. I coul

5 answers
  •  悲哀的现实
    2020-12-12 22:58

    I'll put this as a separate answer since it's fundamentally different from my last one (and it actually works).

    Here's an outline for a solution using actors, which is basically what Kim Stebel's comment describes. There are two actor classes: a single FileReader actor that reads individual lines from the file on demand, and several Worker actors. Each worker sends a request for a line to the reader, processes the line it gets back, and then requests the next one, so lines are processed in parallel as they are read from the file.

    I'm using Akka actors here, but the idea is basically the same with any other actor implementation.

    // Imports assume the Akka 1.x API (self.sender, actorOf(...).start, self.stop)
    import akka.actor.{Actor, ActorRef}
    import akka.actor.Actor.actorOf
    
    case object LineRequest
    case object BeginProcessing
    
    class FileReader extends Actor {
    
      // reads a single line from the file, or returns None at EOF
      def getLine: Option[String] = ...
    
      def receive = {
        // sender is an Option[ActorRef]; reply with Some(line) or None
        case LineRequest => self.sender.foreach { _ ! getLine }
      }
    }
    
    class Worker(reader: ActorRef) extends Actor {
    
      def process(line: String): Unit = ...
    
      def receive = {
        case BeginProcessing => reader ! LineRequest
        // the type pattern is needed because messages arrive as Any
        case Some(line: String) =>
          process(line)
          reader ! LineRequest // ask for the next line only after finishing this one
        case None => self.stop()
      }
    }
    
    val reader = actorOf[FileReader].start()
    val workers = Vector.fill(4)(actorOf(new Worker(reader)).start())
    workers.foreach { _ ! BeginProcessing }
    // wait for the workers to stop...
    

    This way, no more than 4 (or however many workers you have) unprocessed lines are in memory at a time.
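
    For comparison, the same pull-based idea can be sketched with only the standard library, which may help if an actor dependency is not wanted. This is a rough sketch, not the code from the answer above: LineSource, nextLine, ParallelLines and processAll are hypothetical names introduced here, and a synchronized method stands in for the FileReader actor's mailbox. Each worker future pulls one line, processes it, and pulls again until the source reports end of input.

    ```scala
    import java.io.{BufferedReader, StringReader}
    import java.util.concurrent.atomic.AtomicInteger
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    // Hypothetical helper: hands out one line per call, None at end of input.
    // synchronized makes it safe to share between worker threads.
    final class LineSource(reader: BufferedReader) {
      def nextLine(): Option[String] = synchronized { Option(reader.readLine()) }
    }

    object ParallelLines {
      // Starts `workers` loops that each pull a line, process it, and pull again.
      def processAll(source: LineSource, workers: Int)(process: String => Unit)
                    (implicit ec: ExecutionContext): Unit = {
        def loop(): Unit = {
          var line = source.nextLine()
          while (line.isDefined) {
            process(line.get)
            line = source.nextLine()
          }
        }
        val running = Vector.fill(workers)(Future(loop()))
        // Block until every worker has seen end of input.
        Await.result(Future.sequence(running), Duration.Inf)
      }
    }
    ```

    For a real file, the BufferedReader could come from java.nio.file.Files.newBufferedReader. As with the actor version, each worker holds at most one unprocessed line at a time; the only other buffering is the reader's own internal buffer.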
