Asynchronous iteratee processing in Scalaz

Submitted by 左心房为你撑大大i on 2019-12-07 03:22:35

Question


I've been using Scalaz 7 iteratees to process a large (i.e., unbounded) stream of data in constant heap space.

In code, it looks something like this:

type ErrorOrT[M[+_], A] = EitherT[M, Throwable, A]
type ErrorOr[A] = ErrorOrT[IO, A]

def processChunk(c: Chunk): Result

def process(data: EnumeratorT[Chunk, ErrorOr]): IterateeT[Chunk, ErrorOr, List[Result]] =
  Iteratee.fold[Chunk, ErrorOr, List[Result]](Nil) { (rs, c) =>
    processChunk(c) :: rs
  } &= data

Now I'd like to perform the processing in parallel, working on P chunks of data at a time. I still have to limit heap space, but it's reasonable to assume that there's enough heap to store P chunks of data and the accumulated results of the computation.

I'm aware of the Task class and thought of mapping over the enumerator to create a stream of tasks:

data map (c => Task.delay(processChunk(c)))

But I'm still not sure how to manage the non-determinism. While consuming the stream, how do I ensure that P tasks are running whenever possible?

First try:

My first stab at a solution was to fold over the stream and create a Scala Future to process each chunk. However, the program blew up with a GC overhead error (presumably because it was pulling all the chunks into memory as it tried to create all the Futures). Instead, the iteratee needs to stop consuming input when there are already P tasks running and resume when any of those tasks finishes.
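To make the stop-and-resume requirement concrete, here is a minimal sketch that uses only the Scala and Java standard libraries on a plain Iterator, not Scalaz iteratees. The names BoundedParallel and processBounded are hypothetical, and blocking the consuming thread on a Semaphore is just one possible way to get back-pressure, not a Scalaz facility.

```scala
import java.util.concurrent.Semaphore
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object BoundedParallel {
  // Hypothetical helper (not part of Scalaz): applies `f` to every chunk,
  // allowing at most `p` applications to be in flight at any time.
  def processBounded[A, B](chunks: Iterator[A], p: Int)(f: A => B)(
      implicit ec: ExecutionContext): Vector[B] = {
    val permits = new Semaphore(p)
    val pending = Vector.newBuilder[Future[B]]
    while (chunks.hasNext) {
      permits.acquire()                       // blocks while p tasks are running
      val c = chunks.next()                   // consume input only once a slot is free
      val fut = Future(f(c))
      fut.onComplete(_ => permits.release())  // free the slot on success or failure
      pending += fut
    }
    // Results are returned in input order; a failed task rethrows here.
    Await.result(Future.sequence(pending.result()), Duration.Inf)
  }
}
```

Because hasNext may buffer one element, up to P + 1 chunks can be resident at once. The completed futures keep only their results, which matches the assumption that the accumulated results fit on the heap.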

Second try:

My next attempt was to group the stream into P-sized parts, process each part in parallel, then join before moving on to the next part:

// Fold over groups of up to P chunks: each group is processed in parallel
// and joined before the next group is consumed.
def process(data: EnumeratorT[Chunk, ErrorOr]): IterateeT[Vector[Chunk], ErrorOr, Vector[Result]] =
  Iteratee.foldM[Vector[Chunk], ErrorOr, Vector[Result]](Vector.empty) { (rs, cs) =>
    tryIO(IO(rs ++ Await.result(
      Future.traverse(cs) { c =>
        Future(processChunk(c))
      },
      Duration.Inf)))
  } &= (data mapE Iteratee.group(P))

While this wouldn't fully utilize the available processors (especially since the time required to process each Chunk may vary widely), it would be an improvement. However, the group enumeratee seems to leak memory -- heap usage suddenly goes through the roof.
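For comparison, the same batch-and-join strategy can be sketched on a plain Iterator with the standard library's grouped, which may help separate the cost of the strategy itself from the behaviour of the group enumeratee. The names BatchedParallel and processBatched are hypothetical, and this is not a drop-in replacement for the iteratee version.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object BatchedParallel {
  // Hypothetical helper: processes the input in batches of up to `p` chunks,
  // waiting for a whole batch to finish before pulling the next one.
  def processBatched[A, B](chunks: Iterator[A], p: Int)(f: A => B)(
      implicit ec: ExecutionContext): Vector[B] =
    chunks.grouped(p).foldLeft(Vector.empty[B]) { (rs, batch) =>
      val done = Future.traverse(batch)(c => Future(f(c)))
      rs ++ Await.result(done, Duration.Inf)  // join: a slow chunk stalls its batch
    }
}
```

Only one batch of input is held at a time, but every batch waits for its slowest chunk, which is the utilization problem described above.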

Source: https://stackoverflow.com/questions/19059831/asynchronous-iteratee-processing-in-scalaz
