How to group large stream into sub streams

五迷三道 submitted on 2020-12-04 08:56:55


I want to group a large Stream[F, A] into a Stream[F, Stream[F, A]] with at most n elements in each inner stream.

This is what I did: basically, pipe the chunks into a Queue[F, Queue[F, Chunk[A]]] and then yield the queue elements as the result stream.

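 import cats.effect.Concurrent
 import cats.effect.concurrent.Ref
 import cats.implicits._
 import fs2.{Chunk, Pipe, Stream}
 import fs2.concurrent.Queue

 implicit class StreamSyntax[F[_], A](s: Stream[F, A])(
    implicit F: Concurrent[F]) {

    def groupedPipe(
      lastQRef: Ref[F, Queue[F, Option[Chunk[A]]]],
      n: Int): Pipe[F, A, Stream[F, A]] = { in =>
      // Outer queue of inner queues, plus the first inner queue.
      val initQs =
        Queue.unbounded[F, Option[Queue[F, Option[Chunk[A]]]]].flatMap { qq =>
          Queue.bounded[F, Option[Chunk[A]]](1).flatMap { q =>
            lastQRef.set(q) *> qq.enqueue1(Some(q)).as(qq -> q)
          }
        }

      Stream.eval(initQs).flatMap {
        case (qq, initQ) =>
          // Start a new group: publish a fresh inner queue and remember it.
          def newQueue = Queue.bounded[F, Option[Chunk[A]]](1).flatMap { q =>
            qq.enqueue1(Some(q)) *> lastQRef.set(q).as(q)
          }

          // Producer: feeds chunks into the current inner queue, closing it
          // (with None) once it has received n elements.
          val evalStream = {
            in.chunks
              .evalMapAccumulate((0, initQ)) {
                case ((i, q), c) if i + c.size >= n =>
                  val (l, r) = c.splitAt(n - i)
                  q.enqueue1(Some(l)) >> q.enqueue1(None) >> q
                    .enqueue1(None) >> newQueue.flatMap { nq =>
                    nq.enqueue1(Some(r)).as(((r.size, nq), c))
                  }
                case ((i, q), c) if (i + c.size) < n =>
                  q.enqueue1(Some(c)).as(((i + c.size, q), c))
              }
              .attempt ++ Stream.eval {
              lastQRef.get.flatMap { last =>
                last.enqueue1(None) *> last.enqueue1(None)
              } *> qq.enqueue1(None)
            }
          }

          // Consumer: each inner queue becomes one inner stream.
          qq.dequeue.unNoneTerminate
            .map(
              q =>
                q.dequeue.unNoneTerminate
                  .flatMap(Stream.chunk)
                  .onFinalize(
                    q.dequeue1.void
                  ))
            .concurrently(evalStream)
      }
    }

    def grouped(n: Int) = {
      Stream.eval {
        Queue.unbounded[F, Option[Chunk[A]]].flatMap { empty =>
          Ref.of[F, Queue[F, Option[Chunk[A]]]](empty)
        }
      }.flatMap { ref =>
        val p = groupedPipe(ref, n)
        s.through(p)
      }
    }
 }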

But this is very complicated. Is there a simpler way?


fs2 has the chunkN and chunkLimit methods, which can help with grouping.



chunkN produces chunks of exactly n elements until the end of the stream, where the last chunk may be smaller.

chunkLimit splits existing chunks so that none exceeds the limit, so it can produce chunks of varying size.

scala> Stream(1,2,3).repeat.chunkN(2).take(5).toList
res0: List[Chunk[Int]] = List(Chunk(1, 2), Chunk(3, 1), Chunk(2, 3), Chunk(1, 2), Chunk(3, 1))

scala> (Stream(1) ++ Stream(2, 3) ++ Stream(4, 5, 6)).chunkLimit(2).toList
res0: List[Chunk[Int]] = List(Chunk(1), Chunk(2, 3), Chunk(4, 5), Chunk(6))
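The behaviour at the end of a finite stream may be worth seeing too. As a minimal sketch (the object and value names here are made up for illustration), chunkN takes an allowFewer flag, which defaults to true, that controls whether a trailing chunk with fewer than n elements is emitted or dropped:

```scala
import fs2.{Pure, Stream}

object ChunkNSketch {
  // Hypothetical sample input: five elements, so chunks of 2 leave one over.
  val source: Stream[Pure, Int] = Stream(1, 2, 3, 4, 5)

  // Default (allowFewer = true): the trailing partial chunk is kept.
  val keepPartial: List[List[Int]] =
    source.chunkN(2).map(_.toList).toList

  // allowFewer = false: the trailing partial chunk is dropped.
  val dropPartial: List[List[Int]] =
    source.chunkN(2, allowFewer = false).map(_.toList).toList

  def main(args: Array[String]): Unit = {
    println(keepPartial) // List(List(1, 2), List(3, 4), List(5))
    println(dropPartial) // List(List(1, 2), List(3, 4))
  }
}
```

For "at most n elements per group", the default is probably what you want, since no elements are lost.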


In addition to the already mentioned chunkN, also consider using groupWithin (fs2 1.0.1):

def groupWithin[F2[x] >: F[x]](n: Int, d: FiniteDuration)(implicit timer: Timer[F2], F: Concurrent[F2]): Stream[F2, Chunk[O]]

Divides this stream into groups of elements received within a time window, or limited by the number of elements, whichever happens first. Empty groups, which can occur if no elements can be pulled from upstream in a given time window, will not be emitted.

Note: a time window starts each time downstream pulls.

I'm not sure why you'd want this to be nested streams, since the requirement is to have "at most n elements" in one batch - which implies that you're keeping track of a finite number of elements (which is exactly what a Chunk is for). Either way, a Chunk can always be represented as a Stream with Stream.chunk:

val chunks: Stream[F, Chunk[O]] = ???
val streamOfStreams: Stream[F, Stream[F, O]] =
  chunks.map(Stream.chunk)
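
Putting the two pieces together, a rough sketch of the grouped operation from the question could look like the following. The helper name grouped and the object around it are hypothetical, and this assumes that size-based grouping (no time window) is all that is needed:

```scala
import fs2.{Pure, Stream}

object GroupedSketch {
  // Hypothetical helper: split `s` into inner streams of at most `n` elements.
  // chunkN does the grouping; Stream.chunk lifts each Chunk into a Stream.
  def grouped[F[_], A](s: Stream[F, A], n: Int): Stream[F, Stream[F, A]] =
    s.chunkN(n, allowFewer = true).map(c => Stream.chunk(c))

  def main(args: Array[String]): Unit = {
    // Pure streams keep the demo free of any effect type.
    val groups: List[List[Int]] =
      grouped[Pure, Int](Stream.range(0, 7), 3).map(_.toList).toList
    println(groups) // List(List(0, 1, 2), List(3, 4, 5), List(6))
  }
}
```

Each inner stream here is backed by a single in-memory Chunk, so n should be small enough that one group fits comfortably in memory.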

Here's a complete example of how to use groupWithin:

import cats.implicits._
import cats.effect.{ExitCode, IO, IOApp}
import fs2._
import scala.concurrent.duration._

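object GroupingDemo extends IOApp {
  override def run(args: List[String]): IO[ExitCode] = {
    Stream('a, 'b, 'c).covary[IO]
      .groupWithin(2, 1.second)          // groups of at most 2 elements, or whatever arrived within 1 second
      .map(_.toList)
      .evalMap(group => IO(println(group)))
      .compile.drain
      .as(ExitCode.Success)
  }
}

Output:

List('a, 'b)
List('c)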


