Better way than using `Task/produce/consume` for lazy collections expressed as coroutines

Question

It is very convenient to use Tasks to express a lazy collection / a generator.

E.g.:

function fib()
    Task() do
        prev_prev = 0
        prev = 1
        produce(prev)
        while true
            cur = prev_prev + prev
            produce(cur)
            prev_prev = prev
            prev = cur
        end
    end
end

collect(take(fib(), 10))

Output:

10-element Array{Int64,1}:
  1
  1
  2
  3
  5
  8
 13
 21
 34
 55

However, they do not follow good iterator conventions at all. They are as badly behaved as they can be.

They do not use the returned state:

start(fib()) == nothing #It has no state

So they are instead mutating the iterator object itself. A proper iterator uses its state rather than ever mutating itself, so that multiple callers can iterate over it at once. That state is created by start and advanced during next.

Debatably, that state should be immutable, with next returning a new state, so that the iteration can be trivially tee'd. (On the other hand, that means allocating new memory, though on the stack.)
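To make the point about external state concrete, here is a minimal sketch against the same Julia 0.5 protocol (start/next/done) used in this post. The Squares type is a hypothetical example, not something from Base. Because the iterator object is never mutated, two callers can traverse it independently, and a saved state value can be replayed:

```julia
# Hypothetical example type, written for the Julia 0.5 iteration protocol.
immutable Squares
    n::Int
end

Base.start(::Squares) = 1                       # the state is just the next index
Base.done(sq::Squares, i::Int) = i > sq.n       # pure: only inspects the state
Base.next(::Squares, i::Int) = (i * i, i + 1)   # returns the value and a NEW state

sq = Squares(4)

# Two callers each hold their own state, so they do not disturb each other.
a = start(sq)
b = start(sq)
(x1, a) = next(sq, a)   # first caller, first element
(x2, a) = next(sq, a)   # first caller, second element
(y1, b) = next(sq, b)   # second caller still starts from the beginning

# "Teeing" amounts to keeping a copy of the (immutable) state value.
saved = a
(x3, a) = next(sq, a)       # continue the first traversal
(z3, _) = next(sq, saved)   # replay the same step from the saved state

@show (x1, x2, y1, x3, z3)  # (1, 4, 1, 9, 9)
```

None of this is possible with the Task-based fib above, since its only state lives inside the Task and start returns nothing.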

Furthermore, the hidden state is not advanced during next. The following does not work:

@show ff = fib()
@show state = start(ff)
@show next(ff, state)

Output:

ff = fib() = Task (runnable) @0x00007fa544c12230
state = start(ff) = nothing
next(ff,state) = (nothing,nothing)

Instead, the hidden state is advanced during done. The following works:

@show ff = fib()
@show state = start(ff)
@show done(ff,state)     
@show next(ff, state)

Output:

ff = fib() = Task (runnable) @0x00007fa544c12230
state = start(ff) = nothing
done(ff,state) = false
next(ff,state) = (1,nothing)

Advancing state during done isn't the worst thing in the world. After all, it is often hard to know whether you are done without trying to find the next state. One would hope done would always be called before next. Still, it is not great, since the following happens:

ff = fib()
state = start(ff)
done(ff,state)
done(ff,state)
done(ff,state)
done(ff,state)
done(ff,state)
done(ff,state)
@show next(ff, state)

Output:

next(ff,state) = (8,nothing)

Which is really not what you expect. It is reasonable to assume that done is safe to call multiple times.
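For comparison, here is a small sketch of the behaviour one would normally expect, using a plain Array under the same Julia 0.5 protocol. The variable names are only illustrative. Repeated done calls inspect the state without advancing anything:

```julia
# Julia 0.5 iteration protocol (start/next/done), as used elsewhere in this post.
xs = [10, 20, 30]
state = start(xs)

# Calling done repeatedly only looks at the state; nothing is consumed.
for _ in 1:5
    done(xs, state)
end

(first_item, state) = next(xs, state)
@show first_item   # 10: still the first element
```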

Basically, Tasks make poor iterators. In many cases they are not compatible with other code that expects an iterator. (In many cases they are, but it is hard to tell which is which.) This is because Tasks are not really meant to be used as iterators in these "generator" functions. They are intended for low-level control flow, and are optimized as such.

So what is the better way? Writing an iterator for fib isn't too bad:

immutable Fib end
immutable FibState
    prev::Int
    prevprev::Int
end

Base.start(::Fib) = FibState(0,1)
Base.done(::Fib, ::FibState) = false
function Base.next(::Fib, s::FibState)
    cur = s.prev + s.prevprev
    ns = FibState(cur, s.prev)
    cur, ns
end

Base.iteratoreltype(::Type{Fib}) = Base.HasEltype()
Base.eltype(::Type{Fib}) = Int
Base.iteratorsize(::Type{Fib}) = Base.IsInfinite()

But it is a bit less intuitive. For more complex functions, it is much less nice.

So my question is: what is a better way to get something that works the way a Task does, building up an iterator from a single function, but that is well behaved?

I would not be surprised if someone has already written a package with a macro to solve this.

Answer 1:

The current iterator interface for Tasks is fairly simple:

# in share/julia/base/task.jl
start(t::Task) = nothing
function done(t::Task, val)
    t.result = consume(t)
    istaskdone(t)
end
next(t::Task, val) = (t.result, nothing)

Not sure why the devs chose to put the consumption step in the done function rather than the next function. This is what is producing your weird side-effect. To me it sounds much more straightforward to implement the interface like this:

import Base.start; function Base.start(t::Task) return t end
import Base.next;  function Base.next(t::Task, s::Task) return consume(s), s end
import Base.done;  function Base.done(t::Task, s::Task) istaskdone(s) end

Therefore, this is what I would propose as the answer to your question.

I think this simpler implementation is a lot more meaningful, fulfils your criteria above, and even has the desired outcome of outputting a meaningful state: the Task itself (which you're allowed to "inspect" if you really want to, as long as that doesn't involve consumption :p ).

However, there are certain caveats:

Caveat 1: The task is REQUIRED to have a return value, signifying the final element in the iteration, otherwise "unexpected" behaviour might occur.

I'm assuming the devs chose the first approach to avoid exactly this kind of "unintended" output; however I believe this should have actually been the expected behaviour! A task expected to be used as an iterator should be expected to define an appropriate iteration endpoint (by means of a clear return value) by design!

Example 1: The wrong way to do it

julia> t = Task() do; for i in 1:10; produce(i); end; end;
julia> collect(t) |> show
Any[1,2,3,4,5,6,7,8,9,10,nothing] # last item is a return value of nothing
                                  # corresponding to the "return value" of the
                                  # for loop statement, which is 'nothing'.
                                  # Presumably not the intended output!

Example 2: Another wrong way to do it

julia> t = Task() do; produce(1); produce(2); produce(3); produce(4); end;
julia> collect(t) |> show
Any[1,2,3,4,()] # last item is the return value of the produce statement,
                # which returns any items passed to it by the last
                # 'consume' call; in this case an empty tuple.
                # Presumably not the intended output!

Example 3: The (in my humble opinion) right way to do it!

julia> t = Task() do; produce(1); produce(2); produce(3); return 4; end;
julia> collect(t) |> show
[1,2,3,4] # An appropriate return value ending the Task function ensures an
          # appropriate final value for the iteration, as intended.

Caveat 2: The task should not be modified / consumed further inside the iteration (a common requirement with iterators in general), except in the understanding that this intentionally causes a 'skip' in the iteration (which would be a hack at best, and presumably not advisable).

Example:

julia> t = Task() do; produce(1); produce(2); produce(3); return 4; end;
julia> for i in t; show(consume(t)); end
24

More Subtle example:

julia> t = Task() do; produce(1); produce(2); produce(3); return 4; end;
julia> for i in t   # collecting i is a consumption event
        for j in t  # collecting j is *also* a consumption event
          show(j)
        end
       end # at the end of this loop, i = 1, and j = 4
234

Caveat 3: With this scheme it is expected behaviour that you can 'continue where you left off'. e.g.

julia> t = Task() do; produce(1); produce(2); produce(3); return 4; end;
julia> take(t, 2) |> collect |> show
[1,2]
julia> take(t, 2) |> collect |> show
[3,4]

However, if one would prefer the iterator to always start from the pre-consumption state of a task, the start function could be modified to achieve this:

import Base.start; function Base.start(t::Task) return Task(t.code) end;
import Base.next;  function Base.next(t::Task, s::Task) consume(s), s end;
import Base.done;  function Base.done(t::Task, s::Task) istaskdone(s) end;

julia> for i in t
         for j in t
           show(j)
         end
       end # at the end of this loop, i = 4, and j = 4 independently
1234123412341234

Interestingly, note how this variant would affect the 'inner consumption' scenario from 'caveat 2':

julia> t = Task() do; produce(1); produce(2); produce(3); return 4; end;
julia> for i in t; show(consume(t)); end
1234
julia> for i in t; show(consume(t)); end
4444

See if you can spot why this makes sense! :)

Having said all this, there is a philosophical question of whether it matters at all how a Task behaves with the start, next, and done functions, given that these functions are considered "an informal interface": i.e. they are supposed to be "under the hood" functions, not intended to be called manually.

Therefore, as long as they do their job and return the expected iteration values, you shouldn't care too much about how they do it under the hood, even if technically they don't quite follow the 'spec' while doing so, since you were never supposed to be calling them manually in the first place.

Answer 2:

How about the following (uses fib defined in OP):

type NewTask
  t::Task
end

import Base: start,done,next,iteratorsize,iteratoreltype

start(t::NewTask) = istaskdone(t.t) ? nothing : consume(t.t)
next(t::NewTask,state) = (state==nothing || istaskdone(t.t)) ?
  (state,nothing) : (state,consume(t.t))
done(t::NewTask,state) = state==nothing
iteratorsize(::Type{NewTask}) = Base.SizeUnknown()
iteratoreltype(::Type{NewTask}) = Base.EltypeUnknown()

function fib()
    Task() do
        prev_prev = 0
        prev = 1
        produce(prev)
        while true
            cur = prev_prev + prev
            produce(cur)
            prev_prev = prev
            prev = cur
        end
    end
end
nt = NewTask(fib())
take(nt,10)|>collect
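As a quick check that this wrapper addresses the repeated-done problem from the question, here is a minimal sketch for the same Julia 0.5 API (Task/produce/consume and start/next/done). The NewTask definition is repeated only so the block runs on its own, and one_to_three is a hypothetical producer used for illustration:

```julia
# Same wrapper as in this answer, repeated so the block is self-contained.
type NewTask
    t::Task
end

Base.start(t::NewTask) = istaskdone(t.t) ? nothing : consume(t.t)
Base.next(t::NewTask, state) = (state == nothing || istaskdone(t.t)) ?
    (state, nothing) : (state, consume(t.t))
Base.done(t::NewTask, state) = state == nothing

# Hypothetical producer: yields 1, 2, 3 and then finishes.
function one_to_three()
    for i in 1:3
        produce(i)
    end
end

nt = NewTask(Task(one_to_three))

state = start(nt)        # consumes the first value and keeps it as the state
for _ in 1:5
    done(nt, state)      # only compares the state with nothing; no consumption
end

(item, state) = next(nt, state)
@show item               # 1, despite the repeated done calls
@show state              # 2, the value looked ahead by next
```

The trade-off in this design is that the state holds the looked-ahead value, so a task that legitimately produces nothing would end the iteration early.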

This is a good question, and is possibly better suited to the Julia list (now on the Discourse platform). In any case, using the NewTask defined above, an improved answer to a recent StackOverflow question is possible. See: https://stackoverflow.com/a/41068765/3580870

Source: https://stackoverflow.com/questions/41072425/better-way-than-using-task-produce-consume-for-lazy-collections-express-as-cor

Tags

generator

julia

coroutine