How to create a memory efficient Ruby Pipe class with lazy evaluation?

Submitted by 天涯浪子 on 2019-12-08 08:35:03

Question


I would like to create a Pipe class to emulate Unix commands in Ruby in a two step fashion. First step is to compile a pipeline by adding a number of commands, and the second step is to run that pipeline. Here is a mockup:

#!/usr/bin/env ruby

p = Pipe.new
p.add(:cat, input: "table.txt")
p.add(:cut, field: 2)
p.add(:grep, pattern: "foo")
p.add(:puts, output: "result.txt")
p.run

The question is: how can this be coded using lazy evaluation, so that when run is called the pipeline processes the data record by record, without ever holding all of it in memory at once?


Answer 1:


Take a look at the Enumerator class (http://ruby-doc.org/core-2.0.0/Enumerator.html). The Pipe class can stitch enumerators together: add(:cat, input: 'foo.txt') creates an enumerator that yields the lines of foo.txt, add(:grep) filters them against a regexp, and so on.

Here's a lazy file reader, with an eager version for comparison:

require 'benchmark'

def lazy_cat(filename)
  e = Enumerator.new do |yielder|
    f = File.open filename
    s = f.gets
    while s
      yielder.yield s
      s = f.gets
    end
  end
  e.lazy
end

def cat(filename)
  Enumerator.new do |yielder|
    f = File.open filename
    s = f.gets
    while s
      yielder.yield s
      s = f.gets
    end
  end
end

lazy = Benchmark.realtime { puts lazy_cat("log.txt").map{|s| s.upcase}.take(1).to_a }
puts "Lazy: #{lazy}"

eager = Benchmark.realtime { puts cat("log.txt").map{|s| s.upcase}.take(1).to_a }
puts "Eager: #{eager}"

The eager version takes about 7 seconds on a 10-million-line file; the lazy version finishes almost instantly.
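The speedup comes from short-circuiting: with .lazy, map and take pull only as many elements as they actually need, so even an endless source works. A minimal sketch (the names below are illustrative, not from the answer above):

```ruby
# An infinite enumerator: without .lazy, calling map on it would never return.
naturals = Enumerator.new do |yielder|
  n = 1
  loop do
    yielder << n
    n += 1
  end
end

# .lazy makes map a deferred transformation; first(3) pulls just 3 elements.
first_three = naturals.lazy.map { |n| n * n }.first(3)
p first_three  # => [1, 4, 9]
```

The same mechanism is why the lazy cat above only ever reads one line of log.txt when followed by take(1).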




Answer 2:


From what I understand, you can simply read one line at a time, push that single line through the pipeline, and then write it to the output. Some code:

output = File.new("output.txt", "w")
File.new("input.txt").each do |line|
    record = read_record(line)
    newrecord = run_pipeline_on_one_record(record)
    output.write(dump_record(newrecord))
end

Another, much heavier option would be to create actual blocking IO pipes and use one thread for each task in the pipeline. This somewhat resembles what Unix does.
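A rough sketch of that heavier option (the helper names are my own, not from the answer): each stage runs in its own thread, and stages are connected by IO.pipe, much like a shell wires processes together.

```ruby
def threaded_pipeline(input_lines, stages)
  read_end, write_end = IO.pipe

  # Producer thread feeds the first pipe, then closes its write end so
  # the downstream reader sees EOF.
  threads = [Thread.new do
    input_lines.each { |line| write_end.puts(line) }
    write_end.close
  end]

  stages.each do |stage|
    next_read, next_write = IO.pipe
    upstream = read_end  # capture this stage's own read end in the closure
    threads << Thread.new do
      upstream.each_line { |line| next_write.puts(stage.call(line.chomp)) }
      upstream.close
      next_write.close
    end
    read_end = next_read
  end

  # The main thread drains the last pipe; only one line per pipe is in
  # flight at a time (plus whatever fits in the OS pipe buffers).
  result = read_end.each_line.map(&:chomp)
  threads.each(&:join)
  read_end.close
  result
end

out = threaded_pipeline(%w[aaa 12345 h],
                        [->(l) { l.size.to_s },
                         ->(l) { "number of chars: #{l}" }])
p out  # => ["number of chars: 3", "number of chars: 5", "number of chars: 1"]
```

The OS pipe buffers give you bounded memory for free, at the cost of thread and syscall overhead per stage.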

Sample usage with OP's syntax:

class Pipe
    def initialize
        @actions = []
    end
    def add(&block)
        @actions << block
    end
    def run(infile, outfile)
        output = File.open(outfile, "w")
        File.open(infile).each do |line|
            line.chomp!
            @actions.each {|act| line = act[line] }
            output.write(line+"\n")
        end
    end
end

p = Pipe.new
p.add {|line| line.size.to_s }
p.add {|line| "number of chars: #{line}" }
p.run("in.txt", "out.txt")

Sample in.txt:

aaa
12345
h

Generated out.txt:

number of chars: 3
number of chars: 5
number of chars: 1



Answer 3:


This seems to work:

#!/usr/bin/env ruby

require 'pp'

class Pipe
  def initialize
    @commands = []
  end

  def add(command, options = {})
    @commands << [command, options]

    self
  end

  def run
    enum = nil

    @commands.each do |command, options|
      enum = method(command).call enum, options
    end

    enum.each {} # drain the lazy chain so side effects (e.g. :save) run

    enum
  end

  def to_s
    cmd_string = "Pipe.new"

    @commands.each do |command, options|
      opt_list = []

      options.each do |key, value|
        if value.is_a? String
          opt_list << "#{key}: \"#{value}\""
        else
          opt_list << "#{key}: #{value}"
        end
      end

      cmd_string << ".add(:#{command}, #{opt_list.join(", ")})"
    end

    cmd_string << ".run"
  end

  private

  def cat(enum, options)
    Enumerator.new do |yielder|
      enum.map { |line| yielder << line } if enum

      File.open(options[:input]) do |ios|
        ios.each { |line| yielder << line }
      end
    end.lazy
  end

  def cut(enum, options)
    Enumerator.new do |yielder|
      enum.each do |line|
        fields = line.chomp.split(%r{#{options[:delimiter]}})

        yielder << fields[options[:field]]
      end
    end.lazy
  end

  def grep(enum, options)
    Enumerator.new do |yielder|
      enum.each do |line|
        yielder << line if line.match(options[:pattern])
      end
    end.lazy
  end

  def save(enum, options)
    Enumerator.new do |yielder|
      File.open(options[:output], 'w') do |ios|
        enum.each do |line|
          ios.puts line
          yielder << line
        end
      end
    end.lazy
  end
end

p = Pipe.new
p.add(:cat, input: "table.txt")
p.add(:cut, field: 2, delimiter: ',\s*')
p.add(:grep, pattern: "4")
p.add(:save, output: "result.txt")
p.run

puts p



Answer 4:


Building on https://stackoverflow.com/a/20049201/3183101 (the code from the first answer):

require 'benchmark'

def lazy_cat(filename)
  e = Enumerator.new do |yielder|
    f = File.open filename
    s = f.gets
    while s
      yielder.yield s
      s = f.gets
    end
  end
  e.lazy
end

def cat(filename)
  Enumerator.new do |yielder|
    f = File.open filename
    s = f.gets
    while s
      yielder.yield s
      s = f.gets
    end
  end
end

lazy = Benchmark.realtime { puts lazy_cat("log.txt").map{|s| s.upcase}.take(1).to_a }
puts "Lazy: #{lazy}"

eager = Benchmark.realtime { puts cat("log.txt").map{|s| s.upcase}.take(1).to_a }
puts "Eager: #{eager}"

This could have been simplified to the following, which I think makes the difference between the two methods easier to see.

require 'benchmark'

def cat(filename, evaluation_strategy: :eager)
  e = Enumerator.new do |yielder|
    f = File.open filename
    s = f.gets
    while s
      yielder.yield s
      s = f.gets
    end
  end
  evaluation_strategy == :lazy ? e.lazy : e  # must return the plain enumerator for :eager, not nil
end

lazy = Benchmark.realtime { puts cat("log.txt", evaluation_strategy: :lazy).map{ |s|
  s.upcase}.take(1).to_a 
}
puts "Lazy: #{lazy}"

eager = Benchmark.realtime { puts cat("log.txt", evaluation_strategy: :eager).map{ |s|
  s.upcase}.take(1).to_a 
}
puts "Eager: #{eager}"

I would have just put this in a comment, but I don't have enough reputation here yet to do so. In any case, posting the full code makes the comparison clearer.




Answer 5:


This builds on the previous answers, and serves as a warning about a gotcha with enumerators: an enumerator that hasn't been exhausted (i.e. hasn't raised StopIteration) will not run its ensure blocks. That means a construct like File.open(...) { } won't clean up after itself.

Example:

def lazy_cat(filename)
  f = nil  # visible to the define_singleton_method block
  e = Enumerator.new do |yielder|
    # Also stored in @f for demonstration purposes only, so we examine it later
    @f = f = File.open filename
    s = f.gets
    while s
      yielder.yield s
      s = f.gets
    end
  end
  e.lazy.tap do |enum|
    # Provide a finish method to close the File
    # We can't use def enum.finish because it can't see 'f'
    enum.define_singleton_method(:finish) do
      f.close
    end
  end
end

def get_first_line(path)
  enum = lazy_cat(path)
  enum.take(1).to_a
end

def get_first_line_with_finish(path)
  enum = lazy_cat(path)
  enum.take(1).to_a
ensure
  enum.finish
end


# foo.txt contains:
# abc
# def
# ghi

puts "Without finish"
p get_first_line('foo.txt')
if @f.closed?
  puts "OK: handle was closed"
else
  puts "FAIL: handle not closed!"
  @f.close
end
puts

puts "With finish"
p get_first_line_with_finish('foo.txt')
if @f.closed?
  puts "OK: handle was closed"
else
  puts "FAIL: handle not closed!"
  @f.close
end

Running this produces:

Without finish
["abc\n"]
FAIL: handle not closed!

With finish
["abc\n"]
OK: handle was closed

Note that if you don't provide the finish method, the stream won't be closed, and you'll leak file descriptors. It's possible that GC will close it, but you shouldn't depend on that.
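One alternative to a finish method (a sketch of my own, not from the answer above): expose the lazy enumerator only inside a block, so that File.open's own ensure closes the handle no matter how little of the stream the caller consumed.

```ruby
# Block-style helper: the caller never holds the enumerator outside the
# File.open block, so the handle is closed when the block returns.
def with_lazy_cat(filename)
  File.open(filename) do |f|
    yield f.each_line.lazy
  end
end

File.write("demo.txt", "abc\ndef\nghi\n")
first = with_lazy_cat("demo.txt") { |lines| lines.first(1).map(&:chomp) }
p first  # => ["abc"]
File.delete("demo.txt")
```

The trade-off is that the enumerator cannot outlive the block, which suits one-shot pipelines like the OP's run method but not APIs that hand lazy streams to their callers.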



Source: https://stackoverflow.com/questions/19999888/how-to-create-a-memory-efficient-ruby-pipe-class-with-lazy-evaluation
