Question
I would like to create a Pipe class to emulate Unix commands in Ruby in a two-step fashion. The first step is to compile a pipeline by adding a number of commands, and the second step is to run that pipeline. Here is a mockup:
#!/usr/bin/env ruby
p = Pipe.new
p.add(:cat, input: "table.txt")
p.add(:cut, field: 2)
p.add(:grep, pattern: "foo")
p.add(:puts, output: "result.txt")
p.run
The question is: how can this be coded using lazy evaluation, so that the pipe is processed record by record when run() is called, without loading all of the data into memory at any one time?
Answer 1:
Take a look at the Enumerator class (http://ruby-doc.org/core-2.0.0/Enumerator.html). The Pipe class will stitch together an Enumerator, e.g. add(:cat, input: 'foo.txt') will create an enumerator which yields the lines of foo.txt, add(:grep) will filter it according to a regexp, etc.
Here's the lazy file reader
require 'benchmark'
def lazy_cat(filename)
  e = Enumerator.new do |yielder|
    f = File.open filename
    s = f.gets
    while s
      yielder.yield s
      s = f.gets
    end
  end
  e.lazy
end
def cat(filename)
  Enumerator.new do |yielder|
    f = File.open filename
    s = f.gets
    while s
      yielder.yield s
      s = f.gets
    end
  end
end
lazy = Benchmark.realtime { puts lazy_cat("log.txt").map { |s| s.upcase }.take(1).to_a }
puts "Lazy: #{lazy}"
eager = Benchmark.realtime { puts cat("log.txt").map { |s| s.upcase }.take(1).to_a }
puts "Eager: #{eager}"
The eager version takes 7 seconds for a 10-million-line file; the lazy version takes practically no time.
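The stitching described at the top of this answer might look roughly like the sketch below. LazyPipe, its stage names and its option keys (input:, field:, delimiter:, pattern:) are assumptions made for illustration and are not part of the original answer. Each stage wraps the previous enumerator, and nothing is read until run is called.

```ruby
# Sketch only: a pipe whose stages are composed from Enumerator::Lazy.
# Class name, stage names and option keys are illustrative assumptions.
class LazyPipe
  def initialize
    @stages = []
  end

  # Record a stage; no file is opened and nothing is computed here.
  def add(name, **options)
    @stages << [name, options]
    self
  end

  # Build the lazy chain, then pull records through it one at a time.
  def run
    chain = @stages.reduce(nil) { |enum, (name, options)| send(name, enum, **options) }
    chain.each { |record| yield record if block_given? }
  end

  private

  # Source stage: File.foreach without a block returns an Enumerator over lines.
  def cat(_enum, input:)
    File.foreach(input).lazy
  end

  # Keep one field of each delimited line (0-based index).
  def cut(enum, field:, delimiter: ',')
    enum.map { |line| line.chomp.split(delimiter)[field].to_s }
  end

  # Keep only records matching the pattern (String or Regexp).
  def grep(enum, pattern:)
    enum.select { |record| record.match?(pattern) }
  end
end
```

Usage would be along the lines of LazyPipe.new.add(:cat, input: "table.txt").add(:cut, field: 1).add(:grep, pattern: "foo").run { |r| puts r }. Because map and select on a lazy enumerator return another lazy enumerator, each line travels through all stages before the next line is read.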
Answer 2:
From what I understand, you can simply read one line at a time, move this single line through the pipeline, and then write it to the output. Some code:
# read_record, run_pipeline_on_one_record and dump_record are placeholders
File.open("output.txt", "w") do |output|
  File.foreach("input.txt") do |line|
    record = read_record(line)
    newrecord = run_pipeline_on_one_record(record)
    output.write(dump_record(newrecord))
  end
end
Another, much heavier option would be to create actual blocking IO pipes and use one thread for each task in the pipeline. This somewhat resembles what Unix does.
Sample usage with OP's syntax:
class Pipe
  def initialize
    @actions = []
  end
  def add(&block)
    @actions << block
  end
  # Stream infile to outfile, passing each line through every action in order
  def run(infile, outfile)
    # The block forms close both files when the loop finishes
    File.open(outfile, "w") do |output|
      File.foreach(infile) do |line|
        line.chomp!
        @actions.each { |act| line = act[line] }
        output.write(line + "\n")
      end
    end
  end
end
p = Pipe.new
p.add { |line| line.size.to_s }
p.add { |line| "number of chars: #{line}" }
p.run("in.txt", "out.txt")
Sample in.txt:
aaa
12345
h
Generated out.txt:
number of chars: 3
number of chars: 5
number of chars: 1
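The heavier thread-per-task option mentioned in this answer could be sketched as follows. This is only an illustration under stated assumptions: run_threaded is a hypothetical helper name, each stage is a callable that maps one line to one line, and the stages are connected with IO.pipe so that a full pipe buffer makes the upstream thread block until the downstream stage catches up.

```ruby
# Sketch only: one thread per stage, connected by OS pipes (IO.pipe).
# run_threaded and the callable stages are hypothetical, for illustration.
def run_threaded(source, sink, stages)
  threads = []
  reader = source
  stages.each do |stage|
    next_reader, writer = IO.pipe
    input = reader # captured per stage so each thread reads its own input
    threads << Thread.new do
      begin
        input.each_line { |line| writer.puts stage.call(line.chomp) }
      ensure
        writer.close                            # signals end-of-file downstream
        input.close unless input.equal?(source) # the caller owns the source
      end
    end
    reader = next_reader
  end
  # The calling thread drains the last pipe into the sink.
  reader.each_line { |line| sink.write(line) }
  reader.close unless reader.equal?(source)
  threads.each(&:join)
end
```

With the two blocks from the sample above passed as lambdas, e.g. run_threaded(File.open("in.txt"), $stdout, [->(l) { l.size.to_s }, ->(l) { "number of chars: #{l}" }]), only the lines currently in flight between stages are held in memory.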
Answer 3:
This seems to work:
#!/usr/bin/env ruby
require 'pp'
class Pipe
  def initialize
    @commands = []
  end
  def add(command, options = {})
    @commands << [command, options]
    self
  end
  def run
    enum = nil
    @commands.each do |command, options|
      enum = method(command).call enum, options
    end
    enum.each {} # force the lazy chain so every record flows through
    enum
  end
  def to_s
    cmd_string = "Pipe.new"
    @commands.each do |command, options|
      opt_list = []
      options.each do |key, value|
        if value.is_a? String
          opt_list << "#{key}: \"#{value}\""
        else
          opt_list << "#{key}: #{value}"
        end
      end
      cmd_string << ".add(:#{command}, #{opt_list.join(", ")})"
    end
    cmd_string << ".run"
  end
  private
  def cat(enum, options)
    Enumerator.new do |yielder|
      # each, not map: map on a lazy enumerator would never run the block
      enum.each { |line| yielder << line } if enum
      File.open(options[:input]) do |ios|
        ios.each { |line| yielder << line }
      end
    end.lazy
  end
  def cut(enum, options)
    Enumerator.new do |yielder|
      enum.each do |line|
        fields = line.chomp.split(%r{#{options[:delimiter]}})
        yielder << fields[options[:field]]
      end
    end.lazy
  end
  def grep(enum, options)
    Enumerator.new do |yielder|
      enum.each do |line|
        yielder << line if line.match(options[:pattern])
      end
    end.lazy
  end
  def save(enum, options)
    Enumerator.new do |yielder|
      File.open(options[:output], 'w') do |ios|
        enum.each do |line|
          ios.puts line
          yielder << line
        end
      end
    end.lazy
  end
end
p = Pipe.new
p.add(:cat, input: "table.txt")
p.add(:cut, field: 2, delimiter: ',\s*')
p.add(:grep, pattern: "4")
p.add(:save, output: "result.txt")
p.run
puts p
Answer 4:
This refers to the lazy_cat/cat benchmark from https://stackoverflow.com/a/20049201/3183101 (the same code shown in the first answer above). It could have been simplified to the following, which I think makes the difference between the two methods easier to see.
require 'benchmark'
def cat(filename, evaluation_strategy: :eager)
  e = Enumerator.new do |yielder|
    f = File.open filename
    s = f.gets
    while s
      yielder.yield s
      s = f.gets
    end
  end
  # Return an enumerator in both cases; a bare `e.lazy if ...` returns nil for :eager
  evaluation_strategy == :lazy ? e.lazy : e
end
lazy = Benchmark.realtime { puts cat("log.txt", evaluation_strategy: :lazy).map { |s| s.upcase }.take(1).to_a }
puts "Lazy: #{lazy}"
eager = Benchmark.realtime { puts cat("log.txt", evaluation_strategy: :eager).map { |s| s.upcase }.take(1).to_a }
puts "Eager: #{eager}"
I would have just put this in a comment, but I'm too 'green' here to be permitted to do so. Anyway, the ability to post all of the code I think makes it clearer.
Answer 5:
This builds on previous answers, and serves as a warning about a gotcha regarding enumerators. An enumerator that hasn't been exhausted (i.e. raised StopIteration) will not run ensure blocks. That means a construct like File.open { } won't clean up after itself.
Example:
def lazy_cat(filename)
  f = nil # visible to the define_singleton_method block
  e = Enumerator.new do |yielder|
    # Also stored in @f for demonstration purposes only, so we examine it later
    @f = f = File.open filename
    s = f.gets
    while s
      yielder.yield s
      s = f.gets
    end
  end
  e.lazy.tap do |enum|
    # Provide a finish method to close the File
    # We can't use def enum.finish because it can't see 'f'
    enum.define_singleton_method(:finish) do
      f.close
    end
  end
end
def get_first_line(path)
  enum = lazy_cat(path)
  enum.take(1).to_a
end
def get_first_line_with_finish(path)
  enum = lazy_cat(path)
  enum.take(1).to_a
ensure
  enum.finish
end
# foo.txt contains:
# abc
# def
# ghi
puts "Without finish"
p get_first_line('foo.txt')
if @f.closed?
  puts "OK: handle was closed"
else
  puts "FAIL: handle not closed!"
  @f.close
end
puts
puts "With finish"
p get_first_line_with_finish('foo.txt')
if @f.closed?
  puts "OK: handle was closed"
else
  puts "FAIL: handle not closed!"
  @f.close
end
Running this produces:
Without finish
["abc\n"]
FAIL: handle not closed!
With finish
["abc\n"]
OK: handle was closed
Note that if you don't provide the finish method, the stream won't be closed, and you'll leak file descriptors. It's possible that GC will close it, but you shouldn't depend on that.
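One possible way to sidestep the missing cleanup, offered as a sketch rather than a drop-in replacement for the code above: open the file with a block in the caller and build the lazy chain inside that block. The block form of File.open closes the file when the block exits, so stopping early still releases the descriptor. first_lines is a hypothetical helper name, and the result has to be forced (here with first) before the block returns.

```ruby
# Sketch only: the File.open block lives in the caller, outside the lazy
# chain, so the file is closed when the block exits.
# first_lines is a hypothetical helper name used for illustration.
def first_lines(path, count)
  File.open(path) do |f|
    # each_line without a block returns an Enumerator; .lazy keeps it streaming.
    # first(count) stops pulling lines as soon as it has enough.
    f.each_line.lazy.map(&:chomp).first(count)
  end # the file is closed here even though it was not read to the end
end
```

The trade-off is that the lazy enumerator cannot be handed back to the caller for later use, since the file is already closed by then; an explicit finish method as shown above is still needed for that case.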
Source: https://stackoverflow.com/questions/19999888/how-to-create-a-memory-efficient-ruby-pipe-class-with-lazy-evaluation