Merge CSV files on a common field with Ruby/FasterCSV

Question

I have a 'master' file with a number of columns: 1 2 3 4 5. I have a few other files, with fewer rows than the master file, each with columns: 1 6. I'd like to merge these files matching on the column 1 field and add column 6 to the master. I've seen some python/UNIX solutions but would prefer to use ruby/fastercsv if it's a good fit. I would appreciate any help getting started.

Answer 1:

FasterCSV became the standard CSV library as of Ruby 1.9, so a plain require 'csv' is enough there. This code is untested, but should work.

require 'csv'

master = CSV.read('master.csv')      # Read the master file into an array of rows
master.each { |row| row.push('') }   # Add an empty column to every row

Dir.glob('*.csv').each do |path|     # Go through all CSV files in the directory
  # Skip the master file and any output left over from a previous run
  next if path == 'master.csv' || path == 'output.csv'
  CSV.read(path).each do |line|      # Go through each row of the file
    row = master.assoc(line[0])      # Find the master row whose first field matches
    row[-1] = line[1] if row         # Update the last column if a match was found
  end
end

# The block form closes (and flushes) the output file when the block ends
CSV.open('output.csv', 'wb') do |csv|
  master.each { |row| csv << row }   # Write the modified master rows
end

Answer 2:

$ cat j4.csv
how, now, brown, cow, f1
now, is, the, time, f2
one, two, three, four, five
xhow, now, brown, cow, f1
xnow, is, the, time, f2
xone, two, three, four, five
$ cat j4a.csv
how, b
one, d
$ cat hj.rb
require 'pp'
require 'rubygems'
require 'fastercsv'

pp(
  FasterCSV.read('j4a.csv').inject(
    FasterCSV.read('j4.csv').inject({}) do |m, e|
      m[e[0]] = e
      m
    end) do |m, e|
    k = e[0]
    m[k] << e.last if m[k]
    m
  end.values)
$ ruby hj.rb
[["now", " is", " the", " time", " f2"],
 ["xhow", " now", " brown", " cow", " f1"],
 ["xone", " two", " three", " four", " five"],
 ["how", " now", " brown", " cow", " f1", " b"],
 ["one", " two", " three", " four", " five", " d"],
 ["xnow", " is", " the", " time", " f2"]]

This works by mapping your master file into a hash with column one as the key, and then it just looks up the key from your other files. As written the code appends the last column when the keys match. Since you have more than one non-master file, you could adapt the concept by replacing FasterCSV.read('j4a.csv') with a method that reads each file and concatenates them all into a single array of arrays, or you could just save the result from the inner inject (the master hash) and apply each other file to it in a loop.
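
The second option (keep the master hash and apply each other file to it in a loop) might look roughly like the sketch below. It uses the standard csv library rather than the fastercsv gem, and the helper names index_by_key, apply_file! and merge_all are made up for illustration. It assumes each key shows up in at most one of the non-master files, since a key found in two files would get two values appended. Inline strings stand in for the real files so the example runs on its own.

```ruby
require 'csv'

# Index the master rows by their first column (the shared key).
# Rows are duplicated so the caller's arrays are not modified.
def index_by_key(rows)
  rows.each_with_object({}) { |row, index| index[row[0]] = row.dup }
end

# Append the last column of every matching row in other_rows
# to the corresponding master row; unmatched keys are ignored.
def apply_file!(index, other_rows)
  other_rows.each do |row|
    target = index[row[0]]
    target << row.last if target
  end
  index
end

# Hypothetical helper: build the index once, then apply each file in turn.
def merge_all(master_rows, other_files_rows)
  index = index_by_key(master_rows)
  other_files_rows.each { |rows| apply_file!(index, rows) }
  index.values
end

# Stand-ins for CSV.read('master.csv') and the smaller files.
master = CSV.parse("how,now,brown,cow,f1\none,two,three,four,five\n")
others = [CSV.parse("how,b\n"), CSV.parse("one,d\n")]

merged = merge_all(master, others)
merged.each { |row| puts row.to_csv }
```

On Ruby 1.9 and later a hash keeps insertion order, so the merged rows come back in master order. With real files, the stand-ins would be replaced by CSV.read calls, for example over the result of Dir.glob.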

Answer 3:

temp = master.assoc(line[0])

The line above is very slow: Array#assoc scans master from the start for every row it looks up, so the overall complexity is at least O(n^2).

I would use the following process instead:

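For each 1 6 CSV, convert it to one big hash with column 1 as the key and column 6 as the value, named something like one_to_six (a Ruby identifier cannot start with a digit).
Loop over the 1 2 3 4 5 CSV row by row and set column 6 of each row to one_to_six[column 1 of that row].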

This dramatically reduces the complexity to O(n), because each hash lookup takes constant time on average.
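
As a rough sketch of the two steps above, using the standard csv library: the helper names build_lookup and add_column are invented here, and the key is assumed to be the first field of every row (index 0 in Ruby, "column 1" in the question). Inline strings stand in for the real files.

```ruby
require 'csv'

# Step 1: turn the two-column rows into a key => value lookup hash.
def build_lookup(small_rows)
  small_rows.each_with_object({}) { |(key, value), lookup| lookup[key] = value }
end

# Step 2: walk the master rows once and append the looked-up value.
# Rows without a match get nil, which CSV writes as an empty field.
def add_column(master_rows, lookup)
  master_rows.map { |row| row + [lookup[row[0]]] }
end

# Stand-ins for CSV.read on the real files.
small  = CSV.parse("a,x\nc,z\n")
master = CSV.parse("a,1,2\nb,3,4\nc,5,6\n")

result = add_column(master, build_lookup(small))
result.each { |row| puts row.to_csv }
```

With several small files, their rows could all be fed into the same lookup hash before the single pass over the master.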

Source: https://stackoverflow.com/questions/7947000/merge-csv-files-on-a-common-field-with-ruby-fastercsv

Tags

ruby

fastercsv