Ruby: How can I detect/intelligently guess the delimiter used in a CSV file?

Submitted by 让人想犯罪 __ on 2019-12-21 12:30:08

Question


I need to figure out which delimiter is used in a CSV file (comma, space, or semicolon) in my Ruby project. I know there is a Sniffer class in Python's csv module that can guess a given file's delimiter. Is there anything similar in Ruby? Any help or ideas would be greatly appreciated.


Answer 1:


It looks like the Python implementation just checks a few dialects: excel or excel_tab. So a simple implementation that just checks for "," or "\t" is:

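COMMON_DELIMITERS = ['","', "\"\t\""].freeze

def sniff(path)
  # The block form closes the file once the first line has been read.
  first_line = File.open(path, &:first)
  return nil unless first_line

  # Score each candidate on the first line, then sort highest score first.
  snif = {}
  COMMON_DELIMITERS.each { |delim| snif[delim] = first_line.count(delim) }
  snif = snif.sort { |a, b| b[1] <=> a[1] }
  snif.size > 0 ? snif[0][0] : nil
end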

Note: this returns the full quoted delimiter it finds, e.g. ",". To get the bare , change snif[0][0] to snif[0][0][1].

Also, I'm using count(delim) because it is a little faster. Note that String#count treats its argument as a set of characters rather than a substring, so a delimiter made of two (or more) identical characters, such as --, would have each occurrence counted twice (or more) when weighing the candidates. In that case it may be better to use scan(delim).length.
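The difference between count and scan can be seen in a short sketch. The helper names and the sample line are made up for illustration; only String#count and String#scan come from Ruby itself.

```ruby
# String#count treats its argument as a set of characters:
# with '","' it counts every double quote and every comma.
def char_set_hits(line, delim)
  line.count(delim)
end

# String#scan with a String pattern matches the literal substring.
def substring_hits(line, delim)
  line.scan(delim).length
end

# Hypothetical sample line, not taken from any real file.
SAMPLE_LINE = %("a","b","c"\n)

puts char_set_hits(SAMPLE_LINE, '","')   # 6 quotes + 2 commas
puts substring_hits(SAMPLE_LINE, '","')  # 2 occurrences of ","
```

Because the double quote is part of every candidate in COMMON_DELIMITERS, it adds the same amount to each score, so the ranking is still decided by the delimiter character itself.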




Answer 2:


Here is Gary S. Weaver's answer as we use it in production. It is a good solution that works well.

class ColSepSniffer
  NoColumnSeparatorFound = Class.new(StandardError)
  EmptyFile = Class.new(StandardError)

  COMMON_DELIMITERS = [
    '","',
    '"|"',
    '";"'
  ].freeze

  def initialize(path:)
    @path = path
  end

  def self.find(path)
    new(path: path).find
  end

  def find
    fail EmptyFile unless first

    if valid?
      delimiters[0][0][1]
    else
      fail NoColumnSeparatorFound
    end
  end

  private

  def valid?
    !delimiters.collect(&:last).reduce(:+).zero?
  end

  # delimiters #=> [["\";\"", 54], ["\",\"", 0], ["\"|\"", 0]]
  # delimiters[0] #=> ["\";\"", 54]
  # delimiters[0][0] #=> "\";\""
  # delimiters[0][0][1] #=> ";"
  def delimiters
    @delimiters ||= COMMON_DELIMITERS.inject({}, &count).sort(&most_found)
  end

  def most_found
    ->(a, b) { b[1] <=> a[1] }
  end

  def count
    ->(hash, delimiter) { hash[delimiter] = first.count(delimiter); hash }
  end

  def first
    @first ||= file.first
  end

  def file
    @file ||= File.open(@path)
  end
end

Spec

require "spec_helper"

describe ColSepSniffer do
  describe ".find" do
    subject(:find) { described_class.find(path) }

    let(:path) { "./spec/fixtures/google/products.csv" }

    context "when , delimiter" do
      it "returns separator" do
        expect(find).to eq(',')
      end
    end

    context "when ; delimiter" do
      let(:path) { "./spec/fixtures/google/products_with_semi_colon_seperator.csv" }

      it "returns separator" do
        expect(find).to eq(';')
      end
    end

    context "when | delimiter" do
      let(:path) { "./spec/fixtures/google/products_with_bar_seperator.csv" }

      it "returns separator" do
        expect(find).to eq('|')
      end
    end

    context "when empty file" do
      it "raises error" do
        expect(File).to receive(:open) { [] }
        expect { find }.to raise_error(described_class::EmptyFile)
      end
    end

    context "when no column separator is found" do
      it "raises error" do
        expect(File).to receive(:open) { [''] }
        expect { find }.to raise_error(described_class::NoColumnSeparatorFound)
      end
    end
  end
end



Answer 3:


I'm not aware of any sniffer implementation in the CSV library included in Ruby 1.9. It will try to auto-discover the row separator, but the column separator is assumed to be a comma by default.

One idea would be to try parsing a sample number of rows (5% of total maybe?) using each of the possible separators. Whichever separator results in the same number of columns most consistently is probably the correct separator.
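A minimal sketch of that idea, assuming the sample rows are already in memory as an array of lines. The method name guess_col_sep, the candidate list, and the sample data are hypothetical; CSV.parse_line and its col_sep option are from Ruby's standard csv library.

```ruby
require "csv"

# Candidate separators are an assumption; extend the list as needed.
CANDIDATE_SEPARATORS = [",", ";", "\t", "|"].freeze

# Hypothetical helper: returns the separator that yields the most consistent
# column count across the sampled lines, or nil if none splits them.
def guess_col_sep(lines, candidates = CANDIDATE_SEPARATORS)
  return nil if lines.empty?

  scores = candidates.map do |sep|
    widths = lines.map do |line|
      begin
        CSV.parse_line(line, col_sep: sep).to_a.size
      rescue CSV::MalformedCSVError
        0 # treat unparsable lines as having no columns
      end
    end
    # Most common column count and the lines that share it.
    width, same = widths.group_by(&:itself).max_by { |_, group| group.size }
    [sep, width, same.size]
  end

  # Ignore separators that never split a line; prefer the one on which the
  # most lines agree, breaking ties by the larger column count.
  best = scores.select { |_, width, _| width > 1 }
               .max_by { |_, width, hits| [hits, width] }
  best && best.first
end

# Hypothetical sample; in practice pass e.g. File.foreach(path).first(20).
sample = ["id;name;price\n", "1;Widget;9,99\n", "2;Gadget;12,50\n"]
puts guess_col_sep(sample).inspect
```

In this sample the comma appears inside the decimal prices, so it splits only two of the three lines into the same number of columns, while the semicolon splits all three consistently.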



Source: https://stackoverflow.com/questions/14693929/ruby-how-can-i-detect-intelligently-guess-the-delimiter-used-in-a-csv-file
