Ruby 1.9: Regular Expressions with unknown input encoding

一个人的身影 2020-12-10 12:44

Is there an accepted way to deal with regular expressions in Ruby 1.9 when the encoding of the input is unknown? Let's say my input happens to be UTF-16 encoded:

2 Answers
  • 2020-12-10 13:06

    As far as I am aware, there is no better method to use. However, might I suggest a slight alteration?

    Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string every time you meet a new encoding is a lot less work than translating hundreds or thousands of lines of input to match the encoding of your regex.

    # Utility function to make transcoding the regex simpler.
    def get_regex(pattern, encoding='ASCII', options=0)
      Regexp.new(pattern.encode(encoding),options)
    end
    
    
    
      # Inside code looping through lines of input.
      # The variables 'regex' and 'last_encoding' should be initialized previously, to
      # persist across loops.
      if line.respond_to?(:encoding)  # Ruby 1.8 compatibility
        if line.encoding != last_encoding
          # Regexp::FIXEDENCODING (16) is the option bit that //u sets in Ruby 1.9.
          regex = get_regex('<p>(.*)<\/p>', line.encoding, Regexp::FIXEDENCODING)
          last_encoding = line.encoding
        end
      end
      line.match(regex)
    

    In the pathological case (where the input encoding changes every line) this would be just as slow, since you're re-encoding the regex every single time through the loop. But in 99.9% of situations where the encoding is constant for an entire file of hundreds or thousands of lines, this will result in a vast reduction in re-encoding.
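
    If the encoding does switch back and forth, one possible refinement is to keep one compiled regex per encoding instead of only the most recent one. The sketch below is not part of the code above: the class name EncodingAwareMatcher and its methods are hypothetical, and ISO-8859-1 and UTF-8 are just sample encodings (check UTF-16 input against the Ruby version you actually run).

```ruby
# Hypothetical helper (an illustration, not the answer's original code):
# caches one compiled regex per input encoding, so the pattern is
# transcoded only the first time each encoding is seen.
class EncodingAwareMatcher
  def initialize(pattern)
    @pattern = pattern
    # The block runs only on a cache miss and stores the new regex.
    @cache = Hash.new do |cache, encoding|
      cache[encoding] = Regexp.new(@pattern.encode(encoding), Regexp::FIXEDENCODING)
    end
  end

  # Returns the cached regex for the given Encoding, building it if needed.
  def regex_for(encoding)
    @cache[encoding]
  end

  # Matches a line using the regex that corresponds to the line's encoding.
  def match(line)
    regex_for(line.encoding).match(line)
  end
end

matcher = EncodingAwareMatcher.new('<p>(.*)<\/p>')

# Sample input in two encodings (assumed data, for illustration only).
latin1_line = "<p>caf\u00e9</p>".encode('ISO-8859-1')
utf8_line   = "<p>caf\u00e9</p>"

puts matcher.match(latin1_line)[1].encode('UTF-8')
puts matcher.match(utf8_line)[1]
```

    With a cache like this, alternating encodings cost one transcoding per distinct encoding rather than one per change.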

  • 2020-12-10 13:20

    Follow the advice of this page: http://gnuu.org/2009/02/02/ruby-19-common-problems-pt-1-encoding/ and add

    # encoding: utf-8
    

    to the top of your rb file.
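
    As far as I understand it, the magic comment only sets the encoding of literals in the source file itself; it does not transcode input. A minimal sketch of what that might look like in practice, assuming the input arrives as UTF-16LE (the sample string and variable names are made up for illustration):

```ruby
# encoding: utf-8

# The magic comment fixes the encoding of literals in this file only.
pattern = /<p>(.*)<\/p>/

# Sample input, assumed here to arrive as UTF-16LE.
utf16_line = "<p>bar</p>".encode('UTF-16LE')

begin
  utf16_line.match(pattern)
  raised = false
rescue Encoding::CompatibilityError
  # The literal regex is still incompatible with the UTF-16LE input.
  raised = true
end

# Transcoding the input to UTF-8 first lets the literal regex match.
utf8_match = utf16_line.encode('UTF-8').match(pattern)
puts utf8_match[1] if utf8_match
```

    So the comment seems most useful together with transcoding the input to UTF-8, rather than as a replacement for it.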
