I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?
try this gem. It converts Unicode or non-ASCII punctuation and symbols to nearest ASCII punctuation and symbols
https://github.com/qwuen/punctuate
example usage: "100٪".punctuate => "100%"
the gem uses the reference in https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.
You can directly use unicode characters if you just add #Encoding: UTF-8
to the top of your file. Then you can freely use ä, ǹ, ú and so on in your source code.
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
String#dump
:If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt
that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$
(plus a trailing newline) – running the following code (executed either in irb
or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that €
is U+20AC, £
is U+00A3, and ¥
is U+00A5. ($
is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump
. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect'
prints an unhappy face for me, rather than an escape sequence. Changing inspect
to dump
, though, gets me the escape sequence back.)
String#encode
and rescue
:Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
begin
c.encode("ASCII") # try to encode it to ASCII
rescue Encoding::UndefinedConversionError # but if that fails
encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
To use a unicode character in Ruby use the "\uXXXX" escape; where XXXX is the UTF-16 codepoint. see http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/