Question
I'm making some changes to the Linux locale files in /usr/share/i18n/locales (such as pt_BR), and format strings (like %d-%m-%Y %H:%M) must be specified in Unicode notation, where each character (in this case, ASCII) is represented as <U00xx>.
So a text like this:
LC_TIME
d_t_fmt "%a %d %b %Y %T %Z"
d_fmt "%d-%m-%Y"
t_fmt "%T"
Must be:
LC_TIME
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
d_fmt "<U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>"
t_fmt "<U0025><U0054>"
Thus I need a command-line script (be it bash, Python, Perl, or something else) that would take an input like %d-%m-%Y and convert it to <U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>.
All characters in the input string would be ASCII chars (from 0x20 to 0x7F), so this is actually a fancier "char-to-hex-string" conversion.
Could anyone please help me? My skills in bash scripting are very limited, and even worse in Python.
Bonus for elegant, explained solutions.
Thanks!
(by the way, this would be the "reverse" script for my previous question)
Answer 1:
Every char with file input
If you want to convert every character of a file to its Unicode representation, this one-liner does it:
while IFS= read -r -n1 c;do printf "<U%04X>" "'$c"; done < ./infile
Every char on STDIN
If you want a Unix-like tool that converts input on STDIN to this Unicode notation, use this:
uni(){ c=$(cat); for((i=0;i<${#c};i++)); do printf "<U%04X>" "'${c:i:1}"; done; }
Proof of Concept
$ echo "abc" | uni
<U0061><U0062><U0063>
Only chars between double-quotes
#!/bin/bash
flag=0
while IFS= read -r -n1 c; do
if [[ "$c" == '"' ]]; then
((flag^=1))
printf "%c" "$c"
elif [[ "$c" == $'\0' ]]; then
echo
elif ((flag)); then
printf "<U%04X>" "'$c"
else
printf "%c" "$c"
fi
done < /path/to/infile
Proof of Concept
$ cat ./unime
LC_TIME
d_t_fmt "%a %d %b %Y %T %Z"
d_fmt "%d-%m-%Y"
t_fmt "%T"
abday "Dom";"Seg";/
here is a string with "multiline
quotes";/
$ ./uni.sh
LC_TIME
d_t_fmt "<U0025><U0061><U0020><U0025><U0064><U0020><U0025><U0062><U0020><U0025><U0059><U0020><U0025><U0054><U0020><U0025><U005A>"
d_fmt "<U0025><U0064><U002D><U0025><U006D><U002D><U0025><U0059>"
t_fmt "<U0025><U0054>"
abday "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/
here is a string with "<U006D><U0075><U006C><U0074><U0069><U006C><U0069><U006E><U0065>
<U0071><U0075><U006F><U0074><U0065><U0073>";/
Explanation
Pretty simple, really:
while IFS= read -r -n1 c; : Iterate over the input one character at a time (via -n1) and store the character in the variable c. IFS= and -r are there so that the read builtin doesn't strip whitespace through word splitting or interpret backslash escape sequences, respectively.
if [[ "$c" == '"' ]]; : If the current character is a double quote...
((flag^=1)) : ...invert the value of flag (0->1 or 1->0).
elif [[ "$c" == $'\0' ]]; : If c is empty, echo a newline. A bash string cannot hold a NUL, so $'\0' expands to the empty string, which is what read leaves in c when it reaches the end of a line.
elif ((flag)) : If flag is 1, perform the Unicode transliteration.
printf "<U%04X>" "'$c" : The magic that does the Unicode transliteration. The single quote before $c is mandatory: it tells printf to use the numeric code of the character that follows as the argument.
else printf "%c" "$c" : Print the character unchanged, with no transliteration.
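For comparison, the same quote-toggling logic can be sketched in Python. This is only an illustration under assumptions, not part of the answer above: the function names to_unicode_notation and convert_quoted are made up here, and Python 3 is assumed.

```python
def to_unicode_notation(text):
    """Return every character of text as <UXXXX> (uppercase hex code point)."""
    return "".join("<U{0:04X}>".format(ord(ch)) for ch in text)


def convert_quoted(source):
    """Transliterate only the characters between pairs of double quotes.

    Hypothetical helper mirroring the bash loop: a flag is flipped on every
    double quote, and newlines are always passed through untouched.
    """
    out = []
    in_quotes = False
    for ch in source:
        if ch == '"':
            in_quotes = not in_quotes  # same role as ((flag^=1))
            out.append(ch)
        elif in_quotes and ch != "\n":
            out.append(to_unicode_notation(ch))
        else:
            out.append(ch)
    return "".join(out)


print(convert_quoted('t_fmt "%T"'))
```

Like the bash version, this does not know about escaped quotes inside a string, so it is only suitable for simple locale entries such as the ones shown.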
Answer 2:
Using Python
#!/usr/bin/env python3.2
import sys
text = sys.argv[1]
encoded = "".join("<U{0:04X}>".format(ord(char)) for char in text)
print(encoded)
Usage:
$ python3 file.py "enter_input"
<U0065><U006E><U0074><U0065><U0072><U005F><U0069><U006E><U0070><U0075><U0074>
(The same script should work with both Python 3.x and 2.x. Just change the version in the shebang to the one you have.)
Explanation:
We need to import the sys module to read the command-line arguments.
The sys.argv list is the list of all command-line arguments. The entry [0] is the program name, entry [1] is the first argument, etc.
f(char) for char in text is a generator expression. It loops over each character in the text variable, applies the function f to it, and yields the results lazily as an iterable.
ord(char) finds the Unicode code point of the character.
"<U{0:04X}>".format(x) is string formatting. The format string takes one input, x, and formats it with 04X, meaning zero-padded, width 4, uppercase hexadecimal.
"".join(it) concatenates all elements of the iterable it. The "" means the separator is an empty string.
print(encoded) writes the string encoded to stdout.
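If you would rather pipe text in than pass it as an argument, a variant along these lines might work. It is a sketch only: convert_lines and main are names invented for this example, Python 3 is assumed, and the line terminator is dropped on purpose so that a trailing newline from echo is not transliterated.

```python
import sys


def to_unicode_notation(text):
    """Return every character of text as <UXXXX> (uppercase hex code point)."""
    return "".join("<U{0:04X}>".format(ord(ch)) for ch in text)


def convert_lines(lines):
    """Yield the converted form of each input line, without its newline.

    `lines` can be any iterable of strings, for example sys.stdin.
    """
    for line in lines:
        yield to_unicode_notation(line.rstrip("\n"))


def main():
    # Hypothetical entry point: convert stdin line by line.
    for converted in convert_lines(sys.stdin):
        print(converted)
```

To use it as a filter, save it to a file, call main() at the bottom, and run something like echo "%d-%m-%Y" | python3 file.py.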
Answer 3:
echo -n "aä" | ruby -KU -e '$<.chars{|c| print "<U"+"%04X"%c.unpack("U*")[0]+">"}; puts'
This outputs <U0061><U00E4>.
The -KU option is equivalent to setting $KCODE = "U", which makes Ruby treat the input as UTF-8.
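The same non-ASCII input can be checked in Python 3, where ord() already returns the code point of a character. A small sketch (the function name to_unicode_notation is just an illustrative choice):

```python
def to_unicode_notation(text):
    """Return every character of text as <UXXXX> (uppercase hex code point).

    The 4 in 04X is a minimum width, so code points above U+FFFF
    simply come out with more than four hex digits.
    """
    return "".join("<U{0:04X}>".format(ord(ch)) for ch in text)


# "\u00e4" is the letter a with diaeresis, written as an escape so the
# example does not depend on the encoding of the source file.
print(to_unicode_notation("a\u00e4"))  # prints <U0061><U00E4>
```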
Answer 4:
Shell script solution:
#!/bin/bash
# read -n1 is a bash extension, so this needs bash rather than a plain POSIX sh.
while IFS= read -r -n1 c; do
    printf "<U%04X>" "'$c"
done
This reads standard input and prints to standard output (assuming you've put the script into the executable file toUnicode.sh):
> echo "hello" | toUnicode.sh
<U0068><U0065><U006C><U006C><U006F><U0000>
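This also prints a trailing <U0000>. That is not an EOF character: echo appends a newline, read -n1 consumes it as the line delimiter and leaves c empty, and printf then formats the empty argument as 0. Feeding the input with printf '%s' "hello" instead of echo avoids it. You can alter this script to suit your needs, whether you want to read the input one line at a time, trim it, or change it in some other way.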
Source: https://stackoverflow.com/questions/5527981/script-to-convert-ascii-chars-to-uxxx-unicode-notation