Question
I'm trying to convert a text file's encoding from UTF-8 to ANSI (CP1252).
I need this because the file is used in a fixed-position Oracle import (external table), which apparently only supports CP1252. If I import a UTF-8 file, some special characters turn up as two incorrect characters instead.
I'm working on a Unix machine (my OS is HP-UX). I have been looking for an answer on the web but haven't found any way to do this conversion.
For example, the POSIX iconv command doesn't seem to offer this option; in fact, UTF8 is available only as a "to" encoding (-t) but never as a "from" encoding (-f). iconv -l returns a long list of conversion pairs, but UTF8 always appears only in the second column.
How can I convert my file to CP1252 on UNIX?
Answer 1:
If your UTF-8 file only contains characters which are also representable as CP1252, you should be able to perform the conversion.
iconv -f utf-8 -t cp1252 <file.utf8 >file.txt
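Before converting, it can help to know whether the file contains anything CP1252 cannot represent. The following is a minimal sketch in Python 3, assuming the text has already been decoded from UTF-8; the function name find_unencodable is hypothetical and not part of any tool mentioned here.

```python
def find_unencodable(text, encoding="cp1252"):
    """Return (line, column, character) for each character the codec rejects."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                problems.append((line_no, col, ch))
    return problems


# "\u2260" is the not-equal sign, "\u20ac" is the euro sign (which CP1252 has).
sample = "x\u2260y\nprice: 5 \u20ac\n"
problems = find_unencodable(sample)
print([(l, c, "U+%04X" % ord(ch)) for l, c, ch in problems])
# prints [(1, 2, 'U+2260')]
```

An empty result suggests that the plain conversion should go through without any of the fallback options.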
If, however, the UTF-8 text contains some characters which cannot be represented as CP1252, you have a couple of options:
- Convert anyway, and have the converter omit the problematic characters
- Convert anyway, and have the converter replace the problematic characters
This should be a conscious choice, so out of the box, iconv doesn't allow you to do this; but there are options to enable this behavior. Look at the -c option for the first behavior, and --unicode-subst for the second.
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252
x
iconv: (stdin):1:1: cannot convert
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 -c
xy
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 --unicode-subst='?'
x?y
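Where iconv does not offer a substitution option, a comparable effect can be sketched in Python 3 with a custom codec error handler. This is only an illustration: the handler name codepoint_placeholder and the <U+XXXX> placeholder format are assumptions chosen for the example, not something iconv produces by default.

```python
import codecs


def codepoint_placeholder(exc):
    """Error handler: replace each unencodable character with <U+XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Resume encoding right after the characters that failed.
    return replacement, exc.end


# Register the handler under a (hypothetical) name usable as errors=...
codecs.register_error("codepoint_placeholder", codepoint_placeholder)


def to_cp1252(text):
    """Encode text as CP1252, marking unrepresentable characters visibly."""
    return text.encode("cp1252", errors="codepoint_placeholder")


print(to_cp1252("x\u2260y"))  # input is x≠y; prints b'x<U+2260>y'
```

A visible placeholder makes lost characters easy to find afterwards, but it changes the length of the line, which matters for fixed-position data.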
This is on OS X; apparently, Linux iconv lacks some of these options. Maybe look at recode and/or write your own simple conversion tool if you don't get the behavior you need out of iconv on your platform.
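#!/usr/bin/env python3
import sys

# Work on raw bytes so the locale's default encoding plays no part.
for line in sys.stdin.buffer:
    # Characters that CP1252 cannot represent are replaced with '?'.
    sys.stdout.buffer.write(line.decode('utf-8').encode('cp1252', 'replace'))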
Put 'ignore' instead of 'replace' to drop characters which cannot be represented. The default replacement character is ? like in the iconv example above.
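The same idea can be applied to whole files rather than standard input. The following is a rough sketch for Python 3; the function name convert_file and its parameters are hypothetical, and the file names in the commented call simply mirror the iconv example.

```python
def convert_file(src_path, dst_path, errors="strict"):
    """Re-encode a UTF-8 text file as CP1252.

    errors: "strict" raises UnicodeEncodeError on the first character CP1252
    cannot represent, "replace" writes '?', "ignore" drops the character.
    """
    # newline="" disables newline translation, so line endings are copied as-is.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="cp1252", errors=errors, newline="") as dst:
        for line in src:
            dst.write(line)


# Example call (paths are placeholders):
# convert_file("file.utf8", "file.txt", errors="replace")
```

For a fixed-position import, "replace" is probably the safer policy: it writes one byte for each original character, whereas "ignore" shortens the record and shifts every column after the dropped character.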
Answer 2:
Have a look at this Java converter: native2ascii. It is part of the JDK installation.
The conversion is done in two steps:
native2ascii -encoding UTF-8 <your_file.txt> <your_file.txt.ascii>
native2ascii -reverse -encoding windows-1252 <your_file.txt.ascii> <your_file_new.txt>
Characters which are present in the UTF-8 input but not supported in CP1252 (including the BOM) are replaced by ?.
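If a leading BOM turning into ? is unwanted, one option is to strip it while decoding. The sketch below uses Python 3's "utf-8-sig" codec, which drops a BOM at the start of the input if one is present; the function name utf8_bytes_to_cp1252 is hypothetical.

```python
def utf8_bytes_to_cp1252(data, errors="replace"):
    """Convert UTF-8 bytes to CP1252 bytes, discarding a leading BOM."""
    # "utf-8-sig" behaves like "utf-8" but skips a BOM at the very start.
    text = data.decode("utf-8-sig")
    return text.encode("cp1252", errors)


print(utf8_bytes_to_cp1252(b"\xef\xbb\xbfabc"))  # prints b'abc'
```

Only a BOM at the very start is removed this way; other unrepresentable characters are still handled according to the errors argument.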
Source: https://stackoverflow.com/questions/29231275/how-to-convert-utf8-file-to-cp1252-by-unix