How to convert a UTF-8 file to CP1252 on Unix

Submitted by 南楼画角 on 2020-07-22 12:47:05

Question


I'm trying to convert a text file's encoding from UTF-8 to ANSI (CP1252).

I need this because the file is used in a fixed-position Oracle import (external table), which apparently only supports CP1252. If I import a UTF-8 file, some special characters show up as two incorrect characters instead.
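The "two incorrect characters" symptom can be reproduced in a few lines. This is only an illustrative sketch: the function name misread_utf8_as_cp1252 is made up here, and it assumes the importing side simply interprets the UTF-8 bytes as CP1252.

```python
def misread_utf8_as_cp1252(text):
    """Encode text as UTF-8, then decode those bytes as if they were CP1252."""
    raw = text.encode("utf-8")
    # A few byte values (e.g. 0x81) are undefined in CP1252; 'replace' keeps
    # the sketch from raising on them.
    return raw.decode("cp1252", errors="replace")


# 'é' is two bytes in UTF-8 (0xC3 0xA9); read as CP1252 they become two characters.
print(misread_utf8_as_cp1252("é"))  # prints Ã©
```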

I'm working on a Unix machine (my OS is HP-UX). I have been looking for an answer on the web, but I can't find any way to do this conversion.

For example, the POSIX iconv command doesn't seem to offer this choice: UTF8 appears only as a "to" encoding (-t), never as a "from" encoding (-f). iconv -l returns a long list of conversion pairs, but UTF8 is always in the second column only.

How can I convert my file to CP1252 on Unix?


Answer 1:


If your UTF-8 file only contains characters which are also representable as CP1252, you should be able to perform the conversion.

iconv -f utf-8 -t cp1252 <file.utf8 >file.txt
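To find out in advance whether a file contains anything CP1252 cannot represent, a small check along these lines may help. It is a sketch, not part of iconv; the helper name find_unmappable is hypothetical, and the file is assumed to be valid UTF-8.

```python
def find_unmappable(text, encoding="cp1252"):
    """Return (line_no, column, char) for each character that cannot be encoded."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                problems.append((line_no, col, ch))
    return problems


# The euro sign exists in CP1252, the not-equal sign does not.
print(find_unmappable("x≠y\nprix: 5 €"))  # [(1, 2, '≠')]
```

For a real file, the text could be read with open("file.utf8", encoding="utf-8").read() and passed to the same function; an empty list suggests the plain conversion should go through without errors.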

If, however, the UTF-8 text contains some characters which cannot be represented as CP1252, you have a couple of options:

  • Convert anyway, and have the converter omit the problematic characters
  • Convert anyway, and have the converter replace the problematic characters

This should be a conscious choice, so out of the box, iconv doesn't allow you to do this; but there are options to enable this behavior. Look at the -c option for the first behavior, and --unicode-subst for the second.

bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252
x
iconv: (stdin):1:1: cannot convert

bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 -c
xy

bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 --unicode-subst='?'
x?y

This is on OS X; apparently, Linux iconv lacks some of these options. Maybe look at recode and/or write your own simple conversion tool if you don't get the behavior you need out of iconv on your platform.

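#!/usr/bin/env python3
import sys

# Work on raw bytes so the locale does not interfere with decoding.
for line in sys.stdin.buffer:
    # 'replace' substitutes ? for characters CP1252 cannot represent.
    sys.stdout.buffer.write(line.decode('utf-8').encode('cp1252', 'replace'))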

Use 'ignore' instead of 'replace' to drop characters which cannot be represented. The default replacement character is ?, as in the iconv example above.
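The three behaviors (fail, drop, substitute) map onto Python's standard error handlers 'strict', 'ignore' and 'replace'. The following sketch shows them side by side on the same sample as the iconv session; the function name convert is just an illustration.

```python
def convert(data, errors="strict"):
    """Decode UTF-8 bytes and re-encode them as CP1252 with the given error handler."""
    return data.decode("utf-8").encode("cp1252", errors)


sample = "x≠y".encode("utf-8")

print(convert(sample, "ignore"))   # b'xy'
print(convert(sample, "replace"))  # b'x?y'

try:
    convert(sample)  # default 'strict' refuses, like plain iconv
except UnicodeEncodeError as exc:
    print("strict mode refused:", exc.reason)
```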




Answer 2:


Have a look at the Java converter native2ascii. It is part of the JDK installation (up to JDK 8; the tool was removed in JDK 9).

The conversion is done in two steps:

native2ascii -encoding UTF-8 your_file.txt your_file.txt.ascii
native2ascii -reverse -encoding windows-1252 your_file.txt.ascii your_file_new.txt

Characters which exist in UTF-8 but are not supported in CP1252 (including the BOM) are replaced by ?.
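If a stray ? at the start of the output (from the BOM) is unwanted and Python is available, one possible alternative is to decode with the 'utf-8-sig' codec, which skips a leading BOM. This is a sketch of a different tool, not of native2ascii itself; the function name utf8_to_cp1252 is made up for the example.

```python
def utf8_to_cp1252(data, errors="replace"):
    """Convert UTF-8 bytes to CP1252, dropping a leading BOM if one is present."""
    # 'utf-8-sig' behaves like 'utf-8' but skips a leading U+FEFF byte order mark.
    return data.decode("utf-8-sig").encode("cp1252", errors)


with_bom = b"\xef\xbb\xbfcaf\xc3\xa9"  # BOM followed by "café" in UTF-8
print(utf8_to_cp1252(with_bom))  # b'caf\xe9'
```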



Source: https://stackoverflow.com/questions/29231275/how-to-convert-utf8-file-to-cp1252-by-unix
