Convert JIS X 208 code to UTF-8 in Python

Submitted by 冷暖自知 on 2021-02-20 05:12:13

Question


Say I have the kanji "亜", whose JIS X 0208 code is 0x3021 in hex. I want my Python program to convert that code into its UTF-8 form, E4BA9C, so that I can pass the URL-encoded string in a URL like this:

http://jisho.org/api/v1/search/words?keyword=%E4%BA%9C

I'm using Python 2.7.12, but I'm open to a Python 3 solution as well.


Answer 1:


Python exposes JIS X 0208 through its ISO 2022 codecs, such as iso2022_jp.

>>> '亜'.encode('iso2022_jp')
b'\x1b$B0!\x1b(B'

If I saw those bytes without the surrounding escape sequences, I would have to know which version of JIS X 0208 is in use, but I'm entirely pattern matching on Wikipedia at this point anyway.

>>> import urllib.parse
>>> b = b'\033$B' + bytes.fromhex('3021')
>>> c = b.decode('iso2022_jp')
>>> c
'亜'
>>> urllib.parse.quote(c)
'%E4%BA%9C'

(This is Python 3.)
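The steps above can be wrapped in a small helper. This is only a minimal sketch: it assumes the input is a two-byte JIS X 0208 code given as a hex string, and that the ESC $ B designation is the right one for your data. The function name jis0208_to_quoted and the two constants are hypothetical, not part of any library.

```python
import urllib.parse

ESC_JIS_X_0208 = b'\x1b$B'  # ESC $ B: switch to JIS X 0208 (assumed variant)
ESC_ASCII = b'\x1b(B'       # ESC ( B: switch back to ASCII


def jis0208_to_quoted(jis_hex):
    """Return the URL-encoded UTF-8 form of a JIS X 0208 code such as '3021'."""
    raw = bytes.fromhex(jis_hex)
    # Frame the raw bytes with escape sequences so the iso2022_jp codec
    # knows they belong to the JIS X 0208 character set.
    char = (ESC_JIS_X_0208 + raw + ESC_ASCII).decode('iso2022_jp')
    # quote() encodes the character as UTF-8 by default before escaping.
    return urllib.parse.quote(char)


print(jis0208_to_quoted('3021'))  # %E4%BA%9C
```

The result can be appended directly to the keyword= parameter of the URL in the question.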




Answer 2:


This solution may not be standard, but it seems to work.

CODE

import urllib.parse


def jis_to_euc_jp(jis_hex: str) -> str:
    """
    Convert a JIS code (hex string) to the corresponding EUC-JP code (hex string).

    The offsets are taken from the table at https://pentan.info/doc/jis_list.html

    8080 = A1A1 - 2121
    4B8080 = 8FA1C1 - 442141
    """
    int_jis = int(jis_hex, 16)
    step = 0x8080 if int_jis <= 0x7426 else 0x4B8080
    return hex(int_jis + step).upper()[2:]  # 0XB0A1 -> B0A1


def eucjp_to_utf_16be(eucjp_hex: str) -> str:
    # Despite the name, this returns the decoded character as a str.
    byte_ch = bytes.fromhex(eucjp_hex)
    real_char = byte_ch.decode('euc_jp')  # '亜'
    # code = real_char.encode('utf-8').hex().upper()  # E4BA9C
    return real_char


def main():
    for v in ['亜'.encode('utf-8').hex().upper(),  # when the glyph is known: E4BA9C

              # when only the JIS code is known, find the real character
              jis_to_euc_jp('3021'),  # B0A1; Python's standard encodings provide euc_jp, so we need the relation between JIS and EUC-JP
              eucjp_to_utf_16be(jis_to_euc_jp('3021'))
              ]:
              ]:
        print(urllib.parse.quote(v))


if __name__ == '__main__':
    main()

OUTPUT

E4BA9C
B0A1
%E4%BA%9C
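For plain two-byte JIS X 0208 codes, adding 0x8080 is the same as setting the high bit of each byte, so the conversion can also be sketched at the byte level. This is an illustration under that assumption only: it does not handle the three-byte 8F-prefixed range covered by the second offset above, and the function name jis0208_to_char is hypothetical.

```python
import urllib.parse


def jis0208_to_char(jis_hex):
    """Decode a two-byte JIS X 0208 code (hex string) via EUC-JP."""
    raw = bytes.fromhex(jis_hex)
    # Each JIS X 0208 byte lies in the range 0x21..0x7E.
    if len(raw) != 2 or not all(0x21 <= b <= 0x7E for b in raw):
        raise ValueError('expected a two-byte JIS X 0208 code, got %r' % jis_hex)
    # Setting the high bit of both bytes gives the EUC-JP form: 3021 -> B0A1.
    euc = bytes(b | 0x80 for b in raw)
    return euc.decode('euc_jp')


char = jis0208_to_char('3021')
print(char)                      # 亜
print(urllib.parse.quote(char))  # %E4%BA%9C
```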

REFERENCE

  • Standard Encodings: https://docs.python.org/3.7/library/codecs.html#standard-encodings
  • JIS TABLE: https://pentan.info/doc/jis_list.html


Source: https://stackoverflow.com/questions/43239935/convert-jis-x-208-code-to-utf-8-in-python
