How can I programmatically find the list of codecs known to Python?

Submitted by 霸气de小男生 on 2019-12-12 08:15:17

Question


I know that I can do the following:

>>> import encodings, pprint
>>> pprint.pprint(sorted(encodings.aliases.aliases.values()))
['ascii',
 'base64_codec',
 'big5',
 'big5hkscs',
 'bz2_codec',
 'cp037',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp424',
 'cp437',
 'cp500',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp932',
 'cp949',
 'cp950',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_jp',
 'euc_kr',
 'gb18030',
 'gb2312',
 'gbk',
 'hex_codec',
 'hp_roman8',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'johab',
 'koi8_r',
 'latin_1',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'mbcs',
 'ptcp154',
 'quopri_codec',
 'rot_13',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'tactis',
 'tis_620',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_7',
 'utf_8',
 'uu_codec',
 'zlib_codec']

I also know for sure that this is not a complete list, since it includes only encodings for which an alias exists (e.g. "cp737" is missing), and at least some pseudo-encodings are missing as well (e.g. "string_escape").
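
A quick check with the standard codecs.lookup confirms that codecs absent from the alias table are still perfectly loadable:

>>> import codecs
>>> codecs.lookup('cp737').name  # resolves fine despite having no alias
'cp737'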

As the title of the question says: how can I programmatically get a list of all codecs/encodings known to Python?

If not programmatically: is there a complete list available online?


Answer 1:


I don't think the complete list is stored anywhere in the Python standard library. Instead, codecs are loaded on demand through calls to encodings.search_function(encoding). If you study the code there, it looks like the encoding string is first normalized, and the encodings package is then searched for a submodule whose name matches the normalized encoding.
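
As a minimal sketch of that lookup path (encodings.normalize_encoding is the helper the search function uses; the exact chain of fallbacks it tries is an implementation detail):

>>> import codecs, encodings
>>> encodings.normalize_encoding('utf-8')  # hyphens become underscores
'utf_8'
>>> codecs.lookup('UTF 8').name  # lookup also lowercases the name first
'utf-8'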

The following uses pkgutil to list all the submodules of the encodings package, and then combines them with the names listed in encodings.aliases.aliases.

Unfortunately, encodings.aliases.aliases contains one encoding, tactis, that the module walk does not find, so I generate the complete list by taking the union of the two sets.

import encodings
import os
import pkgutil

# All bundled codecs live as submodules of the encodings package, so
# walking that package's directory enumerates every one of them.
modnames = set(modname for importer, modname, ispkg in pkgutil.walk_packages(
    path=[os.path.dirname(encodings.__file__)], prefix=''))
# Names reachable through an alias (the list shown in the question).
aliases = set(encodings.aliases.aliases.values())

print(modnames - aliases)
# set(['charmap', 'unicode_escape', 'cp1006', 'unicode_internal', 'punycode', 'string_escape', 'aliases', 'palmos', 'mac_centeuro', 'mac_farsi', 'mac_romanian', 'cp856', 'raw_unicode_escape', 'mac_croatian', 'utf_8_sig', 'mac_arabic', 'undefined', 'cp737', 'idna', 'koi8_u', 'cp875', 'cp874', 'iso8859_1'])

print(aliases - modnames)
# set(['tactis'])

codec_names = modnames.union(aliases)
print(codec_names)
# set(['bz2_codec', 'cp1140', 'euc_jp', 'cp932', 'punycode', 'euc_jisx0213', 'aliases', 'hex_codec', 'cp500', 'uu_codec', 'big5hkscs', 'mac_romanian', 'mbcs', 'euc_jis_2004', 'iso2022_jp_3', 'iso2022_jp_2', 'iso2022_jp_1', 'gbk', 'iso2022_jp_2004', 'unicode_internal', 'utf_16_be', 'quopri_codec', 'cp424', 'iso2022_jp', 'mac_iceland', 'raw_unicode_escape', 'hp_roman8', 'iso2022_kr', 'cp875', 'iso8859_6', 'cp1254', 'utf_32_be', 'gb2312', 'cp850', 'shift_jis', 'cp852', 'cp855', 'iso8859_3', 'cp857', 'cp856', 'cp775', 'unicode_escape', 'cp1026', 'mac_latin2', 'utf_32', 'mac_cyrillic', 'base64_codec', 'ptcp154', 'palmos', 'mac_centeuro', 'euc_kr', 'hz', 'utf_8', 'utf_32_le', 'mac_greek', 'utf_7', 'mac_turkish', 'utf_8_sig', 'mac_arabic', 'tactis', 'cp949', 'zlib_codec', 'big5', 'iso8859_9', 'iso8859_8', 'iso8859_5', 'iso8859_4', 'iso8859_7', 'cp874', 'iso8859_1', 'utf_16_le', 'iso8859_2', 'charmap', 'gb18030', 'cp1006', 'shift_jis_2004', 'mac_roman', 'ascii', 'string_escape', 'iso8859_15', 'iso8859_14', 'tis_620', 'iso8859_16', 'iso8859_11', 'iso8859_10', 'iso8859_13', 'cp950', 'utf_16', 'cp869', 'mac_farsi', 'rot_13', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'shift_jisx0213', 'johab', 'mac_croatian', 'cp1255', 'latin_1', 'cp1257', 'cp1256', 'cp1251', 'cp1250', 'cp1253', 'cp1252', 'cp437', 'cp1258', 'undefined', 'cp737', 'koi8_r', 'cp037', 'koi8_u', 'iso2022_jp_ext', 'idna'])
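
If you only want names that actually resolve on the running interpreter, a follow-up sketch is to probe each candidate in codec_names (from the block above) with codecs.lookup. Note that 'aliases' is a package-internal submodule rather than a codec, 'tactis' has no module behind it, and platform-specific codecs such as 'mbcs' exist only on Windows, so some entries are expected to fail:

import codecs

usable = []
for name in sorted(codec_names):
    try:
        codecs.lookup(name)  # raises LookupError for non-codec entries
    except LookupError:
        pass
    else:
        usable.append(name)
print(usable)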



Answer 2:


Either install the gettextcodecs library (one more dependency?) or, more simply, take the code from that library and use it directly. I have not checked whether it works on Python < 3.



Source: https://stackoverflow.com/questions/3824101/how-can-i-programmatically-find-the-list-of-codecs-known-to-python
