Python - pyparsing unicode characters

烂漫一生 提交于 2019-11-29 03:48:53

As a general rule, do not process encoded bytestrings: make them into proper unicode strings (by calling their .decode method) as soon as possible, do all of your processing always on unicode strings, then, if you have to for I/O purposes, .encode them back into whatever bytestring encoding you require.

If you're talking about literals, as it seems you are in your code, the "as soon as possible" is at once: use u'...' to express your literals. In a more general case, where you're forced to do I/O in encoded form, it's immediately after input (just as it's immediately before output if you need to perform output in a specific encoded form).

Pyparsing's printables only deals with strings in the ASCII range of characters. You want printables in the full Unicode range, like this:

unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode) 
                                        if not unichr(c).isspace())

Now you can define trans using this more complete set of non-space characters:

trans = Word(unicodePrintables)

I was unable to test against your Hindi test string, but I think this will do the trick.

(If you are using Python 3, then there is no separate unichr function, and no xrange generator, just use:

unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode) 
                                        if not chr(c).isspace())

EDIT:

With the recent release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.

import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!