How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?

梦如初夏 2020-12-03 03:25

I want to split a sentence into a list of words.

For English and European languages this is easy, just use split()

>>> "This is a sentence.".split()
['This', 'is', 'a', 'sentence.']

How do I do the same for languages such as Chinese that don't use whitespace as the word separator?
9 answers
  • 2020-12-03 03:53

    The best tokenizer tool for Chinese is pynlpir.

    import pynlpir
    pynlpir.open()
    mystring = "你汉语说的很好!"
    tokenized_string = pynlpir.segment(mystring, pos_tagging=False)
    
    >>> tokenized_string
    ['你', '汉语', '说', '的', '很', '好', '!']
    

    Be aware that pynlpir has a notorious but easily fixable licensing problem, for which you can find plenty of solutions on the internet. You simply need to replace the NLPIR.user file in your NLPIR folder with a valid licence downloaded from this repository, then restart your environment.

  • 2020-12-03 03:59

    It's partially possible with Japanese, where you usually have different character classes at the beginning and end of the word, but there are whole scientific papers on the subject for Chinese. I have a regular expression for splitting words in Japanese if you are interested: http://hg.hatta-wiki.org/hatta-dev/file/cd21122e2c63/hatta/search.py#l19
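
    If it helps, here is a rough sketch of that idea (my own illustration, not the regex from the link above): split text wherever the character class changes, treating runs of kanji, hiragana, katakana and Latin/digit characters as separate tokens.

    import re

    # Very rough word-ish splitter based on Unicode block transitions.
    # This only approximates Japanese segmentation; real tokenizers do much more.
    _CLASS_RUN = re.compile(
        r'[\u4e00-\u9fff]+'     # kanji runs
        r'|[\u3040-\u309f]+'    # hiragana runs
        r'|[\u30a0-\u30ff]+'    # katakana runs
        r'|[A-Za-z0-9]+'        # Latin letters and digits
    )

    def rough_tokens(text):
        return _CLASS_RUN.findall(text)

    >>> rough_tokens('私はPythonが好きです')
    ['私', 'は', 'Python', 'が', '好', 'きです']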

  • 2020-12-03 04:01

    just a word of caution: using list( '...' ) (in Py3; that's u'...' for Py2) will not, in the general sense, give you the characters of a unicode string; rather, it will most likely result in a series of 16bit codepoints. this is true for all 'narrow' CPython builds, which account for the vast majority of python installations today. (note that since Python 3.3 and PEP 393, CPython strings always store full code points, so this caveat only applies to older narrow builds such as CPython 2.x on Windows.)

    when unicode was first proposed in the 1990s, it was suggested that 16 bits would be more than enough to cover all the needs of a universal text encoding, as it enabled a move from 128 codepoints (7 bits) and 256 codepoints (8 bits) to a whopping 65'536 codepoints. it soon became apparent, however, that that had been wishful thinking; today, around 100'000 codepoints are defined in unicode version 5.2, and thousands more are pending inclusion. in order for that to become possible, unicode had to move from 16 to (conceptually) 32 bits (although it doesn't make full use of the 32bit address space).

    in order to maintain compatibility with software built on the assumption that unicode was still 16 bits, so-called surrogate pairs were devised, where two 16 bit codepoints from specifically designated blocks are used to express codepoints beyond 65'536, that is, beyond what unicode calls the 'basic multilingual plane', or BMP. the planes above the BMP are jokingly referred to as the 'astral' planes of the encoding, for their relative elusiveness and the constant headache they offer to people working in the field of text processing and encoding.

    now while narrow CPython deals with surrogate pairs quite transparently in some cases, it will still fail to do the right thing in other cases, string splitting being one of those more troublesome cases. in a narrow python build, calling list() on a string that contains a character beyond the BMP will give you the two halves of its surrogate pair as separate items instead of the single character you expect.
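
    A minimal sketch of the effect (this assumes a narrow build such as CPython 2.7 on Windows; on wide builds and on Python 3.3+ both results come out as you would expect):

    # -*- coding: utf-8 -*-
    # U+20000 is a CJK ideograph outside the BMP; on a narrow build it is
    # stored as a surrogate pair, so it shows up as two list items.
    s = u'abc\U00020000'
    print(len(s))    # 5 on a narrow build, 4 on a wide build / Python 3.3+
    print(list(s))   # the last character appears as two surrogate halves on narrow builds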

  • 2020-12-03 04:05

    Try this: http://code.google.com/p/pymmseg-cpp/
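
    For reference, basic usage looks roughly like this (adapted from the project's README as I remember it; treat the exact function and class names as assumptions if your version differs):

    from pymmseg import mmseg

    # load the bundled default dictionaries before segmenting
    mmseg.dict_load_defaults()

    text = 'this is a test 这是一个测试'
    algor = mmseg.Algorithm(text)
    for tok in algor:
        print(tok.text)   # each tok is one segmented word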

  • 2020-12-03 04:07

    list() is the answer for a Chinese-only sentence. For the hybrid English/Chinese text you get in most cases, see the answer at hybrid-split; the answer from Winter is copied below.

    import re

    def spliteKeyWord(text):
        # one match per Chinese character, plus runs of digits and
        # English words (with optional apostrophes)
        regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
        matches = re.findall(regex, text, re.UNICODE)
        return matches
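
    A quick check of how it behaves (illustrative):

    >>> spliteKeyWord("Hello 你好 world123")
    ['Hello', '你', '好', 'world', '123']

    Note that the Chinese part still comes out character by character, just like list(); the regex only keeps English words and digit runs together.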
    
  • 2020-12-03 04:11

    Ok I figured it out.

    What I need can be accomplished by simply using list():

    >>> list(u"这是一个句子")
    [u'\u8fd9', u'\u662f', u'\u4e00', u'\u4e2a', u'\u53e5', u'\u5b50']
    

    Thanks for all your inputs.
