How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?

Backend · Unresolved · 9 answers · 2503 views

梦如初夏 2020-12-03 03:25

I want to split a sentence into a list of words.

For English and European languages this is easy: just use split()

>>> "This is a sentence.".split()
['This', 'is', 'a', 'sentence.']
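The core of the problem: split() relies on whitespace, which Chinese text does not use between words. A minimal sketch of why the whitespace approach breaks down (the example sentence is my own, not from the question):

```python
# Whitespace splitting works for English...
english = "This is a sentence."
print(english.split())   # ['This', 'is', 'a', 'sentence.']

# ...but a Chinese sentence contains no spaces, so split()
# returns the whole sentence as a single "word".
chinese = "这是一个句子。"
print(chinese.split())   # ['这是一个句子。']

# A naive fallback is one character per token, which ignores
# multi-character words and is rarely what you want:
print(list(chinese))     # ['这', '是', '一', '个', '句', '子', '。']
```

This is why a dedicated segmenter (as in the answers below) is needed: real Chinese words are often two or more characters long, and finding the word boundaries requires a dictionary or a statistical model, not a delimiter.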
9 Answers
  •  不知归路
    2020-12-03 03:53

    The best tokenizer for Chinese is pynlpir.

    import pynlpir

    pynlpir.open()  # initialise the NLPIR engine (needs a valid licence, see below)
    mystring = "你汉语说的很好!"
    # Segment into words only; pos_tagging=True would also return part-of-speech tags
    tokenized_string = pynlpir.segment(mystring, pos_tagging=False)

    >>> tokenized_string
    ['你', '汉语', '说', '的', '很', '好', '!']
    

    Be aware that pynlpir has a notorious but easily fixable licensing problem, for which you can find plenty of solutions online. You simply need to replace the NLPIR.user file in your NLPIR folder with a valid licence downloaded from this repository, then restart your environment.
