NLTK word_tokenize on French text is not working properly

走远了吗. Submitted on 2019-12-04 11:15:06

I don't think there's an explicit French model for word_tokenize (which is the modified treebank tokenizer used for the English Penn Treebank).

The word_tokenize function performs sentence tokenization using the sent_tokenize function before the actual word tokenization. The language argument in word_tokenize is only used for the sent_tokenize part.

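Alternatively, you can use the MosesTokenizer, which has some language-dependent regexes and does support French. (Newer NLTK releases no longer ship nltk.tokenize.moses; the tokenizer is maintained in the separate sacremoses package.)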

>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer(lang='fr')
>>> sent = u"Le télétravail n'aura pas d'effet sur ma vie"
>>> moses.tokenize(sent)
[u'Le', u't\xe9l\xe9travail', u'n&apos;', u'aura', u'pas', u'd&apos;', u'effet', u'sur', u'ma', u'vie']

If you don't want Moses to escape special XML characters (such as the apostrophe), you can do:

>>> moses.tokenize(sent, escape=False)
[u'Le', u't\xe9l\xe9travail', u"n'", u'aura', u'pas', u"d'", u'effet', u'sur', u'ma', u'vie']

Here is why splitting n' and d' is useful in French NLP.

Linguistically, separating n' and d' makes sense because they are clitics: they have their own syntactic and semantic properties but are bound to a root/host.

In French, ne ... pas is a discontinuous constituent that marks negation. Ne is reduced to the clitic n' because the following word begins with a vowel, so splitting n' from aura makes it easier to identify ne ... pas.

The d' case has the same phonetic motivation: the vowel onset of the following word turns de effet into d'effet.
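As a rough illustration of that point, here is a minimal sketch of spotting ne ... pas in a token list. The helper find_negations, its window parameter, and the two small word sets are hypothetical names made up for this example; real negation detection would also need to cover partners such as plus or jamais.

```python
# Toy illustration: once clitics are split off, "ne ... pas" becomes a token lookup.
NE_FORMS = {"ne", "n'"}      # full and elided forms of the negative clitic
NEG_PARTNERS = {"pas"}       # assumption: only "pas" is handled here

def find_negations(tokens, window=4):
    """Return (ne_index, partner_index) pairs for each ne/n' followed by a partner."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok.lower() in NE_FORMS:
            # Look ahead a few tokens for the second half of the negation.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                if tokens[j].lower() in NEG_PARTNERS:
                    pairs.append((i, j))
                    break
    return pairs

split_tokens = ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']
unsplit_tokens = ['Le', 'télétravail', "n'aura", 'pas', "d'effet", 'sur', 'ma', 'vie']

print(find_negations(split_tokens))    # [(2, 4)]
print(find_negations(unsplit_tokens))  # []
```

With n' left glued to aura, the same lookup finds nothing, which is the practical cost of not splitting the clitic.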

Looking at the source of word_tokenize reveals that the language argument is only used to determine how to split the input into sentences. For tokenization at the word level, a (slightly modified) TreebankWordTokenizer is used, which works best for English input and contractions like don't. From nltk/tokenize/__init__.py:

_treebank_word_tokenizer = TreebankWordTokenizer()
# ... some modifications done
def word_tokenize(text, language='english', preserve_line=False):
    # ...
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
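The structure of that function can be mimicked with a small standard-library toy. This is not NLTK's implementation: toy_sent_tokenize, toy_word_tokenize, and the abbreviation lists are hypothetical stand-ins, meant only to show that a language argument wired this way reaches the sentence splitter and nothing else.

```python
import re

# Toy stand-ins, NOT NLTK's code: per-language sentence splitting,
# followed by one shared word-level tokenizer.
ABBREVIATIONS = {"english": {"Mr.", "Dr."}, "french": {"M.", "Mme."}}  # hypothetical lists

def toy_sent_tokenize(text, language="english"):
    """Split after '.', '!' or '?' unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word not in ABBREVIATIONS[language]:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def toy_word_tokenize(text, language="english"):
    """language is passed to the sentence splitter only, never to the word step."""
    return [tok for sent in toy_sent_tokenize(text, language)
            for tok in re.findall(r"\w+'\w+|\w+|[^\w\s]", sent)]

text = "Il n'aura pas d'effet."
print(toy_word_tokenize(text, "english") == toy_word_tokenize(text, "french"))  # True
```

In this toy, switching the language changes where "M. Dupont arrive." is cut into sentences, but the word-level treatment of n'aura is identical either way.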

To get your desired output, you might want to consider using a different tokenizer, such as a RegexpTokenizer, as follows:

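from nltk.tokenize import RegexpTokenizer

txt = "Le télétravail n'aura pas d'effet sur ma vie"
# elided d'/n'/l' first, then words, then dollar amounts, then any other non-space run
pattern = r"[dnl]['´`]|\w+|\$[\d\.]+|\S+"
tokenizer = RegexpTokenizer(pattern)
tokenizer.tokenize(txt)
# ['Le', 'télétravail', "n'", 'aura', 'pas', "d'", 'effet', 'sur', 'ma', 'vie']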

My knowledge of French is limited and this only solves the stated problem. For other cases you will have to adapt the pattern. You can also look at the implementation of the TreebankWordTokenizer for ideas of a more complex solution. Also keep in mind that this way you will need to split sentences beforehand, if necessary.
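One possible way to adapt the pattern, sketched with only the standard re module: the pattern below and the helper name tokenize_fr are assumptions made for illustration, not anything provided by NLTK. It covers more elided forms (c', j', m', s', t', qu') and anchors them at a word start with \b.

```python
import re

# Hypothetical pattern (not from NLTK): an elided clitic at a word start,
# else a run of word characters, else a single punctuation mark.
FRENCH_TOKEN = re.compile(r"\b(?:qu|[cdjlmnst])['’]|\w+|[^\w\s]", re.IGNORECASE)

def tokenize_fr(text):
    """Split text into tokens, keeping elided clitics such as l', n', qu' separate."""
    return FRENCH_TOKEN.findall(text)

print(tokenize_fr("L'école n'a pas d'effet qu'on attend."))
# ["L'", 'école', "n'", 'a', 'pas', "d'", 'effet', "qu'", 'on', 'attend', '.']
```

This is still a sketch: lexicalized forms such as aujourd'hui are not kept whole, and hyphenated words are split at the hyphen, so a real pipeline would need an exception list on top of it.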
