Searching for Unicode characters in Python

野趣味 2020-12-22 11:50

I'm working on an NLP project based on Python/NLTK with non-English Unicode text. For that, I need to search for a Unicode string inside a sentence.

There is a .tx

1 answer
  • 2020-12-22 12:09

    If I understand correctly, you just have to split the sentence into words, loop over them, and check whether each one starts or ends with the required characters, e.g.:

    >>> text = 'AASFG BBBSDC FEKGG SDFGF'
    >>> sentence = text.split()  # ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']
    >>> [word for word in sentence if word.endswith("GF")]
    ['SDFGF']
    

    text.split() could probably be replaced with something like nltk.tokenize.word_tokenize(text)
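Since the question is about non-English text, here is a minimal sketch of the same idea with accented characters. The function name `words_matching` and the sample sentence are made up for illustration, and the NFC normalization step is an assumption about the input (it only matters if the text mixes composed and decomposed forms of the same character):

```python
import unicodedata


def words_matching(text, prefix="", suffix=""):
    """Return the words in `text` that start with `prefix` and end with `suffix`."""
    # Assumption: normalizing to NFC makes composed ("é") and decomposed
    # ("e" + combining accent) spellings of the same character compare equal.
    def norm(s):
        return unicodedata.normalize("NFC", s)

    prefix, suffix = norm(prefix), norm(suffix)
    # An empty prefix or suffix matches every word.
    return [w for w in norm(text).split()
            if w.startswith(prefix) and w.endswith(suffix)]


# Hypothetical sample text, not taken from the question.
sample = "naïve café résumé cafe"
print(words_matching(sample, suffix="é"))  # ['café', 'résumé']
```

In Python 3 every `str` is already Unicode, so `startswith` and `endswith` work on such text without extra decoding, provided the file was read with the right encoding.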

    Update, regarding comment:

    How can I get the word in front of that and behind it?

    The enumerate function can be used to give each word a number, like this:

    >>> print(list(enumerate(sentence)))
    [(0, 'AASFG'), (1, 'BBBSDC'), (2, 'FEKGG'), (3, 'SDFGF')]
    

    Then if you do the same loop, but preserve the index:

    >>> results = [(idx, word) for (idx, word) in enumerate(sentence) if word.endswith("GG")]
    >>> print(results)
    [(2, 'FEKGG')]
    

    ..you can use the index to get the next or previous item:

    >>> for r in results:
    ...     r_idx = r[0]
    ...     print("Prev", sentence[r_idx-1])
    ...     print("Next", sentence[r_idx+1])
    ...
    Prev BBBSDC
    Next SDFGF
    

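    You'd need to handle the case where the match is the very first or last word (r_idx == 0 or r_idx == len(sentence) - 1): sentence[r_idx-1] silently wraps around to the last word when r_idx is 0, and sentence[r_idx+1] raises an IndexError at the last index.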
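One possible way to cover those boundary cases is sketched below. The helper name `matches_with_context` is hypothetical, and returning `None` for a missing neighbour is just one convention among several:

```python
def matches_with_context(words, suffix):
    """Yield (previous, match, next) for each word ending with `suffix`.

    A missing neighbour (match at the start or end of the list) is None.
    """
    for idx, word in enumerate(words):
        if word.endswith(suffix):
            # Guard idx == 0 explicitly: words[-1] would wrap to the last word.
            prev_word = words[idx - 1] if idx > 0 else None
            # Guard the last index: words[idx + 1] would raise IndexError.
            next_word = words[idx + 1] if idx < len(words) - 1 else None
            yield prev_word, word, next_word


sentence = ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']
print(list(matches_with_context(sentence, "GG")))  # [('BBBSDC', 'FEKGG', 'SDFGF')]
print(list(matches_with_context(sentence, "GF")))  # [('FEKGG', 'SDFGF', None)]
```

Because it is a generator, it can be looped over directly without building the intermediate list of indices first.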
