All possible wordform completions of a (biomedical) word's stem

I'm familiar with word stemming and completion from the tm package in R.

I'm trying to come up with a quick-and-dirty method for finding all variants of a given word within some corpus. For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte".

If I had to do it right now, I would probably just go with something like:

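library(tm)
library(RWeka)
data("crude")  # sample Reuters corpus shipped with tm
# Unique vocabulary across all documents in the corpus
dictionary <- unique(unlist(lapply(crude, words)))
# Keep every dictionary word that contains the Lovins stem of the input
grep(pattern = LovinsStemmer("company"),
    ignore.case = TRUE, x = dictionary, value = TRUE)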

I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.

I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.

This solution requires preprocessing your corpus, but once that is done, finding a word's variants is a very quick dictionary lookup.

from collections import defaultdict
from stemming.porter2 import stem

# Load the word list, one word per line.
with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

# Preprocessing: group every word under its stem.
stems = defaultdict(list)

for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    # Lookup: stem the query and print all words sharing that stem.
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])

For the /usr/share/dict/words corpus, this produces the result:

['leukocyte', "leukocyte's", 'leukocytes']

It uses the stemming module, which can be installed with:

pip install stemming
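
To apply the same idea to your own corpus rather than /usr/share/dict/words, the index can be built directly from the corpus text. The following is only a minimal sketch under stated assumptions, not part of the approach above: build_stem_index, variants, and toy_stem are hypothetical names, and toy_stem is a deliberately crude stand-in stemmer so that the example runs without third-party packages. In practice you would presumably pass stem from stemming.porter2 (or another stemmer) as stem_fn.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Crude stand-in stemmer (an assumption for this sketch, not Porter2).

    Strips the first matching suffix from a short, fixed list, provided
    at least three characters remain.
    """
    for suffix in ("ic", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(text, stem_fn):
    """Map each stem to the sorted list of distinct word forms seen in text."""
    index = defaultdict(set)
    # Lowercase the text and treat each run of ASCII letters as a token.
    for token in re.findall(r"[a-z]+", text.lower()):
        index[stem_fn(token)].add(token)
    return {stem: sorted(forms) for stem, forms in index.items()}


def variants(word, index, stem_fn):
    """Return every word form in the index that shares the stem of word."""
    return index.get(stem_fn(word.lower()), [])


# Made-up sample text standing in for a real corpus.
corpus = (
    "Leukocytes migrate to tissue. A leukocyte count and "
    "leukocytic infiltration were noted."
)
index = build_stem_index(corpus, toy_stem)
print(variants("leukocyte", index, toy_stem))
```

With the toy stemmer and the sample text, this prints ['leukocyte', 'leukocytes', 'leukocytic']. Which variants are grouped together depends entirely on the stemmer passed in, so a more or less aggressive stemmer will change the result.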

Source: https://stackoverflow.com/questions/31596476/all-possible-wordform-completions-of-a-biomedical-words-stem

Tags: python, nlp, bioinformatics, text-mining
