all possible wordform completions of a (biomedical) word's stem

删除回忆录丶 submitted on 2019-12-01 05:46:19

Question


I'm familiar with word stemming and completion from the tm package in R.

I'm trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus). For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte".

If I had to do it right now, I would probably just go with something like:

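library(tm)
library(RWeka)
data("crude")  # sample corpus shipped with tm
# Vocabulary: every distinct word in the corpus
dictionary <- unique(unlist(lapply(crude, words)))
# Keep the entries that contain the Lovins stem of the query word
grep(pattern = LovinsStemmer("company"),
    ignore.case = TRUE, x = dictionary, value = TRUE)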

I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.
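
For comparison, the grep step above can be sketched in Python. This is only a rough equivalent: it assumes the stem has already been computed by some stemmer, and both the stem "leukocyt" and the small vocabulary are made up for illustration (they are not output from LovinsStemmer or from the crude corpus). The helper name words_containing is hypothetical.

```python
import re

def words_containing(stem, vocabulary):
    """Return vocabulary entries that contain `stem`, ignoring case."""
    # re.escape keeps any regex metacharacters in the stem literal.
    pattern = re.compile(re.escape(stem), re.IGNORECASE)
    return [w for w in vocabulary if pattern.search(w)]

# Hypothetical stem and vocabulary, for illustration only.
vocabulary = ["leukocyte", "Leukocytes", "leukocytic", "lymphocyte", "leukemia"]
matches = words_containing("leukocyt", vocabulary)
print(matches)
```

Because this is a substring match, it also returns words that merely contain the stem somewhere in the middle, so a more aggressive stemmer yields a shorter stem and a broader (and noisier) result.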

I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.


Answer 1:


This solution requires preprocessing your corpus, but once that is done, finding a word's variants is a very quick dictionary lookup.

from collections import defaultdict
from stemming.porter2 import stem

# Load the word list, one word per line.
with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

# Map each stem to every word that reduces to it.
stems = defaultdict(list)

for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    # Look up all word forms that share the stem of the query word.
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])

For the /usr/share/dict/words corpus, this produces the result

['leukocyte', "leukocyte's", 'leukocytes']

It uses the stemming module that can be installed with

pip install stemming
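
Since the question asks for variants within some corpus rather than a system word list, the same stem-to-words index could also be built from running text. The following is a minimal sketch under stated assumptions: build_stem_index and toy_stem are hypothetical names, the corpus string is invented, and toy_stem is a deliberately crude suffix stripper used only so the example runs without extra packages. It is not Porter or Lovins; in practice a real stemmer such as stem from stemming.porter2 would be passed as stem_fn instead.

```python
import re
from collections import defaultdict

def toy_stem(word):
    """Stand-in stemmer that strips a few common suffixes (not Porter or Lovins)."""
    for suffix in ("'s", "es", "ic", "s", "e"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_stem_index(text, stem_fn):
    """Map each stem to the sorted distinct word forms found in `text`."""
    index = defaultdict(set)
    # Lowercase the text and pull out alphabetic tokens (with optional 's).
    for token in re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()):
        index[stem_fn(token)].add(token)
    return {s: sorted(forms) for s, forms in index.items()}

# Invented corpus, for illustration only.
corpus = ("Leukocytes were counted. Each leukocyte was stained; "
          "leukocytic infiltration was noted.")
index = build_stem_index(corpus, toy_stem)
variants = index.get(toy_stem("leukocyte"), [])
print(variants)
```

Collecting forms in a set means each variant is listed once no matter how often it occurs in the corpus, and the query word is stemmed with the same function used to build the index so that the lookup keys agree.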


Source: https://stackoverflow.com/questions/31596476/all-possible-wordform-completions-of-a-biomedical-words-stem
