I'm familiar with word stemming and completion from the tm package in R.
I'm trying to come up with a quick-and-dirty method for finding all variants of a given word within some corpus. For example, if my input is "leukocyte", I'd like to get "leukocytes" and "leukocytic".
If I had to do it right now, I would probably just go with something like:
library(tm)
library(RWeka)
data("crude")  # sample Reuters corpus shipped with tm
dictionary <- unique(unlist(lapply(crude, words)))
grep(pattern = LovinsStemmer("company"),
     x = dictionary, ignore.case = TRUE, value = TRUE)
I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.
I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.
This solution requires preprocessing your corpus once to group every word under its stem. After that, finding the variants of a word is a very quick dictionary lookup.
from collections import defaultdict
from stemming.porter2 import stem

# Preprocessing: read the corpus and group every word under its stem.
with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

stems = defaultdict(list)
for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    # Lookup: stem the query and print every word that shares that stem.
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])
For the /usr/share/dict/words corpus, this produces the result:
['leukocyte', "leukocyte's", 'leukocytes']
It uses the stemming module, which can be installed with:
pip install stemming
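The result above contains only words whose stem is exactly the same as the query's, so a form like "leukocytic" may be missed if the stemmer reduces it differently. One possible complement, closer to the grep idea in the question, is to treat the stem as a prefix and search a sorted vocabulary with bisect. This is only a minimal sketch: the function name words_with_prefix, the toy vocabulary, and the stem "leukocyt" passed in by hand are all assumptions made for illustration. In practice the stem would come from whichever stemmer you use and the vocabulary from your own corpus.

```python
from bisect import bisect_left


def words_with_prefix(sorted_words, prefix):
    """Return every word in the sorted list that starts with prefix."""
    # Binary search for the first position where a match could appear.
    i = bisect_left(sorted_words, prefix)
    matches = []
    # All words sharing the prefix sit next to each other in sorted order.
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        matches.append(sorted_words[i])
        i += 1
    return matches


# Toy vocabulary, invented for illustration; it must be sorted.
vocabulary = sorted([
    "leukemia", "leukocyte", "leukocytes", "leukocytic",
    "leukocytosis", "leukopenia", "lymphocyte",
])

# "leukocyt" is assumed here to be the stem of "leukocyte".
print(words_with_prefix(vocabulary, "leukocyt"))
```

Prefix matching is more permissive than comparing stems, so it can also return unrelated words that merely happen to begin with the same letters. Whether that trade-off is acceptable depends on the corpus.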
Source: https://stackoverflow.com/questions/31596476/all-possible-wordform-completions-of-a-biomedical-words-stem