Is it possible to get a natural word after it has been stemmed?

房东的猫 提交于 2019-12-24 04:41:29

问题


I have a word play which after stemming has become plai. Now I want to get play again. Is it possible? I have used Porter's Stemmer.


回答1:


Stemmer is able to process artificial non-existing words. Would you like them to be returned as elements of a set of all possible words? How do you know that the word doesn't exist and shouldn't be returned?

As an option: find a dictionary of all words and their forms. Find a stem for every of them. Save this projection as a map: ( stem, list of all word forms ). So you'll be able to get the list of all word forms for a given stem.

UPD: If you need all possible words including non-existing then I can offer such an algorithm (it's not checked, just a suggestion):

Porter stemming algorithm. We need a reversed version.

If the rule in straight algorithm has a form (m>1) E -> (delete last E) then the reversed rule would be "fork with E" which means we need to try alternative ways. E.g., in straight algorithm probate -> probat, in reversed we have two alternatives: probat -> { probat, probate }. Each of these alternatives should be separately processed further. Note that this is a set of alternatives, so we will process only distinct words. Such a rule would have the following form: A -> { , B, C }, which means "replace ending A in three alternative ways: leave as-is, with B and with C".

Step 5b: (m>1) *L -> { , +L } // Add L if there's L at the end.
Step 5a: (m>1) -> { , +E }
         (m=1 and not *o) -> { , +E } // *o is a special condition, it's not *O.
Step 4: (m>1) *S or *T -> { , +ION }
        (m>1) -> { , +AL, +ANCE, +ENCE, ..., +IVE, +IZE }
Step 3: (m>0) *AL -> { , +IZE }
        (m>0) *IC -> { , +ATE, +ITI, +AL }
        (m>0) -> { , +ATIVE, +FUL, +NESS }
Step 2: (m>0) *ATE -> { , ATIONAL } // Replace ATE.
        (m>0) *TION -> { , +AL } // Add AL at the end.
        (m>0) *ENCE -> { , ENCI } // Replace ENCE.
        ...
        (m>0) *BLE -> { , BILITI } // Replace BLE.
Step 1c: (*v*) *I -> { , Y } // Replace I.
Step 1b: (m=1 and *oE) -> { , +D, delete last E and add ING } // *o is a special condition.
         (*v*c and not (*L or *S or *Z)) -> { , add last consonant +ED, add last consonant + ING }
         *IZE -> { , IZING, +D }
         (*v*BLE) -> { , +D, delete last E and add ING }
         *ATE -> { , ATING, +D }
         (*v*) -> { , +ED, +ING }
         (m>0) *EE -> { , +D }
Step 1a: *I -> { , +ES }
         *SS -> { , +ES }
         not *S -> { , +S }

The straight algorithm had to choose first longest rule. The reversed algorithm should use all the rules.

Example (straight):

Input: PLAYING
Step 1a doesn't match.
PLAYING -> PLAY (Step 1b)
PLAY -> PLAI (Step 1c)
m=0, so the steps 2-5 don't match.
Result: PLAI

Reversed:

Input: PLAI
m=0, so the steps 2-5 are skipped
Step 1c:
PLAI -> { PLAI, PLAY }
Step 1b:
PLAI -> { PLAI, PLAIED, PLAIING }
PLAY -> { PLAY, PLAYED, PLAYING }
Resulting set: { PLAI, PLAIED, PLAIING, PLAY, PLAYED, PLAYING }
Step 1a:
PLAI -> { PLAI, PLAIS, PLAIES }
PLAIED -> { PLAIED, PLAIEDS }
PLAIING -> { PLAIING, PLAIINGS }
PLAY -> { PLAY, PLAYS }
PLAYED -> { PLAYED, PLAYEDS }
PLAYING -> { PLAYING, PLAYINGS }
Resulting set: { PLAI, PLAIS, PLAIES, PLAIED, PLAIEDS, PLAIING, PLAIINGS, PLAY, PLAYS, PLAYED, PLAYEDS, PLAYING, PLAYINGS }

I've checked all those words at Michael Tontchev's link. The result for every of them is "plai" (note that the site doesn't accept upper-case input).




回答2:


Clearly not. Many different words, after being stemmed, can become plai: including play and playing.

Try it here: http://9ol.es/porter_js_demo.html

So if, given plai, it could have come from either word, it's not deterministic. Or do you want to get the set of all possible words that stem to plai?

Update: Qualtagh mentions some good ideas.



来源:https://stackoverflow.com/questions/28439522/is-it-possible-to-get-a-natural-word-after-it-has-been-stemmed

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!