Snowball Stemming: defining Regions

Posted by 眉间皱痕 on 2020-01-03 21:09:32

Question


I'm trying to understand the Snowball stemming algorithm. The algorithm uses two regions, R1 and R2, which are defined as follows:

R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.

R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if there is no such non-vowel.

http://snowball.tartarus.org/texts/r1r2.html

Examples are

    b   e   a   u   t   i   f   u   l
                      |<------------->|    R1
                              |<----->|    R2

   b   e   a   u   t   y
                     |<->|    R1
                       ->|<-  R2

   a   n   i   m   a   d   v   e   r   s   i   o   n
        |<----------------------------------------->|    R1
                |<--------------------------------->|    R2

   s   p   r   i   n   k   l   e   d
                     |<------------->|    R1
                                   ->|<-  R2

    e   u   c   h   a   r   i   s   t
            |<--------------------->|    R1
                        |<--------->|    R2

My question is: why are "kled" in sprinkled and "harist" in eucharist defined as R1? I thought the correct results would be "inkled" and "arist".


Answer 1:


Read the definition again. It says:

R1 is the region after the first non-vowel following a vowel.

It does not say "the first non-vowel followed by a vowel".

In sprinkled, the first non-vowel following a vowel is n, so the region after is kled.

The same goes for eucharist: the first non-vowel following a vowel is c, so the region after it is harist.
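
To check the rule against the examples in the question, here is a minimal Python sketch of the two definitions. The function names and the vowel set (a, e, i, o, u, y) are my own assumptions for illustration. It covers only the basic R1/R2 definition quoted above, not any language-specific exceptions a full stemmer may add.

```python
# Assumption: a simple English vowel set that includes "y".
VOWELS = set("aeiouy")


def region_start(word, start=0):
    """Return the index just past the first non-vowel that follows a vowel,
    searching only at or after `start`. Return len(word) (the null region
    at the end of the word) if there is no such non-vowel."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return the (R1, R2) regions of `word` as strings."""
    r1 = region_start(word)       # apply the rule to the whole word
    r2 = region_start(word, r1)   # apply the same rule again inside R1
    return word[r1:], word[r2:]


for w in ["beautiful", "beauty", "animadversion", "sprinkled", "eucharist"]:
    print(w, r1_r2(w))
```

For sprinkled, the loop stops at n (a non-vowel whose predecessor i is a vowel), so R1 starts right after it and is kled. Inside kled the only non-vowel following a vowel is the final d, so R2 is empty.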



Source: https://stackoverflow.com/questions/31848056/snowball-stemming-defining-regions
