How to extract common / significant phrases from a series of text entries

迷失自我 2020-12-07 07:09

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not e

4 answers
  •  北海茫月
    2020-12-07 07:23

    Well, for a start you would probably have to remove all HTML tags (search for "<[^>]*>" and replace it with ""). After that, you could try the naive approach of looking for the longest common substrings between every pair of text items, but I don't think you'd get very good results. You might do better by first normalizing the words (reducing them to their base form, removing all accents, converting everything to lower or upper case) and then analysing the result. Again, depending on what you want to accomplish, you might be able to cluster the text items better if you allow for some flexibility in word order, i.e. treat the text items as bags of normalized words and measure how similar the contents of the bags are.
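
As a rough illustration of the steps above, here is a minimal Python sketch using only the standard library. The function names (`strip_tags`, `normalize`, `bag_similarity`) are made up for this example, and a few choices are assumptions rather than part of the suggestion: tags are replaced with a space instead of "" so that adjacent words do not get glued together, words are lower-cased and stripped of accents but not reduced to their base form (that would need a stemming library), and bag similarity is measured with a multiset Jaccard ratio, which is only one of several possible measures.

```python
import re
import unicodedata
from collections import Counter

# The tag pattern from the answer above.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Remove HTML tags; a space is substituted so neighbouring words stay apart."""
    return TAG_RE.sub(" ", html)


def normalize(text):
    """Lower-case the text, drop accents, and split it into a list of words."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: only ASCII letters and digits count as word characters.
    return re.findall(r"[a-z0-9]+", no_accents)


def bag_similarity(words_a, words_b):
    """Multiset Jaccard similarity between two bags of words (0.0 to 1.0)."""
    bag_a, bag_b = Counter(words_a), Counter(words_b)
    shared = sum((bag_a & bag_b).values())  # per-word minimum counts
    total = sum((bag_a | bag_b).values())   # per-word maximum counts
    return shared / total if total else 0.0


if __name__ == "__main__":
    first = normalize(strip_tags("<p>Café au <b>lait</b></p>"))
    second = normalize(strip_tags("<div>Lait au CAFE</div>"))
    print(first, second, bag_similarity(first, second))
```

Because the bags ignore word order, the two sample entries in the `__main__` block count as identical even though their words are arranged differently. Pairwise scores like this could then be fed into whatever clustering step you prefer.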

    I've commented on a similar (although not identical) topic here.
