How to cluster search engine keywords?

Submitted by 泪湿孤枕 on 2019-12-09 18:34:09

Question


From Google Analytics I have a (long) list of keywords that people used in search engines to find my website. I want to find the 'core keywords', hypothetical example:

java online training
learning java
scala training
training for java
online training java
learn scala programming

The ideal result would be: 'java', 'online training', 'training', 'scala' and 'learn'.

The difficulty seems to be detecting complete phrases, ignoring common words such as 'for', and handling variations such as 'learn' and 'learning'.

Is there a library that can do that (preferably for JVM)? Or is there a suitable algorithm I can implement myself?


Answer 1:


This is a term (or keyword) extraction problem. A quick search turned up KEA, which looks to be very much what you want.

You can implement a naive solution with the following algorithm:

  • generate a list of n-grams from the document, up to the phrase length that you want (choose an arbitrary phrase length limit, like 3 or 4)
  • put each n-gram into a Multiset
  • iterate over the entries of the multiset in descending order of count, perhaps with an arbitrary cutoff
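
The steps above can be sketched in Java roughly as follows. This is a minimal illustration, not a tuned implementation: it assumes each search query is treated as one short document, uses a plain `HashMap` in place of a Multiset to avoid a library dependency, and the names `NgramCounter`, `countNgrams`, and `topNgrams` are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NgramCounter {

    // Counts every word n-gram of length 1..maxLen across all queries.
    // A HashMap from n-gram to count stands in for the multiset.
    static Map<String, Integer> countNgrams(List<String> queries, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (String query : queries) {
            String normalized = query.toLowerCase().trim();
            if (normalized.isEmpty()) continue; // skip blank queries
            String[] words = normalized.split("\\s+");
            for (int start = 0; start < words.length; start++) {
                StringBuilder gram = new StringBuilder();
                // Grow the n-gram one word at a time, counting each length.
                for (int len = 1; len <= maxLen && start + len <= words.length; len++) {
                    if (len > 1) gram.append(' ');
                    gram.append(words[start + len - 1]);
                    counts.merge(gram.toString(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns the n-grams seen at least minCount times, most frequent first
    // (ties broken alphabetically so the order is deterministic).
    static List<Map.Entry<String, Integer>> topNgrams(Map<String, Integer> counts, int minCount) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) entries.add(e);
        }
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        // The example queries from the question.
        List<String> queries = List.of(
            "java online training", "learning java", "scala training",
            "training for java", "online training java", "learn scala programming");
        for (Map.Entry<String, Integer> e : topNgrams(countNgrams(queries, 3), 2)) {
            System.out.println(e.getValue() + "  " + e.getKey());
        }
    }
}
```

With a cutoff of 2, the question's six queries leave 'java', 'training', 'online', 'online training', and 'scala'. Note that 'learn' and 'learning' are still counted as separate words, and 'for' is counted like any other word; the sketch does no stemming or stop-word handling.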

As you said, stop words will be a problem for this approach. A simple fix is a dictionary of stop words; alternatively, Term Frequency-Inverse Document Frequency (TF-IDF) weighting can help you recognize very frequent terms automatically. KEA will do this for you, so it might be best to look into that first.
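
Both ideas can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, the class and method names (`StopwordFilter`, `hasStopwordEdge`, `idf`) are hypothetical, and the IDF part assumes you have some reference collection of documents to compute frequencies from. It shows only the inverse document frequency half of TF-IDF and is not how KEA itself is implemented.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class StopwordFilter {

    // Tiny illustrative stop-word list; a real one would be much longer.
    static final Set<String> STOPWORDS =
        Set.of("for", "the", "a", "an", "of", "to", "in", "and");

    // True if the n-gram starts or ends with a stop word (e.g. "training for"),
    // which makes it a poor candidate for a core keyword.
    static boolean hasStopwordEdge(String ngram) {
        String[] words = ngram.toLowerCase().trim().split("\\s+");
        return STOPWORDS.contains(words[0])
            || STOPWORDS.contains(words[words.length - 1]);
    }

    // Inverse document frequency: log(N / df), where N is the number of
    // documents and df is how many of them contain the word. A word that
    // appears in every document scores 0; rarer words score higher.
    static Map<String, Double> idf(List<String> documents) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : documents) {
            // Use a set so each word counts at most once per document.
            Set<String> unique = new HashSet<>();
            for (String word : doc.toLowerCase().trim().split("\\s+")) {
                if (!word.isEmpty()) unique.add(word);
            }
            for (String word : unique) {
                docFreq.merge(word, 1, Integer::sum);
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            scores.put(e.getKey(), Math.log((double) documents.size() / e.getValue()));
        }
        return scores;
    }
}
```

The dictionary check is the simpler of the two and needs no extra data. The IDF scores are only as good as the reference collection: computed over the search queries alone, a core keyword that appears in most queries would also score low, so a broader background corpus is the safer assumption.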

Hope that helps!



Source: https://stackoverflow.com/questions/4617023/how-to-cluster-search-engine-keywords
