How to select stop words using tf-idf? (non-English corpus)

戏子无情 submitted on 2020-01-11 20:01:10

Question


I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.


Answer 1:


Stop words are words that appear so commonly across the documents that they lose their representativeness. The best way to find them is to count the number of documents each term appears in (its document frequency, df) and filter out the terms that appear in more than, say, 50% of the documents, or the 500 most frequent terms, or whatever other threshold works once you have tuned it.
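A minimal sketch of that document-frequency filter, using only the standard library. The function names, the toy Spanish corpus and the `max_ratio` parameter are illustrative assumptions, and the documents are assumed to be tokenized already:

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): a term counts once per document
    return df

def stop_words_by_ratio(docs, max_ratio=0.5):
    """Terms that appear in more than max_ratio of the documents."""
    df = document_frequencies(docs)
    n = len(docs)
    return {term for term, count in df.items() if count / n > max_ratio}

def stop_words_top_k(docs, k):
    """The k terms with the highest document frequency."""
    return [term for term, _ in document_frequencies(docs).most_common(k)]

# Toy, made-up corpus: each document is a list of tokens.
docs = [
    ["el", "gato", "come", "pescado"],
    ["el", "perro", "come", "carne"],
    ["el", "ave", "canta"],
    ["un", "gato", "duerme"],
]

by_ratio = stop_words_by_ratio(docs, max_ratio=0.5)
top_one = stop_words_top_k(docs, k=1)
print(by_ratio, top_one)
```

Here "el" appears in 3 of the 4 documents, so it is the only term above the 50% cut-off. Both the ratio and `k` are thresholds to tune on a real corpus.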

The best (that is, most representative) terms in a document are those with the highest tf-idf, because those terms are common in the document while being rare in the collection.
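As a rough illustration of picking the highest tf-idf terms per document, here is a sketch that assumes the plain weighting tf × log(N / df) with raw counts as tf. Other tf-idf variants (smoothing, normalization) would give different numbers, and the helper names and toy corpus are made up for the example:

```python
import math
from collections import Counter

def tfidf_per_document(docs):
    """Return one {term: tf-idf} dict per document, with idf = log(N / df)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights

def best_terms(weights, k=3):
    """Top-k terms per document, highest tf-idf first (ties broken alphabetically)."""
    return [sorted(w, key=lambda t: (-w[t], t))[:k] for w in weights]

# Toy, made-up corpus of pre-tokenized documents.
docs = [
    ["el", "gato", "come", "pescado", "pescado"],
    ["el", "perro", "come", "carne"],
    ["el", "gato", "duerme"],
]

weights = tfidf_per_document(docs)
top = best_terms(weights, k=1)
print(top)
```

In the first document "pescado" wins because it is frequent there and absent elsewhere, while "el" gets a weight of zero because it occurs in every document.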

As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop words) produce a very low tf-idf anyway. However, they still affect some computations, which is wrong if you assume they are pure noise (which might not be true, depending on the task). In addition, if they are included your algorithm will be slightly slower.

Edit: As @FelipeHammel says, you can use the IDF directly (remember to invert the order), since it decreases as df increases. This is completely equivalent for ranking purposes, and therefore for selecting the top "k" terms. It cannot be used as-is to select based on ratios (e.g., words that appear in more than 50% of the documents), but a simple threshold fixes that (i.e., selecting terms with an idf lower than a specific value). In general, a fixed number of terms is used.
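To make the idf variant concrete, the sketch below assumes the unsmoothed definition idf = log(N / df). Under that assumption, "appears in more than a fraction r of the documents" is the same condition as "idf < log(1 / r)"; a smoothed idf would need a different cut-off. The function names and toy corpus are illustrative only:

```python
import math
from collections import Counter

def idf_table(docs):
    """Unsmoothed idf = log(N / df) for every term in the corpus."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log(n / c) for t, c in df.items()}

def lowest_idf_terms(idf, k):
    """The k terms with the lowest idf, i.e. the highest document frequency."""
    return sorted(idf, key=lambda t: (idf[t], t))[:k]

def terms_below_idf(idf, max_ratio):
    """Terms whose idf is below log(1 / max_ratio), i.e. df / N > max_ratio."""
    threshold = math.log(1 / max_ratio)
    return {t for t, value in idf.items() if value < threshold}

# Toy, made-up corpus of pre-tokenized documents.
docs = [
    ["el", "gato", "come", "pescado"],
    ["el", "perro", "come", "carne"],
    ["el", "ave", "canta"],
    ["un", "gato", "duerme"],
]

idf = idf_table(docs)
top_one = lowest_idf_terms(idf, k=1)
above_half = terms_below_idf(idf, max_ratio=0.5)
print(top_one, above_half)
```

Sorting by ascending idf gives the same order as sorting by descending df, which is why the ordering has to be inverted.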

I hope this helps.




Answer 2:


From the book "Introduction to Information Retrieval":

tf-idf assigns to term t a weight in document d that is

  1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
  2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
  3. lowest when the term occurs in virtually all documents.

So the words with the lowest tf-idf can be considered stop words.
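Since tf-idf is defined per term and per document, turning it into a single stop-word list needs some aggregation across documents. The sketch below takes each term's maximum tf-idf over the corpus, which is an assumption made for this example and not something the book or the answer prescribes; it also assumes tf × log(N / df) with raw counts:

```python
import math
from collections import Counter

def max_tfidf_per_term(docs):
    """For each term, the largest tf-idf it reaches in any document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    best = {}
    for tokens in docs:
        for term, count in Counter(tokens).items():
            weight = count * math.log(n / df[term])
            best[term] = max(best.get(term, 0.0), weight)
    return best

# Toy, made-up corpus of pre-tokenized documents.
docs = [
    ["el", "gato", "come", "pescado", "pescado"],
    ["el", "perro", "come", "carne"],
    ["el", "gato", "duerme"],
]

scores = max_tfidf_per_term(docs)
# Candidate stop words: terms whose best tf-idf anywhere is still the lowest.
candidates = sorted(scores, key=lambda t: (scores[t], t))[:1]
print(candidates)
```

A term that occurs in every document, such as "el" here, scores exactly zero in all of them, which matches case 3 in the list above.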



Source: https://stackoverflow.com/questions/16927494/how-to-select-stop-words-using-tf-idf-non-english-corpus
