How to auto-tag content, algorithms and suggestions needed

我在风中等你 2020-12-22 18:51

I am working with some really large databases of newspaper articles. I have them in a MySQL database, and I can query them all.

I am now searching for ways to help m

8 Answers

谎友^ (OP)

2020-12-22 19:05
You should use a metric such as tf-idf to get the tags out:
1. Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
2. Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
3. Divide tf by the log of df plus one: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
4. For each document, declare the top k terms by their tf-idf score to be the tags for that document.
Various implementations of tf-idf are available: for Java and .NET there's Lucene, and for Python there's scikit-learn (formerly scikits.learn).
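
To make the four steps concrete, here is a minimal pure-Python sketch of them. It uses the weighting given above, tf(t, D) / log(df(t) + 1), rather than the variants that libraries such as Lucene or scikit-learn implement. The function name `tfidf_tags`, the whitespace tokenizer, and the sample documents are assumptions made up for illustration; real newspaper text would need proper tokenization and probably stop-word removal.

```python
import math
from collections import Counter


def tfidf_tags(docs, k=3):
    """Return the top-k terms of each document, scored as tf / log(df + 1)."""
    # Step 1: term frequency per document (naive lowercase whitespace split).
    tfs = [Counter(doc.lower().split()) for doc in docs]

    # Step 2: document frequency, i.e. in how many documents each term occurs.
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())

    tags = []
    for tf in tfs:
        # Step 3: score every term; df >= 1 here, so log(df + 1) > 0.
        scores = {t: count / math.log(df[t] + 1) for t, count in tf.items()}
        # Step 4: keep the k best terms (ties broken alphabetically).
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        tags.append(ranked[:k])
    return tags


if __name__ == "__main__":
    sample = [
        "budget budget budget vote the",
        "football match football the",
        "the election vote",
    ]
    for doc, doc_tags in zip(sample, tfidf_tags(sample, k=2)):
        print(doc_tags, "<-", doc)
```

On a corpus stored in MySQL you would probably compute df once over all articles and cache it, since only the tf part changes from article to article.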

If you want to do better than this, use language models. That requires some knowledge of probability theory.