How to auto-tag content, algorithms and suggestions needed

我在风中等你 2020-12-22 18:51

I am working with some really large collections of newspaper articles. They are stored in a MySQL database, and I can query all of them.

I am now searching for ways to help m

8 answers
  •  谎友^ (original poster)
    2020-12-22 19:05

    You should use a metric such as tf-idf to get the tags out:

    1. Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
    2. Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
    3. Divide tf by the log of df plus one: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
    4. For each document, declare the top k terms by their tf-idf score to be the tags for that document.
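
    The four steps above can be sketched in plain Python. This is only a minimal illustration, not a tuned implementation: the tokenizer, the tiny stop-word set, the sample articles, and the names tokenize and top_tags are all assumptions made up for the example. The scoring uses the formula from step 3 as written.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word set; a real one would be much larger.
STOP_WORDS = {"the", "a"}


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic runs, minus stop words.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_tags(documents, k=3):
    """Return the k highest-scoring terms of each document as its tags."""
    # Step 1: term frequency per document, tf(t, D).
    term_counts = [Counter(tokenize(doc)) for doc in documents]

    # Step 2: document frequency per term, df(t).
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    tags = []
    for counts in term_counts:
        # Step 3: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
        scores = {t: tf / math.log(doc_freq[t] + 1) for t, tf in counts.items()}
        # Step 4: keep the top k terms; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        tags.append(ranked[:k])
    return tags


# Made-up sample articles, standing in for rows fetched from the database.
articles = [
    "The council approved the budget. The budget funds road repairs.",
    "The football team won the cup. The cup final drew a record crowd.",
    "The council debated the football stadium budget.",
]
print(top_tags(articles, k=2))
```

    Terms that appear in several articles, such as "council" or "football", are pushed down the ranking, while terms repeated within a single article, such as "cup", rise to the top. On a real corpus you would probably also want stemming and a proper stop-word list.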

    Various implementations of tf-idf are available: for Java and .NET, there's Lucene; for Python, there's scikit-learn (formerly scikits.learn).

    If you want to do better than this, use language models. That requires some knowledge of probability theory.
