How to auto-tag content, algorithms and suggestions needed

后端 未结 8 1843
我在风中等你
我在风中等你 2020-12-22 18:51

I am working with some really large databases of newspaper articles, I have them in a MySQL database, and I can query them all.

I am now searching for ways to help m

8条回答
  •  执笔经年
    2020-12-22 19:05

    If I understand your question correctly, you'd like to group the articles into similarity classes. For example, you might assign article 1 to 'Sports', article 2 to 'Politics', and so on. Or if your classes are much finer-grained, the same articles might be assigned to 'Dallas Mavericks' and 'GOP Presidential Race'.

    This falls under the general category of 'clustering' algorithms. There are many possible choices of such algorithms, but this is an active area of research (meaning it is not a solved problem, and thus none of the algorithms are likely to perform quite as well as you'd like).

    I'd recommend you look at Latent Direchlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) or 'LDA'. I don't have personal experience with any of the LDA implementations available, so I can't recommend a specific system (perhaps others more knowledgeable than I might be able to recommend a user-friendly implementation).

    You might also consider the agglomerative clustering implementations available in LingPipe (see http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html), although I suspect an LDA implementation might prove somewhat more reliable.

    A couple questions to consider while you're looking at clustering systems:

    • Do you want to allow fractional class membership - e.g. consider an article discussing the economic outlook and its potential effect on the presidential race; can that document belong partly to the 'economy' cluster and partly to the 'election' cluster? Some clustering algorithms allow partial class assignment and some do not

    • Do you want to create a set of classes manually (i.e., list out 'economy', 'sports', ...), or do you prefer to learn the set of classes from the data? Manual class labels may require more supervision (manual intervention), but if you choose to learn from the data, the 'labels' will likely not be meaningful to a human (e.g., class 1, class 2, etc.), and even the contents of the classes may not be terribly informative. That is, the learning algorithm will find similarities and cluster documents it considers similar, but the resulting clusters may not match your idea of what a 'good' class should contain.

提交回复
热议问题