Unsupervised automatic tagging algorithms?

Posted by 牧云@^-^@ on 2019-12-02 14:07:12

The most common unsupervised machine learning model for this type of task is Latent Dirichlet Allocation (LDA). This model automatically infers a collection of topics over a corpus of documents based on the words in those documents. Running LDA on your set of documents gives each word a probability under each topic and each document a distribution over topics. When you search for a word, you can then retrieve the documents that put the highest probability on the topics that word belongs to.
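To make the idea concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. It is an illustration rather than a production implementation: the function names (`lda_gibbs`, `top_words`), the toy documents, and the hyperparameter values are all assumptions made for this example, and a real project would normally use an existing library instead.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative sketch).

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab),
    where doc_topic[d][k] is P(topic k | doc d) and
    topic_word[k][i] is P(vocab[i] | topic k).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic_counts[d][z] += 1
            topic_word_counts[z][word_id[w]] += 1
            topic_totals[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][z] -= 1
                topic_word_counts[z][wid] -= 1
                topic_totals[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic_counts[d][k] + alpha)
                    * (topic_word_counts[k][wid] + beta)
                    / (topic_totals[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic_counts[d][z] += 1
                topic_word_counts[z][wid] += 1
                topic_totals[z] += 1

    # Turn the final counts into smoothed probability estimates.
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic_counts, docs)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[k] + vocab_size * beta)
         for c in topic_word_counts[k]]
        for k in range(num_topics)
    ]
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, topic, n=3):
    """Return the n most probable words of a topic, usable as candidate tags."""
    ranked = sorted(range(len(vocab)),
                    key=lambda i: topic_word[topic][i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


# Toy corpus, made up for this example.
docs = [
    "cat dog pet vet cat dog".split(),
    "dog pet leash walk dog".split(),
    "stock market trade price stock".split(),
    "market price bank trade bank".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for d, dist in enumerate(doc_topic):
    best = max(range(len(dist)), key=dist.__getitem__)
    print(d, [round(p, 2) for p in dist], top_words(topic_word, vocab, best))
```

Each document's dominant topic and that topic's most probable words can serve as candidate tags. The number of topics has to be chosen up front, and the topics themselves are unlabeled word distributions, so some manual inspection is usually needed before using them as tags.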

There have been some extensions to images and music as well; see http://cseweb.ucsd.edu/~dhu/docs/research_exam09.pdf.

LDA has efficient implementations in several languages.

U Avalos

These guys propose an alternative to LDA.

Automatic Tag Recommendation Algorithms for Social Recommender Systems http://research.microsoft.com/pubs/79896/tagging.pdf

I haven't read through the whole paper, but they have two algorithms:

  1. Supervised learning version. This isn't that bad. You can use Wikipedia to train the algorithm.
  2. "Prototype" version. I haven't had a chance to go through this, but this is the one they recommend.

UPDATE: I've researched this some more and found another approach. Basically, it's a two-stage approach that's very simple to understand and implement. While too slow for hundreds of thousands of documents, it (probably) runs fast enough for thousands of docs (so it's perfect for tagging a single user's documents). I'm going to try this approach and will report back on performance/usability.

In the meantime, here's the approach:

  1. Use TextRank as per http://qr.ae/36RAP to generate a tag list for a single document, independent of other documents.
  2. Use the algorithm from "Using Machine Learning to Support Continuous Ontology Development" (https://www.researchgate.net/publication/221630712_Using_Machine_Learning_to_Support_Continuous_Ontology_Development) to integrate the tag list (from step 1) into the existing tag list.
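Step 1 can be sketched roughly as follows. This is a simplified TextRank-style keyword ranker in plain Python, not the exact method from the linked post: words that co-occur within a small window are linked in a graph, and a PageRank-style iteration scores them. The function name `textrank_keywords`, the tiny stopword list (standing in for the part-of-speech filter that TextRank normally uses), and the parameter values are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real system would use a fuller list
# or a part-of-speech filter that keeps only nouns and adjectives.
STOPWORDS = {
    "the", "and", "for", "from", "with", "that", "this", "each", "how",
    "are", "was", "other", "into", "can", "its", "not", "but", "has",
}


def textrank_keywords(text, top_n=5, window=3, damping=0.85, iterations=50):
    """Rank the words of one document with a TextRank-style graph walk."""
    words = [
        w for w in re.findall(r"[a-z]+", text.lower())
        if len(w) > 2 and w not in STOPWORDS
    ]

    # Link each word to the words that follow it inside the window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)

    # PageRank-style iteration on the undirected co-occurrence graph.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w]
            )
            for w in neighbors
        }

    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Sample text, made up for this example.
sample = (
    "Automatic tagging assigns keywords to documents. A tagging system "
    "ranks candidate keywords by how strongly each keyword is connected "
    "to other words in the document."
)
tags = textrank_keywords(sample, top_n=3)
print(tags)
```

Because the ranking only looks at one document at a time, the resulting tags still need to be reconciled with the existing tag list, which is what step 2 is for.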

Text documents can be tagged using the Kea keyphrase extraction algorithm/package (http://www.nzdl.org/Kea/). Currently it supports only a limited range of document types (agricultural and medical, I guess), but you can train it according to your requirements.

I'm not sure how the image/video part would work out, unless you're doing very accurate object detection (which has its own shortcomings). How are you planning to do it?

I posted a blog article today to answer your question.

http://scottge.net/2015/06/30/automatic-image-and-video-tagging/

There are basically two approaches to automatically extract keywords from images and videos.

  1. Multiple Instance Learning (MIL)
  2. Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants

In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo sites and source code.

Thanks, Scott
