LSA - Latent Semantic Analysis - How to code it in PHP?

青春惊慌失措 2020-12-05 03:56

I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.

Here is what I think I have to do. Is this correct?

4 Answers
  •  甜味超标
    2020-12-05 03:57

    This answer doesn't directly address the poster's question, but rather the meta-question of how to autotag news items. The OP mentions Named Entity Recognition, but I believe they mean something more along the lines of autotagging. If they really mean NER, then this response is hogwash :)

    Given these constraints (600 items / day, 100-200 characters / item) with divergent sources, here are some tagging options:

    1. By hand. An analyst could easily do 600 of these per day, probably in a couple of hours. Something like Amazon's Mechanical Turk, or making users do it, might also be feasible. Having some number of hand-tagged items, even if it's only 50 or 100, will be a good basis for evaluating whatever the automated methods below get you.

    2. Dimensionality reduction, using LSA, topic models (Latent Dirichlet Allocation), and the like. I've had really poor luck with LSA on real-world data sets and I'm unsatisfied with its statistical basis. I find LDA much better, and it has an incredible mailing list with the best thinking on how to assign topics to texts.

    3. Simple heuristics. If you have actual news items, then exploit the structure of the news item. Focus on the first sentence, toss out all the common words (stop words) and select the best 3 nouns from the first two sentences. Or heck, take all the nouns in the first sentence, and see where that gets you. If the texts are all in English, then do part-of-speech analysis on the whole shebang, and see what that gets you. With structured items, like news reports, LSA and other order-independent methods (tf-idf) throw out a lot of information.
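
To make the heuristic in option 3 concrete, here is a rough sketch, written in Python for brevity (the same logic should port to PHP arrays and `preg_*` functions without much trouble). It is only an approximation of the idea: it does no part-of-speech tagging, so it ranks non-stop-words by frequency instead of picking nouns. The stop-word list, the function names `split_sentences` and `suggest_tags`, and the parameters `max_tags` and `lead_sentences` are all made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real one would be much longer).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "have", "in", "is", "it", "its", "of", "on", "or", "said", "that", "the",
    "to", "was", "were", "will", "with",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def suggest_tags(text, max_tags=3, lead_sentences=2):
    """Return up to max_tags candidate tags taken from the leading sentences.

    Words are ranked by frequency, with ties broken by first appearance.
    No part-of-speech tagging is done, so non-nouns can slip through.
    """
    lead = " ".join(split_sentences(text)[:lead_sentences])
    # Words of two or more characters, lowercased.
    words = [w.lower() for w in re.findall(r"[A-Za-z][A-Za-z'-]+", lead)]
    candidates = [w for w in words if w not in STOP_WORDS]

    counts = Counter(candidates)
    first_seen = {}
    for position, word in enumerate(candidates):
        first_seen.setdefault(word, position)

    ranked = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
    return ranked[:max_tags]
```

Calling `suggest_tags` on a short news item returns a list of lowercase words drawn only from its first two sentences. Verbs and adjectives will show up among the results, which is exactly where a part-of-speech tagger would help, and comparing the output against the hand-tagged sample from option 1 is a cheap way to see whether the heuristic is good enough.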

    Good luck!

    (if you like this answer, maybe retag the question to fit it)
