I would like to take a word (e.g. "Apple") and process a text (or maybe several) to come up with related terms. For example: process a document for "Apple" and find that iPod, iPhone, and Mac are terms related to "Apple".
Any idea on how to solve this?
As a starting point: your question relates to text mining.
There are two ways: a statistical approach, and one from natural language processing (NLP).
I do not know much about NLP, but I can say something about the statistical approach:
You need some vector space representation of your documents, see http://en.wikipedia.org/wiki/Vector_space_model http://en.wikipedia.org/wiki/Document-term_matrix http://en.wikipedia.org/wiki/Tf%E2%80%93idf
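As a rough illustration of the vector space idea, here is a minimal sketch using only the Python standard library. It builds a tf-idf weighted term-document matrix and compares terms by the cosine of their rows. The tokenizer, the plain `log(N/df)` weighting, and the function names (`tfidf_rows`, `related_terms`) are assumptions made for this example, not part of any particular library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, keep alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rows(docs):
    """Map each term to its row of tf-idf weights, one entry per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = set().union(*counts)
    n = len(docs)
    rows = {}
    for term in vocab:
        df = sum(1 for c in counts if term in c)  # document frequency
        idf = math.log(n / df)  # terms found in every document get weight 0
        rows[term] = [c[term] * idf for c in counts]
    return rows


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_terms(rows, word, top=5):
    """Rank the other terms by cosine similarity to `word`'s row."""
    target = rows[word]
    scored = [(cosine(target, vec), t) for t, vec in rows.items() if t != word]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(t, round(s, 3)) for s, t in scored[:top]]


if __name__ == "__main__":
    docs = [
        "Apple released the iPhone and the iPod",
        "The iPhone is sold by Apple",
        "Bananas and oranges are fruit",
        "The Mac is a computer made by Apple",
    ]
    print(related_terms(tfidf_rows(docs), "apple"))
```

On a toy corpus like this, function words such as "the" score as high as real product names, so a stop-word list and a much larger corpus would be needed before the ranking means anything.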
In order to learn semantics (that is, that different words can mean the same thing, or that one word can have different meanings), you need a large text corpus to learn from. As I said, this is a statistical approach, so you need lots of samples. http://www.daviddlewis.com/resources/testcollections/
Maybe you have lots of documents from the context you are going to use. That is the best situation.
You have to retrieve latent factors from this corpus. Most common are:
- LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis)
- PLSA (http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis)
- nonnegative matrix factorization (http://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
- latent dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
These methods involve lots of math. Either you dig it, or you have to find good libraries.
I can recommend the following books:
Like all of AI, it's a very difficult problem. You should look into natural language processing to learn about some of the issues.
One very, very simplistic approach is to build a 2D table of words that stores, for each pair of words, the average distance (in words) at which they appear together in the text. Obviously you'll need to limit the maximum distance considered, and possibly the number of words as well. Then, after processing a lot of text, you'll have an indicator of how often certain words appear in the same context.
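A minimal sketch of such a table, assuming a naive regex tokenizer and a hypothetical `max_distance` cutoff. It stores the table sparsely, as a dictionary keyed by word pair, and keeps the pair count next to the average distance, since an average over one or two sightings says very little.

```python
import re
from collections import defaultdict


def average_distances(text, max_distance=5):
    """Return {(word_a, word_b): (average distance, pair count)}.

    Only pairs at most `max_distance` words apart are counted.
    Pairs are stored with the two words in alphabetical order.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    totals = defaultdict(int)
    counts = defaultdict(int)
    for i, word in enumerate(words):
        # Look ahead only, so each co-occurrence is counted once.
        for j in range(i + 1, min(i + 1 + max_distance, len(words))):
            other = words[j]
            if other == word:
                continue
            pair = tuple(sorted((word, other)))
            totals[pair] += j - i
            counts[pair] += 1
    return {pair: (totals[pair] / counts[pair], counts[pair]) for pair in totals}


if __name__ == "__main__":
    table = average_distances("Apple makes the iPhone. Apple sells the iPhone.", 3)
    print(table[("apple", "iphone")])
```

To process many texts, the `totals` and `counts` dictionaries would have to be accumulated across calls rather than rebuilt each time.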
What I would do is get all the words in a text and make a frequency list (how often each word appears). Maybe also add a heuristic factor for how far each word is from "Apple". Then read multiple documents and cross out words that are not common to all of them. Then prioritize the remaining words based on frequency and distance from the keyword. Of course, you will get a lot of garbage and possibly miss some relevant words, but by adjusting the heuristics you should get at least some decent matches.
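Here is one way that heuristic could be sketched. The specific choices are assumptions for illustration: each occurrence of a word contributes `1 / distance` to its nearest keyword occurrence, words farther than `max_distance` are ignored, and `min_docs` (defaulting to all documents) controls how many documents a word must appear in to survive.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, keep alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def score_document(words, keyword, max_distance=10):
    """Score each word by frequency, weighted by closeness to the keyword."""
    positions = [i for i, w in enumerate(words) if w == keyword]
    scores = Counter()
    if not positions:
        return scores
    for i, word in enumerate(words):
        if word == keyword:
            continue
        nearest = min(abs(i - p) for p in positions)
        if nearest <= max_distance:
            scores[word] += 1.0 / nearest  # closer occurrences count more
    return scores


def related_words(docs, keyword, min_docs=None, top=10):
    """Combine per-document scores; drop words seen in too few documents."""
    keyword = keyword.lower()
    if min_docs is None:
        min_docs = len(docs)  # strict reading: the word must occur in every document
    total = Counter()
    doc_count = Counter()
    for doc in docs:
        scores = score_document(tokenize(doc), keyword)
        total.update(scores)
        doc_count.update(scores.keys())
    kept = {w: s for w, s in total.items() if doc_count[w] >= min_docs}
    return sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


if __name__ == "__main__":
    docs = [
        "Apple iPhone launch today",
        "new Apple iPhone review",
        "Apple pie recipe",
    ]
    print(related_words(docs, "Apple", min_docs=2))
```

Requiring a word to appear in every document is quite strict on real data, so relaxing `min_docs` and filtering stop words are the first knobs to try.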
The technique that you are looking for is called Latent Semantic Analysis (LSA). It is also sometimes called Latent Semantic Indexing. The technique operates on the idea that related concepts occur together in text, and it uses statistics to build the word relationships. Given a large enough corpus of documents, it should solve your problem of finding related words.
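For reference, the core of LSA can be written down briefly. The notation below is a common convention and an assumption of this note: $X$ is the term-document matrix with one row per term, and $k$ is the number of latent dimensions kept.

```latex
% Truncated singular value decomposition of the term-document matrix
X \approx X_k = U_k \Sigma_k V_k^{\top}

% Term i is represented by row i of U_k \Sigma_k
\mathbf{t}_i = \left( U_k \Sigma_k \right)_{i,:}

% Relatedness of terms i and j: cosine similarity in the reduced space
\operatorname{sim}(i, j) =
  \frac{\mathbf{t}_i \cdot \mathbf{t}_j}
       {\lVert \mathbf{t}_i \rVert \, \lVert \mathbf{t}_j \rVert}
```

Terms related to "Apple" would then be the ones whose vectors have the highest cosine similarity to the "Apple" vector.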
Take a look at vector space models.
Source: https://stackoverflow.com/questions/7544266/algorithm-to-find-related-words-in-a-text