information-retrieval

Information retrieval (IR) vs data mining vs Machine Learning (ML)

魔方 西西 submitted on 2019-11-28 16:31:03

Question: People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them. From people with experience in these fields, what exactly draws the line between them?

Answer 1: This is just the view of one person (formally trained in ML); others might see things quite differently. Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied: it's limited to the pattern-extraction (or pattern-matching) algorithms…

What tried and true algorithms for suggesting related articles are out there?

爱⌒轻易说出口 submitted on 2019-11-28 03:04:48

Pretty common situation, I'd wager. You have a blog or news site with plenty of articles (or blags, or whatever you call them), and at the bottom of each you want to suggest others that seem related. Assume very little metadata about each item: no tags, no categories. Treat each item as one big blob of text, including the title and author name. How do you go about finding the possibly related documents? I'm rather interested in the actual algorithm, not ready-made solutions, although I'd be OK with taking a look at something implemented in Ruby or Python, or relying on MySQL or PostgreSQL…
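A common baseline for this is to represent each article as a TF-IDF vector and rank the other articles by cosine similarity. A minimal sketch, assuming scikit-learn is available (the library choice and the sample texts are illustrative, not from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: each entry is one article's title + body + author,
# treated as a single blob of text, as the question suggests.
docs = [
    "How to tune a guitar by ear",
    "Guitar tuning apps compared and reviewed",
    "Baking sourdough bread at home",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs)        # (n_docs, n_terms) sparse matrix

sims = cosine_similarity(matrix)          # pairwise document similarities

# Most related article to docs[0], excluding docs[0] itself.
best = max((i for i in range(len(docs)) if i != 0), key=lambda i: sims[0, i])
print(docs[best])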

Image retrieval system by colour from the web using C++ with openFrameworks

大憨熊 submitted on 2019-11-28 00:34:50

I am writing a program in C++ and openFrameworks that should hopefully implement an image retrieval system by colour matching. I have an algorithm to find the match in a database by an RGB value. For example, if I have a database of 1000 pictures on my computer and a query RGB value of 255,0,0, the program looks through the 1000 pictures and finds the closest match. However, my problem is that I want it to also look for the match on the web. I have been trying to find out how to get images from websites; however, if you don't know the specific URL of the image it's hard to get hold of the…
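For the local-database part, the "closest match" step usually reduces to a nearest-neighbour search in colour space. A minimal sketch (in Python rather than the question's C++, purely to show the distance metric; the names and data are hypothetical):

def colour_distance(c1, c2):
    # Squared Euclidean distance in RGB space; the square root is
    # unnecessary when we only need to rank candidates.
    return sum((a - b) ** 2 for a, b in zip(c1, c2))

def closest_match(query_rgb, database):
    # database: list of (image_path, average_rgb) pairs
    return min(database, key=lambda item: colour_distance(query_rgb, item[1]))

db = [("red.jpg", (250, 10, 5)), ("sky.jpg", (120, 180, 255))]
print(closest_match((255, 0, 0), db))     # -> ('red.jpg', (250, 10, 5))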

What is the best way to compute trending topics or tags?

家住魔仙堡 submitted on 2019-11-27 16:33:36

Many sites offer statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its "News Trends" section, where you can see the topics with the fastest-growing number of mentions. I want to compute such a "buzz" score for a topic, too. How could I do this? The algorithm should give less weight to topics that are always hot; topics that normally (almost) no one mentions should become the hottest ones when they spike. Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends"; all these services have one thing in common: they only show you…
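One standard way to get exactly this behaviour is a z-score: compare today's mention count with the topic's historical baseline, so an always-hot topic gets a small score while a spike from a quiet baseline gets a large one. A minimal sketch (the window size and counts are illustrative):

import statistics

def buzz(history, today):
    # history: mention counts per day over a trailing window
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0   # guard against zero stdev
    return (today - mean) / stdev

print(buzz([500, 510, 490, 505], 520))   # always hot -> small buzz score
print(buzz([0, 1, 0, 2], 40))            # sudden spike -> large buzz score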

Cosine similarity and tf-idf

核能气质少年 submitted on 2019-11-27 09:52:47

Question: I am confused by the following comment about TF-IDF and cosine similarity. I was reading up on both, and then on the Wikipedia page for cosine similarity I found this sentence: "In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°." Now I'm wondering: aren't they two different things? Is tf-idf already inside the cosine…
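They are indeed two separate things: TF-IDF is a weighting scheme that produces the document vectors, and cosine similarity is the function that compares two such vectors. A minimal sketch with hypothetical weight vectors:

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical TF-IDF weight vectors over a shared vocabulary; because
# the weights are non-negative, the result always falls in [0, 1].
doc1 = [0.0, 1.2, 0.8]
doc2 = [0.5, 0.9, 0.0]
print(cosine(doc1, doc2))                # ~0.73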

How to detect duplicates among text documents and return the duplicates' similarity?

半世苍凉 submitted on 2019-11-27 09:35:28

I'm writing a crawler to get content from some websites, but the content can be duplicated, and I want to avoid that. So I need a function that returns the similarity percentage between two texts, to detect when two pieces of content may be duplicates. Example:

Text 1: "I'm writing a crawler to"
Text 2: "I'm writing a some text crawler to get"

The compare function should report that text 2 matches text 1 at 5/8 (where 5 is the number of words in text 2 that match text 1 in word order, and 8 is the total number of words in text 2). If we remove the "some text", then text 2 is the same as text 1 (I need to detect that situation too). How can I do that? You are facing a…
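A close fit for this word-order-aware comparison is the standard library's difflib: SequenceMatcher.ratio() returns the fraction of matching elements between two sequences, so run over word lists, inserted words like "some text" lower the score without destroying it. A minimal sketch using the question's own texts:

from difflib import SequenceMatcher

t1 = "I'm writing a crawler to".split()
t2 = "I'm writing a some text crawler to get".split()

# ratio() = 2 * (matching words) / (total words in both sequences)
print(SequenceMatcher(None, t1, t2).ratio())   # 10/13, about 0.77

t3 = "I'm writing a crawler to get".split()    # "some text" removed
print(SequenceMatcher(None, t1, t3).ratio())   # 10/11, about 0.91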

What is the default list of stopwords used in Lucene's StopFilter?

家住魔仙堡 submitted on 2019-11-27 07:51:19

Lucene has a default StopFilter ( http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html ); does anyone know which words are in the list? The default stop word set in StandardAnalyzer and EnglishAnalyzer comes from StopAnalyzer.ENGLISH_STOP_WORDS_SET, and the words are: "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with". StopFilter itself defines no default set of stop words.
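For reference, here is that same default set reproduced as a Python set, usable for quick stop word filtering outside Lucene (the filtering snippet itself is illustrative, not part of the answer):

ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

tokens = "the quick brown fox is at the door".split()
print([t for t in tokens if t not in ENGLISH_STOP_WORDS])
# -> ['quick', 'brown', 'fox', 'door']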

Computing similarity between two lists

两盒软妹~` submitted on 2019-11-27 05:40:42

Question: EDIT: as everyone is getting confused, I want to simplify my question. I have two ordered lists, and I just want to compute how similar one list is to the other. E.g.:

1, 7, 4, 5, 8, 9
1, 7, 5, 4, 9, 6

What is a good measure of similarity between these two lists, such that order matters? For example, should we penalize the similarity because 4 and 5 are swapped in the two lists? I have two systems: one state-of-the-art system and one system that I implemented. Given a query, both systems return a ranked list of…
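A standard measure here is Kendall's tau, which counts concordant versus discordant pairs and therefore penalises exactly the kind of 4/5 swap in the example. A minimal sketch, assuming SciPy is available and restricting to items that appear in both rankings:

from scipy.stats import kendalltau

list1 = [1, 7, 4, 5, 8, 9]
list2 = [1, 7, 5, 4, 9, 6]

# Keep only items present in both rankings, then correlate the position
# (rank) each shared item receives in the two lists.
shared = [x for x in list1 if x in set(list2)]
ranks1 = [list1.index(x) for x in shared]
ranks2 = [list2.index(x) for x in shared]

tau, p_value = kendalltau(ranks1, ranks2)
print(tau)   # 0.8 here; 1.0 = identical order, -1.0 = fully reversed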

How to parse the data from Google Alerts?

我怕爱的太早我们不能终老 submitted on 2019-11-27 03:21:16

First, how would you get Google Alerts information into a database other than by parsing the text of the email message that Google sends you? It seems that there is no Google Alerts API. If you must parse text, how would you go about parsing out the relevant pieces of the email message? When you create the alert, set "Deliver To" to "Feed"; then you can consume the feed XML as you would any other feed. This is much easier to parse and digest into a database.

class googleAlerts{
    public function createAlert($alert){
        $USERNAME = 'XXXXXX@gmail.com';
        $PASSWORD = 'YYYYYY';
        $COOKIEFILE = …
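Once the alert is delivered as a feed, any RSS/Atom library can do the parsing. A minimal sketch in Python using feedparser (the library choice and the feed URL are assumptions; the field names follow standard Atom entries):

import feedparser

FEED_URL = "https://www.google.com/alerts/feeds/XXXX/YYYY"   # hypothetical

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # Each entry maps naturally onto a database row.
    row = (entry.title, entry.link, entry.published)
    print(row)   # replace with an INSERT into your database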

Fast/Optimize N-gram implementations in python

被刻印的时光 ゝ submitted on 2019-11-27 02:05:48

Which n-gram implementation is fastest in Python? I've tried to profile nltk's vs scott's zip ( http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/ ):

from nltk.util import ngrams as nltkngram
import this, time

def zipngram(text, n=2):
    return zip(*[text.split()[i:] for i in range(n)])

text = this.s

start = time.time()
nltkngram(text.split(), n=2)
print time.time() - start

start = time.time()
zipngram(text, n=2)
print time.time() - start

[out]
0.000213146209717
6.50882720947e-05

Is there any faster implementation for generating n-grams in Python? Some attempts with some…
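One further possibility (my suggestion, not from the thread): zipngram above calls text.split() once per offset, so splitting once up front and zipping staggered islice views avoids that repeated work:

from itertools import islice

def iter_ngrams(text, n=2):
    # Split exactly once, then zip n staggered iterators over the words.
    words = text.split()
    return zip(*(islice(words, i, None) for i in range(n)))

bigrams = list(iter_ngrams("to be or not to be", n=2))
# -> [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]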