information-retrieval

Information retrieval (IR) vs data mining vs Machine Learning (ML)

魔方 西西 submitted on 2019-11-28 16:31:03

Question: People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them. From people with experience in these fields, what exactly draws the line between them?

Answer 1: This is just the view of one person (formally trained in ML); others might see things quite differently. Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied: it's limited to the pattern-extraction (or pattern-matching) algorithms…

What tried and true algorithms for suggesting related articles are out there?

爱⌒轻易说出口 submitted on 2019-11-28 03:04:48

Pretty common situation, I'd wager. You have a blog or news site with plenty of articles (or blags, or whatever you call them), and at the bottom of each you want to suggest others that seem related. Assume very little metadata about each item: no tags, no categories. Treat each item as one big blob of text, including the title and author name. How do you go about finding the possibly related documents? I'm rather interested in the actual algorithm, not ready-made solutions, although I'd be OK with taking a look at something implemented in Ruby or Python, or relying on MySQL or PostgreSQL…
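A common baseline for this is to represent each article as a TF-IDF vector and rank the other articles by cosine similarity. A minimal sketch, assuming scikit-learn is available (the library choice and the sample texts are illustrative, not from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: each entry is one article's title + body + author,
# treated as a single blob of text, as the question suggests.
docs = [
    "How to tune a guitar by ear",
    "Guitar tuning apps compared and reviewed",
    "Baking sourdough bread at home",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs)        # (n_docs, n_terms) sparse matrix

sims = cosine_similarity(matrix)          # pairwise document similarities

# Most related article to docs[0], excluding docs[0] itself.
best = max((i for i in range(len(docs)) if i != 0), key=lambda i: sims[0, i])
print(docs[best])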

Image retrieval system by colour from the web using C++ with openFrameworks

大憨熊 submitted on 2019-11-28 00:34:50

I am writing a program in C++ and openFrameworks that should hopefully implement an image retrieval system by colour matching. I have an algorithm to find the match in a database by an RGB value. For example, if I have a database of 1000 pictures on my computer and a query RGB value of 255,0,0, the program looks through the 1000 pictures and finds the closest match. However, my problem is that I want it to also look for the match on the web. I have been trying to find out how to get images from websites; however, if you don't know the specific URL of the image it's hard to get hold of the…
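For the local-database part, the "closest match" step usually reduces to a nearest-neighbour search in colour space. A minimal sketch (in Python rather than the question's C++, purely to show the distance metric; the names and data are hypothetical):

def colour_distance(c1, c2):
    # Squared Euclidean distance in RGB space; the square root is
    # unnecessary when we only need to rank candidates.
    return sum((a - b) ** 2 for a, b in zip(c1, c2))

def closest_match(query_rgb, database):
    # database: list of (image_path, average_rgb) pairs
    return min(database, key=lambda item: colour_distance(query_rgb, item[1]))

db = [("red.jpg", (250, 10, 5)), ("sky.jpg", (120, 180, 255))]
print(closest_match((255, 0, 0), db))     # -> ('red.jpg', (250, 10, 5))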

What is the best way to compute trending topics or tags?

家住魔仙堡 submitted on 2019-11-27 16:33:36

Many sites offer statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its "News Trends" section, where you can see the topics with the fastest-growing number of mentions. I want to compute such a "buzz" score for a topic, too. How could I do this? The algorithm should give less weight to topics that are always hot; topics that normally (almost) no one mentions should become the hottest ones when they spike. Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends"; all these services have one thing in common: they only show you…
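One standard way to get exactly this behaviour is a z-score: compare today's mention count with the topic's historical baseline, so an always-hot topic gets a small score while a spike from a quiet baseline gets a large one. A minimal sketch (the window size and counts are illustrative):

import statistics

def buzz(history, today):
    # history: mention counts per day over a trailing window
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0   # guard against zero stdev
    return (today - mean) / stdev

print(buzz([500, 510, 490, 505], 520))   # always hot -> small buzz score
print(buzz([0, 1, 0, 2], 40))            # sudden spike -> large buzz score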

Cosine similarity and tf-idf

核能气质少年 submitted on 2019-11-27 09:52:47

Question: I am confused by the following comment about TF-IDF and cosine similarity. I was reading up on both, and then on the Wikipedia page for cosine similarity I found this sentence: "In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°." Now I'm wondering: aren't they two different things? Is tf-idf already inside the cosine…
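They are indeed two separate things: TF-IDF is a weighting scheme that produces the document vectors, and cosine similarity is the function that compares two such vectors. A minimal sketch with hypothetical weight vectors:

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical TF-IDF weight vectors over a shared vocabulary; because
# the weights are non-negative, the result always falls in [0, 1].
doc1 = [0.0, 1.2, 0.8]
doc2 = [0.5, 0.9, 0.0]
print(cosine(doc1, doc2))                # ~0.73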

How to detect duplicates among text documents and return the duplicates' similarity?

半世苍凉 submitted on 2019-11-27 09:35:28

I'm writing a crawler to get content from some websites, but the content can be duplicated, and I want to avoid that. So I need a function that returns the similarity percentage between two texts, to detect when two pieces of content may be duplicates. Example:

Text 1: "I'm writing a crawler to"
Text 2: "I'm writing a some text crawler to get"

The compare function should report that text 2 matches text 1 at 5/8 (where 5 is the number of words in text 2 that match text 1 in word order, and 8 is the total number of words in text 2). If we remove the "some text", then text 2 is the same as text 1 (I need to detect that situation too). How can I do that? You are facing a…
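A close fit for this word-order-aware comparison is the standard library's difflib: SequenceMatcher.ratio() returns the fraction of matching elements between two sequences, so run over word lists, inserted words like "some text" lower the score without destroying it. A minimal sketch using the question's own texts:

from difflib import SequenceMatcher

t1 = "I'm writing a crawler to".split()
t2 = "I'm writing a some text crawler to get".split()

# ratio() = 2 * (matching words) / (total words in both sequences)
print(SequenceMatcher(None, t1, t2).ratio())   # 10/13, about 0.77

t3 = "I'm writing a crawler to get".split()    # "some text" removed
print(SequenceMatcher(None, t1, t3).ratio())   # 10/11, about 0.91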

What is the default list of stopwords used in Lucene's StopFilter?

家住魔仙堡 submitted on 2019-11-27 07:51:19

Lucene has a default StopFilter ( http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html ); does anyone know which words are in the list? The default stop word set in StandardAnalyzer and EnglishAnalyzer comes from StopAnalyzer.ENGLISH_STOP_WORDS_SET, and the words are: "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with". StopFilter itself defines no default set of stop words.
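For reference, here is that same default set reproduced as a Python set, usable for quick stop word filtering outside Lucene (the filtering snippet itself is illustrative, not part of the answer):

ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

tokens = "the quick brown fox is at the door".split()
print([t for t in tokens if t not in ENGLISH_STOP_WORDS])
# -> ['quick', 'brown', 'fox', 'door']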

Computing similarity between two lists

两盒软妹~` submitted on 2019-11-27 05:40:42

Question: EDIT: as everyone is getting confused, I want to simplify my question. I have two ordered lists, and I just want to compute how similar one list is to the other. E.g.:

1, 7, 4, 5, 8, 9
1, 7, 5, 4, 9, 6

What is a good measure of similarity between these two lists, such that order matters? For example, should we penalize the similarity because 4 and 5 are swapped in the two lists? I have two systems: one state-of-the-art system and one system that I implemented. Given a query, both systems return a ranked list of…
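A standard measure here is Kendall's tau, which counts concordant versus discordant pairs and therefore penalises exactly the kind of 4/5 swap in the example. A minimal sketch, assuming SciPy is available and restricting to items that appear in both rankings:

from scipy.stats import kendalltau

list1 = [1, 7, 4, 5, 8, 9]
list2 = [1, 7, 5, 4, 9, 6]

# Keep only items present in both rankings, then correlate the position
# (rank) each shared item receives in the two lists.
shared = [x for x in list1 if x in set(list2)]
ranks1 = [list1.index(x) for x in shared]
ranks2 = [list2.index(x) for x in shared]

tau, p_value = kendalltau(ranks1, ranks2)
print(tau)   # 0.8 here; 1.0 = identical order, -1.0 = fully reversed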

How to parse the data from Google Alerts?

我怕爱的太早我们不能终老 submitted on 2019-11-27 03:21:16

First, how would you get Google Alerts information into a database other than by parsing the text of the email message that Google sends you? It seems that there is no Google Alerts API. If you must parse text, how would you go about parsing out the relevant pieces of the email message? When you create the alert, set "Deliver To" to "Feed"; then you can consume the feed XML as you would any other feed. This is much easier to parse and digest into a database.

class googleAlerts{
    public function createAlert($alert){
        $USERNAME = 'XXXXXX@gmail.com';
        $PASSWORD = 'YYYYYY';
        $COOKIEFILE = …
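Once the alert is delivered as a feed, any RSS/Atom library can do the parsing. A minimal sketch in Python using feedparser (the library choice and the feed URL are assumptions; the field names follow standard Atom entries):

import feedparser

FEED_URL = "https://www.google.com/alerts/feeds/XXXX/YYYY"   # hypothetical

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # Each entry maps naturally onto a database row.
    row = (entry.title, entry.link, entry.published)
    print(row)   # replace with an INSERT into your database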

Fast/Optimize N-gram implementations in python

被刻印的时光 ゝ submitted on 2019-11-27 02:05:48

Which n-gram implementation is fastest in Python? I've tried to profile nltk's vs scott's zip ( http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/ ):

from nltk.util import ngrams as nltkngram
import this, time

def zipngram(text, n=2):
    return zip(*[text.split()[i:] for i in range(n)])

text = this.s

start = time.time()
nltkngram(text.split(), n=2)
print time.time() - start

start = time.time()
zipngram(text, n=2)
print time.time() - start

[out]
0.000213146209717
6.50882720947e-05

Is there any faster implementation for generating n-grams in Python? Some attempts with some…
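One further possibility (my suggestion, not from the thread): zipngram above calls text.split() once per offset, so splitting once up front and zipping staggered islice views avoids that repeated work:

from itertools import islice

def iter_ngrams(text, n=2):
    # Split exactly once, then zip n staggered iterators over the words.
    words = text.split()
    return zip(*(islice(words, i, None) for i in range(n)))

bigrams = list(iter_ngrams("to be or not to be", n=2))
# -> [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]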