information-retrieval

How can IDF be different for several documents?

时光毁灭记忆、已成空白 submitted on 2020-01-06 02:58:25
Question: I am using LETOR to build an information retrieval system. It uses TF and IDF. I am sure TF is query-dependent, and IDF should be too, but: "Note that IDF is document independent, and so all the documents under a query have same IDF values." That does not make sense to me, because IDF is part of the feature list. How will IDF for each document be calculated? Answer 1: IDF is term-specific. The IDF of any given term is document-independent, but the TF is document-specific. To say it differently. Let's
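
A minimal sketch of the distinction in plain Python (toy corpus and the textbook log(N/df) formula, not LETOR's exact feature definitions): the IDF of a term is computed once over the whole collection, so it is the same number for every document scored under a query, while TF varies per document.

import math
from collections import Counter

# Toy corpus: IDF is computed once per term over the whole collection,
# while TF is computed per document.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
N = len(docs)

def idf(term):
    # Document frequency: number of documents containing the term.
    df = sum(1 for d in docs if term in d)
    return math.log(N / df) if df else 0.0

def tf(term, doc):
    # Raw term frequency within a single document.
    return Counter(doc)[term]

# Same IDF value no matter which document the term is paired with;
# only the TF component changes per document.
for i, d in enumerate(docs):
    print(i, tf("cat", d), idf("cat"))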

Ways to create a huge inverted index

送分小仙女 submitted on 2020-01-03 03:05:12
Question: I want to create a big inverted index of around 10^6 terms. What method would you suggest? I'm thinking of fast binary key-value store DBs like Tokyo Cabinet, Voldemort, etc. Edit: I've tried MySQL in the past, storing a table of two integers to represent the inverted index, but even with an index on the first column, queries were very slow. I think for these situations a SQL database has too much overhead: transactions, query parsing, etc. I'm searching for what technologies
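
For illustration, the data structure itself is just a mapping from term to postings list; a minimal in-memory Python sketch (a key-value store such as Tokyo Cabinet would essentially persist this same mapping, with the term as the key and the postings as the value):

from collections import defaultdict

# Minimal inverted index: term -> sorted list of document IDs (postings).
def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Store postings as sorted lists so lookups and merges stay cheap.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "fast key value stores", 2: "inverted index of terms", 3: "fast inverted index"}
index = build_inverted_index(docs)
print(index["inverted"])   # -> [2, 3]
print(index["fast"])       # -> [1, 3]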

Is it possible to query Elastic Search with a feature vector?

拟墨画扇 submitted on 2020-01-01 08:51:13
Question: I'd like to store an n-dimensional feature vector, e.g. <1.00, 0.34, 0.22, ..., 0>, with each document, and then provide another feature vector as a query, with the results sorted in order of cosine similarity. Is this possible with Elastic Search? Answer 1: I don't have an answer specific to Elastic Search because I've never used it (I use Lucene, on which Elastic Search is built). However, I can give a generic answer to your question. There are two standard ways to obtain the nearest
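
Independent of the engine, the brute-force baseline is simply to score every stored vector against the query by cosine similarity and sort. A minimal numpy sketch with made-up vectors (note, as an aside, that newer Elasticsearch releases also ship a dense_vector field type for storing such vectors):

import numpy as np

# Client-side cosine-similarity ranking: score every document vector
# against the query vector and return indices sorted best-first.
doc_vectors = np.array([
    [1.00, 0.34, 0.22, 0.00],
    [0.10, 0.95, 0.00, 0.30],
    [0.88, 0.40, 0.15, 0.05],
])
query = np.array([1.00, 0.30, 0.20, 0.00])

def cosine_rank(query, docs):
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims), sims   # document indices (best first), raw scores

order, sims = cosine_rank(query, doc_vectors)
print(order, sims[order])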

How to calculate “Average Precision and Ranking” for CBIR system

試著忘記壹切 submitted on 2019-12-30 07:04:14
Question: So far, I have implemented a basic CBIR system using RGB histograms. Now, I am trying to generate average precision and ranking curves. I need to know: is my formula for average precision correct, and how do I calculate average rankings? Code:
% Dir: parent directory location for images folder c1, c2, c3
% inputImage: \c1\1.ppm
% For example to get P-R curve execute: CBIR('D:\visionImages','\c2\1.ppm');
function [ ] = demoCBIR( Dir,inputImage)
% Dir='D:\visionImages';
% inputImage='\c3\1.ppm';
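
For the average-precision part of the question, a minimal sketch in Python (not a correction of the MATLAB code above, which is cut off): AP averages precision@k over the ranks k at which relevant images appear.

# "rel" is the retrieved list in ranked order, 1 = relevant, 0 = not relevant.
# AP = mean of precision@k at each rank k where a relevant item appears,
# divided here by the number of relevant items in the retrieved list; if
# relevant items can be missing from the ranking, divide by the total number
# of relevant items in the collection instead.
def average_precision(rel):
    hits, precisions = 0, []
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(1, sum(rel))

# Example: relevant images retrieved at ranks 1, 3 and 6.
print(average_precision([1, 0, 1, 0, 0, 1, 0]))  # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722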

Some ideas and directions on how to measure ranking, AP, MAP, and recall for IR evaluation

戏子无情 submitted on 2019-12-30 05:35:19
Question: I have a question about how to evaluate whether information retrieval results are good or not, i.e. how to calculate relevant-document rank, recall, precision, AP, MAP, and so on. Currently, the system is able to retrieve documents from the database once the user enters a query. The problem is that I do not know how to do the evaluation. I got a public data set, the "Cranfield collection" (dataset link); it contains 1. documents, 2. queries, 3. relevance assessments (Cranfield: 1,400 docs, 225 queries, 1.6 MB). May I
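
A sketch of how such an evaluation is usually wired up against a test collection like Cranfield, assuming qrels (the relevance assessments) and a run (the system's ranked output) per query; the IDs and numbers below are made up for illustration.

# qrels: query_id -> set of relevant doc ids (from the relevance assessments)
# runs:  query_id -> ranked list of retrieved doc ids (the system's output)
def evaluate(qrels, runs):
    aps = []
    for qid, ranked in runs.items():
        relevant = qrels.get(qid, set())
        hits, precisions = 0, []
        for k, doc in enumerate(ranked, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / k)
        precision = hits / len(ranked) if ranked else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        ap = sum(precisions) / len(relevant) if relevant else 0.0
        aps.append(ap)
        print(f"query {qid}: P={precision:.3f} R={recall:.3f} AP={ap:.3f}")
    print("MAP =", sum(aps) / len(aps))   # mean of per-query AP values

qrels = {1: {"d3", "d7", "d9"}, 2: {"d1"}}
runs = {1: ["d3", "d5", "d7", "d2"], 2: ["d4", "d1"]}
evaluate(qrels, runs)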

Python script to find word frequencies of a given document

谁都会走 submitted on 2019-12-30 05:31:05
Question: I am looking for a simple script that can find the frequencies of words in a given document (probably using a Porter stemmer). Is there any library or simple script that does this? Answer 1: Use nltk:
import nltk
YOUR_STRING = "Your words"
words = YOUR_STRING.split()
freq_dist = nltk.FreqDist(words)
# 50 most frequent
most_frequent = freq_dist.most_common(50)
# 50 least frequent
least_frequent = freq_dist.most_common()[-50:]
Answer 2: You should be able to count words. Use a collections
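
The second answer is cut off after "Use a collections"; presumably it refers to collections.Counter from the standard library. A minimal sketch along those lines (document.txt is a placeholder path):

from collections import Counter
import re

# Presumably the cut-off answer meant collections.Counter; this counts word
# frequencies in a text file using a crude regex tokenizer.
text = open("document.txt", encoding="utf-8").read()   # hypothetical file
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)
print(counts.most_common(50))                           # 50 most frequent words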

Clustering from the cosine similarity values

…衆ロ難τιáo~ submitted on 2019-12-30 05:25:08
Question: I have extracted words from a set of URLs and calculated the cosine similarity between each URL's contents. I have also normalized the values to 0-1 (using min-max normalization). Now I need to cluster the URLs based on the cosine similarity values to find similar URLs. Which clustering algorithm would be most suitable? Please suggest a dynamic clustering method, because it would be useful since I could increase the number of URLs on demand, and it would also be more natural. Please correct me if you feel I'm
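
The snippet is cut off before any answer, so as one common option (not something taken from the thread itself): agglomerative hierarchical clustering works directly on a precomputed cosine-distance matrix (distance = 1 - similarity) and lets the number of clusters fall out of where the dendrogram is cut. A SciPy sketch with a made-up 4x4 similarity matrix:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Made-up pairwise cosine similarities between 4 URLs.
sim = np.array([
    [1.00, 0.90, 0.20, 0.15],
    [0.90, 1.00, 0.25, 0.10],
    [0.20, 0.25, 1.00, 0.85],
    [0.15, 0.10, 0.85, 1.00],
])
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)

# linkage() expects a condensed distance vector, hence squareform().
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")   # cut the dendrogram at 0.5
print(labels)   # e.g. [1 1 2 2]: URLs 0,1 form one cluster and 2,3 another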

Good documentation on structure tcp_info [closed]

点点圈 submitted on 2019-12-30 01:06:06
Question: I am working on getting the performance parameters of a TCP connection, and one of these parameters is the bandwidth. I intend to use the tcp_info structure, supported from Linux 2.6 onwards, which holds metadata about a TCP connection. The information can be retrieved using the getsockopt() function call
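
The same getsockopt() call can be sketched from Python on Linux, where CPython exposes socket.TCP_INFO; only the first field (tcpi_state, a single byte in linux/tcp.h) is unpacked here, since the full struct layout varies by kernel version, and example.com:80 is just a placeholder peer.

import socket
import struct

# Read the raw struct tcp_info bytes for an open TCP socket (Linux only).
sock = socket.create_connection(("example.com", 80))   # hypothetical peer
raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 256)

# The first member of struct tcp_info is tcpi_state (an unsigned byte);
# a real tool would unpack more fields against the header of its target kernel.
(tcpi_state,) = struct.unpack_from("B", raw, 0)
print("tcp_info bytes:", len(raw), "state:", tcpi_state)
sock.close()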

Calculating tf-idf among documents using python 2.7

非 Y 不嫁゛ submitted on 2019-12-29 08:08:27
Question: I have a scenario where I have retrieved information/raw data from the internet and placed it into respective JSON or .txt files. From there, I would like to calculate the frequencies of each term in each document and their cosine similarity using tf-idf. For example: there are 50 different documents/text files consisting of 5000 words/strings each. I would like to take the first word from the first document/text, compare it against all 250,000 words in total, find its frequencies, then do
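
A sketch of the usual shortcut for this workload using scikit-learn's TfidfVectorizer plus cosine_similarity (written for Python 3 and current scikit-learn rather than the Python 2.7 mentioned in the question; the corpus/*.txt glob is a placeholder):

import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load each .txt file as one document (hypothetical glob pattern).
paths = sorted(glob.glob("corpus/*.txt"))
texts = [open(p, encoding="utf-8").read() for p in paths]

# TfidfVectorizer handles tokenization, term frequencies and IDF weighting;
# the result is an (n_docs x n_terms) sparse matrix of tf-idf weights.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)

# Pairwise cosine similarity between all documents (n_docs x n_docs).
sims = cosine_similarity(tfidf)
print(sims.round(3))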

How can I retrieve my Google search history?

空扰寡人 submitted on 2019-12-29 06:55:16
Question: In the Google Web History interface I can see all the search queries I have used over the years, and the pages I visited for a particular query. Is there a way I can retrieve this history using a computer program? I couldn't find a Google API that does it. Do you know of a tool that can do this, or can you suggest a way to achieve it? Answer 1: There's an RSS feed. Update: the link is now broken. Answer 2: The RSS feed in the accepted answer above does not exist anymore. Google does not provide an API that