information-retrieval

RAKE with GENSIM

[亡魂溺海] posted on 2019-12-02 03:57:35
I am trying to calculate similarity. First, I used the RAKE library to extract keywords from the crawled jobs. Then I put the keywords of each job into a separate array and combined all those arrays into documentArray. documentArray = ['Anger command,Assertiveness,Approachability,Adaptability,Authenticity,Aggressiveness,Analytical thinking,Molecular Biology,Molecular Biology,Molecular Biology,molecular biology,molecular biology,Master,English,Molecular Biology,,Islamabad,Islamabad District,Islamabad Capital Territory,Pakistan,,Rawalpindi,Rawalpindi,Punjab,Pakistan'"], ['competitive
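As a rough illustration of the similarity step described above, the sketch below compares two keyword strings with cosine similarity over keyword counts. It uses only the Python standard library instead of gensim, so it is a stand-in and not the asker's method. The helper names and the second keyword string are made up for the example.

```python
import math
from collections import Counter


def keyword_counts(keyword_string):
    """Turn a comma-separated keyword string into a bag of lowercase keywords."""
    parts = (part.strip().lower() for part in keyword_string.split(","))
    # Skip the empty entries produced by sequences such as ",,".
    return Counter(part for part in parts if part)


def cosine_similarity(a, b):
    """Cosine similarity between two keyword Counters (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# First string: a shortened version of the keywords quoted in the question.
job_a = keyword_counts(
    "Analytical thinking,Molecular Biology,molecular biology,Master,English,,Islamabad,Pakistan"
)
# Second string: hypothetical, only here to have something to compare against.
job_b = keyword_counts("Molecular Biology,Master,English,Lahore,Pakistan")

print(round(cosine_similarity(job_a, job_b), 3))
```

Lowercasing before counting makes 'Molecular Biology' and 'molecular biology' count as the same keyword, which the repeated entries in the question suggest is wanted.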

Retrieve public statistics of video via youtube api

戏子无情 posted on 2019-12-01 07:03:18
Question: Is it possible to obtain the public statistics of a video? Using something like this I can get just the total views and like count of a video: https://www.googleapis.com/youtube/v3/videos?part=statistics&key=API_KEY&id=ekzHIouo8Q4 Is it possible to get those public statistics? I found this question, "Youtube GData API : Retrieving public statistics", but maybe something has changed? Answer 1: The only API call under Version 3 of the API that will get you statistics is the youtube.videos.list API. Try this API
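The request quoted in the question can be assembled programmatically. The sketch below only builds the URL and reads the statistics out of an already decoded JSON response; it makes no network call. The function names are hypothetical, the sample response is made up, and its shape (an "items" list whose entries carry a "statistics" object) is an assumption about the v3 response format.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def build_statistics_url(video_id, api_key):
    """Build a videos.list request URL asking only for the statistics part."""
    query = urlencode({"part": "statistics", "id": video_id, "key": api_key})
    return f"{API_ENDPOINT}?{query}"


def extract_statistics(response):
    """Return the statistics of the first returned video, or {} if there is none."""
    items = response.get("items", [])
    if not items:
        return {}
    return dict(items[0].get("statistics", {}))


url = build_statistics_url("ekzHIouo8Q4", "API_KEY")
print(url)

# Made-up response, only to show how the helper is used.
sample_response = {
    "items": [{"id": "ekzHIouo8Q4", "statistics": {"viewCount": "1000", "likeCount": "10"}}]
}
print(extract_statistics(sample_response))
```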

Why Lucene doesn't support any type of update to an existing document

ぃ、小莉子 posted on 2019-12-01 01:47:29
Question: My use case involves indexing a Lucene document, then on multiple future occasions adding terms that point to this existing doc, without deleting and re-adding the entire document for each new term (both for performance reasons and because the original terms are not kept). I do know that a document cannot be truly updated. My question is why? Or more precisely, why are all forms of updates (terms, stored fields) not supported? Why is it not possible to add another term to point to an existing document -

Boost fresh documents with Lucene

吃可爱长大的小学妹 posted on 2019-11-30 21:11:42
Does Lucene provide a means to boost fresh documents? For example, suppose that the Lucene document includes a date field. Is it possible, without requiring the user to alter her query in any way, to present the most recent documents with a higher score? I do not want to resort to a coarse "sort by date" solution, as it would completely cancel the scoring algorithm. Use Document.setBoost(float value) when putting documents into the index. You can either constantly re-adjust the value on existing documents, OR have a float value that increments with date, so that you only need to apply it to the time
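To make the idea of favoring fresh documents without discarding relevance concrete, here is a small language-neutral sketch in Python. It is not Lucene code and not the index-time setBoost approach from the answer: it multiplies an existing relevance score by a recency factor that decays with document age. The half-life, the weight, and all names are assumptions chosen for illustration.

```python
from datetime import date


def recency_boost(doc_date, today, half_life_days=30.0, weight=1.0):
    """Multiplier in (1, 1 + weight]: close to 1 + weight when new, near 1 when old."""
    age_days = max((today - doc_date).days, 0)
    return 1.0 + weight * 0.5 ** (age_days / half_life_days)


def rerank(hits, today):
    """hits: list of (doc_id, relevance_score, doc_date) tuples.

    Returns (doc_id, boosted_score) pairs, best first.
    """
    boosted = [
        (doc_id, score * recency_boost(doc_date, today))
        for doc_id, score, doc_date in hits
    ]
    return sorted(boosted, key=lambda hit: hit[1], reverse=True)


today = date(2019, 11, 30)
hits = [
    ("old-but-relevant", 1.4, date(2019, 1, 1)),
    ("fresh", 1.0, date(2019, 11, 29)),
]
for doc_id, score in rerank(hits, today):
    print(doc_id, round(score, 3))
```

Because the factor multiplies the relevance score instead of replacing it, a much more relevant old document can still outrank a weakly relevant new one, which is the property a plain sort by date loses.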

Facebook Graph Search: Information Retrieval Algorithm

不问归期 posted on 2019-11-30 19:07:38
Question: There is a closed question titled "How does Facebook Graph Search work?" In simplest terms, the OP asked (and even gave a sample of what he tried): how does Facebook Graph Search work? He gave an example: "Friends from France who likes England". How can the above be implemented as a real-world information retrieval problem? My answer did not fit in a comment, so I decided to re-frame the question and answer it properly in Stack Overflow Q&A style. Answer 1: From an implementation perspective

Python script to find word frequencies of a given document

巧了我就是萌 posted on 2019-11-30 16:44:07
I am looking for a simple script that can find the frequencies of words in a given document (probably using the Porter stemmer). Is there any library or simple script that does this?
Use nltk:
    import nltk
    YOUR_STRING = "Your words"
    words = YOUR_STRING.split()
    freq_dist = nltk.FreqDist(words)
    # words ordered from most to least frequent
    tokens = [word for word, _ in freq_dist.most_common()]
    # 50 most frequent
    most_frequent = tokens[:50]
    # 50 least frequent
    least_frequent = tokens[-50:]
You should be able to count words. Use a collections.Counter or a dict, depending on what you need. That part is easy, but if it isn't you can find the answer by
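The collections.Counter route mentioned in the second answer could look roughly like the sketch below. The tokenization rule (lowercase, then runs of letters, digits, and apostrophes) is an assumption, and no stemming is applied.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens in text; no stemming is applied."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


freqs = word_frequencies("The cat saw the other cat. The end.")
# most_common(n) returns the n most frequent (word, count) pairs.
print(freqs.most_common(2))
```

To add stemming, each token could be passed through a stemmer before counting.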

some ideas and direction of how to measure ranking, AP, MAP, recall for IR evaluation

笑着哭i posted on 2019-11-30 16:33:29
I have a question about how to evaluate whether information retrieval results are good or not, for example by calculating the rank of relevant documents, recall, precision, AP, MAP... Currently, the system is able to retrieve documents from the database once the user enters a query. The problem is that I do not know how to do the evaluation. I got a public data set, the "Cranfield collection" (dataset link). It contains 1. documents 2. queries 3. relevance assessments (Cranfield: DOCS 1,400; QRYS 225; SIZE* 1.6). May I know how to do the evaluation using the "Cranfield collection" to calculate the relevant document rank,
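The measures named in the question can be computed from two inputs per query: the ranked list of document IDs the system returned, and the set of IDs judged relevant (as in the Cranfield relevance assessments). The sketch below uses the usual textbook definitions; the function names and the toy data are made up for illustration and do not come from the Cranfield files.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of precision@rank over the ranks where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Toy data: two queries with hypothetical document IDs.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(precision_at_k(*runs[0], k=2), recall_at_k(*runs[0], k=2))
print(average_precision(*runs[0]), mean_average_precision(runs))
```

With a test collection, one such (ranked, relevant) pair is built for every query, and MAP is the average over all queries.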

Document search on partial words

你说的曾经没有我的故事 posted on 2019-11-30 13:03:52
I am looking for a document search engine (like Xapian, Whoosh, Lucene, Solr, Sphinx or others) which is capable of searching on partial terms. For example, when searching for the term "brit", the search engine should return documents containing either "britney" or "britain", or in general any document containing a word matching the pattern *brit*. Tangentially, I noticed that most engines use TF-IDF (term frequency-inverse document frequency) or its derivatives, which are based on full terms and not partial terms. Are there any other techniques that have been successfully implemented besides TF-IDF for document
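One common way to support matches like *brit* is to index character n-grams of each word instead of whole terms. The toy index below is a sketch of that idea using trigrams; it does not show how any of the engines named above implement it, and the class and method names are hypothetical.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All 3-character substrings of text (empty set if text is shorter than 3)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class SubstringIndex:
    """Toy trigram index answering 'which documents contain a word matching *q*'."""

    def __init__(self):
        self.doc_words = {}               # doc_id -> set of lowercase words
        self.postings = defaultdict(set)  # trigram -> set of doc_ids

    def add(self, doc_id, text):
        words = set(re.findall(r"\w+", text.lower()))
        self.doc_words[doc_id] = words
        for word in words:
            for gram in trigrams(word):
                self.postings[gram].add(doc_id)

    def search(self, fragment):
        fragment = fragment.lower()
        grams = trigrams(fragment)
        if grams:
            # Only documents containing every trigram of the fragment can match.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        else:
            # Fragments shorter than 3 characters fall back to a full scan.
            candidates = set(self.doc_words)
        # Verify, since the trigrams may have come from different words.
        return sorted(
            doc_id for doc_id in candidates
            if any(fragment in word for word in self.doc_words[doc_id])
        )


index = SubstringIndex()
index.add(1, "Britney Spears tour dates")
index.add(2, "Travel guide to Great Britain")
index.add(3, "How to build a brick wall")
print(index.search("brit"))
```

The trigram lookup narrows the candidate set cheaply, and the final substring check removes false positives.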

Good documentation on structure tcp_info [closed]

会有一股神秘感。 posted on 2019-11-30 05:19:28
I am working on getting the performance parameters of a TCP connection, and one of these parameters is the bandwidth. I intend to use the tcp_info structure, supported from Linux 2.6 onwards, which holds metadata about a TCP connection. The information can be retrieved using the getsockopt() function call on tcp_info. I have spent a lot of time looking for good documentation that explains all the parameters in that structure, but couldn't find any. I also tested a small program to retrieve the values from tcp_info for a TCP connection, where I found the measured MSS values for most of the
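For readers who want to inspect these values quickly, the sketch below reads tcp_info through getsockopt() from Python and unpacks the leading fields. The field order is an assumption based on the long-standing prefix of struct tcp_info in the Linux header linux/tcp.h (eight one-byte fields followed by 32-bit counters); newer kernels append more fields, which this sketch ignores. It is Linux-only, and the function names are hypothetical.

```python
import socket
import struct

# Assumed layout of the start of struct tcp_info: 8 one-byte fields, then 24 u32 fields.
TCP_INFO_FORMAT = "8B24I"
TCP_INFO_SIZE = struct.calcsize(TCP_INFO_FORMAT)

U32_FIELD_NAMES = (
    "tcpi_rto", "tcpi_ato", "tcpi_snd_mss", "tcpi_rcv_mss",
    "tcpi_unacked", "tcpi_sacked", "tcpi_lost", "tcpi_retrans", "tcpi_fackets",
    "tcpi_last_data_sent", "tcpi_last_ack_sent", "tcpi_last_data_recv",
    "tcpi_last_ack_recv", "tcpi_pmtu", "tcpi_rcv_ssthresh", "tcpi_rtt",
    "tcpi_rttvar", "tcpi_snd_ssthresh", "tcpi_snd_cwnd", "tcpi_advmss",
    "tcpi_reordering", "tcpi_rcv_rtt", "tcpi_rcv_space", "tcpi_total_retrans",
)


def parse_tcp_info(raw):
    """Unpack the 32-bit fields from a raw tcp_info buffer into a dict."""
    if len(raw) < TCP_INFO_SIZE:
        raise ValueError(f"expected at least {TCP_INFO_SIZE} bytes, got {len(raw)}")
    fields = struct.unpack(TCP_INFO_FORMAT, raw[:TCP_INFO_SIZE])
    # Skip the eight leading one-byte fields (state, options, window scales, ...).
    return dict(zip(U32_FIELD_NAMES, fields[8:]))


def read_tcp_info(sock):
    """Query the kernel for tcp_info on a TCP socket (Linux only)."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, TCP_INFO_SIZE)
    return parse_tcp_info(raw)
```

On a connected socket, `read_tcp_info(sock)["tcpi_snd_mss"]` would then give the sending MSS that the question mentions measuring.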

Boost fresh documents with Lucene

╄→гoц情女王★ posted on 2019-11-30 04:55:28
Question: Does Lucene provide a means to boost fresh documents? For example, suppose that the Lucene document includes a date field. Is it possible, without requiring the user to alter her query in any way, to present the most recent documents with a higher score? I do not want to resort to a coarse "sort by date" solution, as it would completely cancel the scoring algorithm. Answer 1: Use Document.setBoost(float value) when putting documents into the index. You can either constantly re-adjust the value on existing