information-retrieval

How to create more complex Lucene query strings?

Submitted by 老子叫甜甜 on 2019-12-07 05:20:41
Question: This question is a spin-off from this question. My inquiry is two-fold, but because both parts are related I think it is a good idea to put them together. How do I programmatically create queries? I know I could build strings and have them parsed by the query parser, but from bits and pieces of information in other resources I gather there is a programmatic way to do this. What are the syntax rules for Lucene queries? --EDIT-- I'll give a requirement example for a query I
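In Java the programmatic route is Lucene's TermQuery/BooleanQuery API rather than string building, but when you do assemble query strings, the special characters must be escaped. A minimal Python sketch (function names are mine, not from any Lucene binding; the escape set is the query parser's documented special-character list):

```python
# Characters the Lucene query parser treats specially and that must be
# backslash-escaped inside a term.
LUCENE_SPECIALS = r'+-&|!(){}[]^"~*?:\/'

def escape(term: str) -> str:
    """Backslash-escape Lucene query-parser special characters."""
    return ''.join('\\' + c if c in LUCENE_SPECIALS else c for c in term)

def boolean_query(must=(), should=(), must_not=(), field="contents"):
    """Compose a query string using the +/- boolean-operator syntax.

    must     -> clauses prefixed with '+' (required)
    should   -> bare clauses (optional, affect scoring)
    must_not -> clauses prefixed with '-' (excluded)
    """
    parts = [f'+{field}:{escape(t)}' for t in must]
    parts += [f'{field}:{escape(t)}' for t in should]
    parts += [f'-{field}:{escape(t)}' for t in must_not]
    return ' '.join(parts)
```

For example, `boolean_query(must=["lucene"], must_not=["solr"])` yields `+contents:lucene -contents:solr`, which the query parser interprets the same way as a hand-built BooleanQuery with MUST and MUST_NOT clauses.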

Self-indexing (and traditional indexing) algorithms - Implementations and advice to share?

Submitted by 守給你的承諾、 on 2019-12-07 05:06:12
Question: As part of a research project I'm currently looking for open-source implementations of self-indexing algorithms, i.e. a compressed form of the traditional inverted index yielding nice characteristics such as faster lookup and/or lower space consumption. Do you know of any open-source implementations of self-indexing algorithms? Do you have other interesting takes on indexing algorithms or data structures to share? All languages and license variants are welcome. Answer 1: Here is a nice introductory
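A standard building block of compressed inverted indexes (a simpler cousin of true self-indexes) is gap encoding of sorted posting lists followed by variable-byte compression. A self-contained sketch of that classic scheme:

```python
def vb_encode(numbers):
    """Variable-byte encode: 7 data bits per byte; the high bit marks
    the final byte of each number (the textbook IR scheme)."""
    stream = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        chunk[-1] += 128          # continuation flag on the last byte
        stream.extend(chunk)
    return bytes(stream)

def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

def compress_postings(doc_ids):
    """Gap-encode a sorted postings list, then variable-byte encode the gaps.
    Gaps are small for frequent terms, so they compress well."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vb_encode(gaps)

def decompress_postings(blob):
    ids, total = [], 0
    for gap in vb_decode(blob):
        total += gap
        ids.append(total)
    return ids
```

This is not a self-index in the FM-index/wavelet-tree sense, but it illustrates the space/lookup trade-off the question is about: the compressed postings still support sequential decoding for intersection.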

Is there an algorithm for determining the relevance of a text to a theme?

Submitted by 一曲冷凌霜 on 2019-12-06 15:02:30
I want to know what can be used to determine the relevance of a page to a theme such as games, movies, etc. Is there research in this area, or does it come down to counting how many times relevant words appear? The common choice is supervised document classification on bag-of-words (or bag-of-n-grams) features, preferably with tf-idf weighting. Popular algorithms include Naive Bayes and (linear) SVMs. For this approach you'll need labeled training data, i.e. documents annotated with relevant themes. See, e.g., Introduction to Information Retrieval, chapters 13-15. Source: https://stackoverflow
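The tf-idf weighting the answer recommends is small enough to sketch directly. A minimal pure-Python version (using the common 1 + log(tf) sublinear term-frequency scaling; real pipelines would use scikit-learn's TfidfVectorizer):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc.

    Weight = (1 + log tf) * log(N / df): terms frequent in a document but
    rare in the collection score highest; terms in every document score 0.
    """
    N = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (1 + math.log(c)) * math.log(N / df[t])
                        for t, c in tf.items()})
    return weights
```

Those weight vectors are exactly the features one would feed to Naive Bayes or a linear SVM, with the theme labels as the supervision signal.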

Retrieve information from a URL

Submitted by ⅰ亾dé卋堺 on 2019-12-06 12:52:13
I want to make a program that will retrieve some information from a URL. For example, given the URL below, from LibraryThing, how can I retrieve all the words below the "TAGS" tab, like Black Library fantasy Thanquol & Boneripper Thanquol and Bone Ripper Warhammer? I am thinking of using Java and designing a data-mining wrapper, but I am not sure how to start. Can anyone give me some advice? EDIT: You gave me excellent help, but I want to ask something else. For every tag we can see how many times each tag has been used when we press the "number" button. How can I retrieve that number as well? You
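The asker mentions Java, but the extraction idea is the same in any language: parse the page's HTML and collect the text of the elements that carry the tags. A sketch with Python's standard-library parser, assuming (hypothetically — LibraryThing's real markup may differ) that each tag is an anchor whose class contains "tag":

```python
from html.parser import HTMLParser

class TagScraper(HTMLParser):
    """Collect the text of <a> elements whose class attribute
    contains 'tag' (an assumed markup convention, not LibraryThing's
    verified one)."""
    def __init__(self):
        super().__init__()
        self.in_tag_link = False
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and 'tag' in dict(attrs).get('class', ''):
            self.in_tag_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_tag_link = False

    def handle_data(self, data):
        if self.in_tag_link:
            self.tags.append(data.strip())
```

In Java the equivalent role is usually played by jsoup with a CSS selector; the per-tag counts from the EDIT would come from a sibling element selected the same way, once you inspect the actual markup.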

How to web scrape daily news once a day using Python?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-06 09:05:15
Question: I am trying to build an application for which I need a daily news feed from several websites. One way to do this is with Python's BeautifulSoup library. However, this works well for pages that keep their news on one static page. Consider a site like http://www.techcrunch.com: they show only their top headlines, and for more news you need to click "Read more". Several other news websites are similar. How do I extract such information and dump it into a file (.txt, .dmp, or any other
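Whatever fetches the headlines (BeautifulSoup, a feed parser, or a paginating crawler), the "once a day" part reduces to: run the scrape on a schedule (cron, or a `time.sleep(86400)` loop) and append only headlines not already stored. A sketch of the storage side, with the scraping function left abstract; file names are illustrative:

```python
import json
from pathlib import Path

def store_new_headlines(headlines, dump_path="headlines.txt", seen_path="seen.json"):
    """Append headlines we have not stored before; return only the new ones.

    A small JSON file of previously seen headlines acts as the dedup set,
    so running the job daily never duplicates entries in the dump file.
    """
    seen_file = Path(seen_path)
    seen = set(json.loads(seen_file.read_text())) if seen_file.exists() else set()
    new = [h for h in headlines if h not in seen]
    if new:
        with open(dump_path, "a", encoding="utf-8") as f:
            for h in new:
                f.write(h + "\n")
        seen_file.write_text(json.dumps(sorted(seen | set(new))))
    return new
```

For the "Read more" problem specifically, the usual options are following the pagination links in a loop, or preferring the site's RSS/Atom feed when one exists, since feeds expose more items than the landing page.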

Which open-source search engine should be used? [closed]

Submitted by 只愿长相守 on 2019-12-06 06:09:00
Closed. This question is opinion-based and is not currently accepting answers. Closed 6 years ago. My aim is to build an aggregator of news feeds and blog feeds so as to make searching/tracking of entities in it easy. I have been looking at many solutions out there, like Terrier, Lucene, SWISH-E, etc. Basically, I could find only two comparison studies of these engines, and one of them is somewhat outdated. Basically I want a search engine which would be used

What is the correct version of Average precision?

Submitted by ▼魔方 西西 on 2019-12-06 05:44:39
Question: I'm trying to compute the Average Precision (and Mean Average Precision) on the Oxford Buildings image dataset. Below is the code they provide for computing Average Precision. Notice that pos_set is the union of the "optimal" and "good" images from the ground truth set, while junk_set is a set of not-relevant images. void OxfordTest::computeAp(std::vector<std::string> &ranked_list){ float old_recall = 0.0; float old_precision = 1.0; float ap = 0.0; size_t intersect_size = 0; size_t
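The excerpt cuts off, but the structure of the Oxford evaluation is visible: junk images are skipped entirely, and AP is accumulated as the trapezoidal area under the precision-recall curve. A Python transcription of that scheme (my reconstruction of the loop body, following the variables shown above):

```python
def average_precision(ranked_list, pos_set, junk_set=frozenset()):
    """Trapezoidal AP as in the Oxford Buildings computeAp: junk results
    are ignored (neither helping nor hurting), and each relevant hit adds
    the area (d recall) * mean of old and new precision."""
    ap = 0.0
    old_recall, old_precision = 0.0, 1.0
    intersect = 0   # relevant results seen so far
    seen = 0        # non-junk results seen so far
    for item in ranked_list:
        if item in junk_set:
            continue
        if item in pos_set:
            intersect += 1
        seen += 1
        recall = intersect / len(pos_set)
        precision = intersect / seen
        ap += (recall - old_recall) * (old_precision + precision) / 2.0
        old_recall, old_precision = recall, precision
    return ap
```

Note this trapezoidal interpolation differs slightly from the other common definition (mean of precision at each relevant rank), which is one reason reported AP values on this dataset can disagree across papers.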

How to handle huge sparse matrices construction using Scipy?

Submitted by 老子叫甜甜 on 2019-12-05 18:31:29
So, I am working on a Wikipedia dump to compute the PageRanks of around 5,700,000 pages, give or take. The files are preprocessed and hence are not in XML. They are taken from http://haselgrove.id.au/wikipedia.htm and the format is: from_page(1): to(12) to(13) to(14).. from_page(2): to(21) to(22).. . . . from_page(5,700,000): to(xy) to(xz) and so on. So basically it's a construction of a [5,700,000 x 5,700,000] matrix, which would just break my 4 GB of RAM. Since it is very, very sparse, that makes it easier to store using scipy.sparse.lil_matrix or scipy.sparse.dok_matrix; now my issue is: How on earth do I
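For bulk construction like this, collecting (row, col) coordinate lists and building a COO matrix in one shot is usually far faster than filling a lil/dok matrix cell by cell. A sketch, assuming a simplified `src: dst dst ...` line format (the real dump's `from_page(1): to(12)` syntax would need an extra parsing step):

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_link_matrix(lines, n_pages):
    """Parse 'src: dst dst ...' adjacency lines into an n x n CSR matrix
    without ever materialising the dense 5.7M x 5.7M array: only the
    nonzero coordinates are held in memory."""
    rows, cols = [], []
    for line in lines:
        src, _, rest = line.partition(':')
        src = int(src)
        for dst in rest.split():
            rows.append(src)
            cols.append(int(dst))
    data = np.ones(len(rows), dtype=np.float32)
    # COO is cheap to build from coordinate lists; convert to CSR for
    # the fast matrix-vector products PageRank's power iteration needs.
    return coo_matrix((data, (rows, cols)), shape=(n_pages, n_pages)).tocsr()
```

With roughly 130M links at 4 bytes of data plus two 4-byte indices each, the whole structure stays on the order of a couple of gigabytes, inside the 4 GB budget; streaming the file line by line keeps the parser itself near-constant in memory.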

Return all tweets from my timeline

Submitted by ぐ巨炮叔叔 on 2019-12-05 17:54:32
I wish to return ALL the tweets I have ever posted on my timeline. I am using the LINQ to Twitter library like so - var statusTweets = from tweet in twitterCtx.Status where tweet.Type == StatusType.User && tweet.UserID == MyUserID && tweet.Count == 200 select tweet; statusTweets.ToList().ForEach( tweet => Console.WriteLine( "Name: {0}, Tweet: {1}\n", tweet.User.Name, tweet.Text)); This works fine and brings back the first 200. However, 200 seems to be the maximum I can retrieve per request; is there a way to bring back, say, 1000? There does not seem to be a next-cursor movement option like there is
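Twitter's timeline endpoints page not with cursors but with a MaxID: request a page, then re-request with max_id set to one below the lowest tweet ID received, until a page comes back empty. The pattern, sketched in Python against an abstract `fetch_page` callable (the real code would wire this to the C# MaxID clause in LINQ to Twitter):

```python
def fetch_all(fetch_page, page_size=200):
    """Drain a timeline via max_id paging: each round asks for tweets
    with IDs at or below (lowest ID seen so far - 1), so pages never
    overlap; an empty page signals the end."""
    all_tweets = []
    max_id = None
    while True:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            break
        all_tweets.extend(page)
        max_id = min(t['id'] for t in page) - 1
    return all_tweets
```

Note the API itself also caps how far back the user timeline reaches (historically around the most recent 3,200 tweets), so "ALL tweets ever" may be unreachable regardless of paging.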

How to remove OCR artifacts from text?

Submitted by 本秂侑毒 on 2019-12-05 13:32:24
OCR-generated texts sometimes come with artifacts, such as this one: Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem N a c h f o l g e r ö f f n e t , ist m i t d e m Messiasgeheimnis gemeint While it is not unusual for letter spacing to be used as emphasis (probably due to early printing-press limitations), it is unfavorable for retrieval tasks. How can one turn the above text into a more, say, canonical form, like: Diese grundsätzliche Verborgenheit Gottes, die sich nur dem Nachfolger öffnet, ist mit dem Messiasgeheimnis gemeint Can this be done
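One simple approach: match runs of two or more single letters separated by single spaces and join them, optionally gated by a lexicon so that genuine sequences of one-letter tokens are left alone. A sketch (the regex handles Unicode letters such as the German umlauts in the example):

```python
import re

# A run of 2+ single letters each followed by a space, then a final letter,
# e.g. "V e r b o r g e n h e i t". [^\W\d_] = any Unicode letter.
SPACED = re.compile(r'\b(?:[^\W\d_] ){2,}[^\W\d_]\b')

def collapse_spaced_words(text, lexicon=None):
    """Join letter-spaced runs into single words.

    If a lexicon (set of known words) is supplied, a run is joined only
    when the joined form is in it, which avoids mangling accidental
    sequences of real one-letter tokens.
    """
    def join(match):
        word = match.group(0).replace(' ', '')
        if lexicon is None or word in lexicon:
            return word
        return match.group(0)
    return SPACED.sub(join, text)
```

For ambiguous cases a dictionary (or a character-level language model scoring both readings) is the safer gate; the regex alone already recovers the example sentence.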