information-retrieval

How to create more complex Lucene query strings?

Submitted by 我只是一个虾纸丫 on 2019-12-05 08:43:37
This question is a spin-off from this question. My inquiry is two-fold, but because both parts are related I think it is a good idea to put them together. 1) How do I programmatically create queries? I know I could build strings and have them parsed by the query parser, but from the bits and pieces of information I have gathered from other resources, there seems to be a programmatic way to do this. 2) What are the syntax rules for Lucene queries? --EDIT-- I'll give an example requirement for a query I would like to make: say I have 5 fields: First Name, Last Name, Age, Address, Everything. All fields are
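A minimal sketch of the programmatic route, assuming the standard org.apache.lucene.search API (the field names "firstName" and "lastName" are hypothetical): instead of building a query string for the QueryParser, you compose TermQuery objects and combine them with a BooleanQuery.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ProgrammaticQuerySketch {
    public static Query buildExample() {
        // One term per field; terms are not analyzed here, so lowercase them
        // the same way the indexing analyzer would.
        Query lastName = new TermQuery(new Term("lastName", "smith"));
        Query firstName = new TermQuery(new Term("firstName", "john"));

        // BooleanQuery.Builder is the Lucene 5.3+ style; older versions add
        // clauses to a BooleanQuery instance directly.
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.add(lastName, BooleanClause.Occur.MUST);    // behaves like AND
        builder.add(firstName, BooleanClause.Occur.SHOULD); // behaves like OR
        return builder.build();
    }
}

The resulting Query can be handed to an IndexSearcher exactly like a parsed query string, which sidesteps the query-parser syntax entirely.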

Is there a better way to find set intersection for Search engine code?

Submitted by ﹥>﹥吖頭↗ on 2019-12-05 05:18:29
Question: I have been coding up a small search engine and need to find out if there is a faster way to find set intersections. Currently, I am using a sorted linked list, as explained in most search engine algorithms: for every word I keep a sorted list of documents, and then find the intersection among those lists. The performance profiling of the case is here. Any other ideas for a faster set intersection? Answer 1: An efficient way to do it is by "zig-zag": assume your terms are a list T:
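For reference, here is a minimal two-pointer merge over two sorted postings lists (an illustration, not the answer's code); the zig-zag strategy generalizes it by always advancing the list whose current document ID is furthest behind, ideally jumping ahead with skip pointers or galloping search instead of stepping one element at a time.

import java.util.ArrayList;
import java.util.List;

public class PostingsIntersection {
    // Intersects two ascending lists of document IDs in O(m + n).
    public static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {            // document present in both lists
                result.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {      // advance whichever pointer is behind
                i++;
            } else {
                j++;
            }
        }
        return result;
    }
}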

Effective 1-5 grams extraction with python

Submitted by 本秂侑毒 on 2019-12-04 17:26:21
Question: I have a huge file of 3,000,000 lines, and each line has 20-40 words. I have to extract 1- to 5-grams from the corpus. My input files are tokenized plain text, e.g.: This is a foo bar sentence . There is a comma , in this sentence . Such is an example text . Currently I am doing it as below, but this doesn't seem to be an efficient way to extract the 1-5 grams: #!/usr/bin/env python -*- coding: utf-8 -*- import io, os from collections import Counter import sys; reload(sys); sys
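The question is Python 2, but the underlying technique is just a sliding window over each tokenized line; here is a rough sketch of the same idea in Java (counting 1- to 5-grams in a hash map, with names invented for the example):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class NgramCounter {
    // Adds the counts of all 1- to 5-grams of one whitespace-tokenized line.
    public static void countNgrams(String line, Map<String, Integer> counts) {
        String[] tokens = line.trim().split("\\s+");
        for (int n = 1; n <= 5; n++) {
            for (int start = 0; start + n <= tokens.length; start++) {
                String ngram = String.join(" ",
                        Arrays.copyOfRange(tokens, start, start + n));
                counts.merge(ngram, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        countNgrams("This is a foo bar sentence .", counts);
        System.out.println(counts);
    }
}

Streaming the 3,000,000 lines one at a time through a routine like this keeps memory bounded by the size of the n-gram vocabulary rather than by the corpus size.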

Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

Submitted by ≡放荡痞女 on 2019-12-04 16:02:15
I have a module based on Apache Lucene 5.5 / 6.0 which retrieves keywords. Everything is working fine except one thing — Lucene doesn't filter stop words. I tried to enable stop word filtering with two different approaches. Approach #1: tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet()); tokenStream.reset(); Approach #2: tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET); tokenStream.reset(); The full code is available here: https:/
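For comparison, here is a minimal self-contained analysis chain that does drop English stop words; it is a sketch against the Lucene 5.x/6.x package layout, not the poster's full pipeline, and the sample sentence is made up.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopWordDemo {
    public static void main(String[] args) throws Exception {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("The quick brown fox and the lazy dog"));

        // Lowercase before the stop filter: the default stop set is lowercase.
        TokenStream stream = new StopFilter(new LowerCaseFilter(tokenizer),
                EnglishAnalyzer.getDefaultStopSet());

        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString());  // "the" and "and" should be gone
        }
        stream.end();
        stream.close();
    }
}

Comparing a standalone chain like this against the two approaches above may help isolate whether the issue is in the filter chain itself or in how the tokens are consumed afterwards.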

Wikipedia text download

Submitted by 早过忘川 on 2019-12-04 15:35:29
Question: I am looking to download the full Wikipedia text for my college project. Do I have to write my own spider to download this, or is there a public dataset of Wikipedia available online? To give you some overview of my project, I want to find the interesting words of a few articles I am interested in. To find these interesting words, I am planning to apply tf-idf to calculate the term frequency for each word and pick the ones with high frequency. But to calculate the tf, I need to know the

TFIDF calculating confusion

Submitted by 本小妞迷上赌 on 2019-12-04 14:14:30
Question: I found the following code on the internet for calculating TFIDF: https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py I added "1+" in the function def idf(word, documentList) so I won't get a division-by-zero error: return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList)))) But I am confused about two things: I get negative values in some cases; is this correct? And I am confused by lines 62, 63 and 64. Code: documentNumber = 0 for word in documentList[documentNumber]
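On the negative values: with idf = log(N / (1 + df)), any word that appears in every document gives log(N / (N + 1)), which is below zero, so the negatives are an artifact of adding 1 in the denominator rather than a bug elsewhere. A hedged sketch of that arithmetic plus one common non-negative variant (in Java, not the linked script's code):

public class IdfExample {
    // The question's form: log(N / (1 + df)). Goes negative when df + 1 > N,
    // e.g. for a word that occurs in every document.
    public static double idfWithPlusOne(int numDocuments, int docFrequency) {
        return Math.log((double) numDocuments / (1.0 + docFrequency));
    }

    // A common non-negative alternative: log(1 + N / (1 + df)).
    public static double idfSmoothed(int numDocuments, int docFrequency) {
        return Math.log(1.0 + (double) numDocuments / (1.0 + docFrequency));
    }

    public static void main(String[] args) {
        System.out.println(idfWithPlusOne(10, 10)); // about -0.095: negative, as observed
        System.out.println(idfSmoothed(10, 10));    // about  0.647: stays non-negative
    }
}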

What is the correct version of Average precision?

Submitted by 让人想犯罪 __ on 2019-12-04 12:48:57
I'm trying to compute the Average Precision (and Mean Average Precision) on the Oxford Buildings image dataset. Below is the code they provide for computing Average Precision. Notice that pos_set is the union of the "optimal" and "good" images from the ground truth set, while junk_set is a set of not-relevant images. void OxfordTest::computeAp(std::vector<std::string> &ranked_list){ float old_recall = 0.0; float old_precision = 1.0; float ap = 0.0; size_t intersect_size = 0; size_t i = 0; size_t j = 0; for ( ; i<ranked_list.size(); ++i) { if(!pos_set.count(ranked_list[i])) std:
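The Oxford snippet integrates the precision-recall curve with a trapezoid rule; the other definition commonly called Average Precision simply averages precision at the rank of each relevant hit. A minimal sketch of that second definition (in Java; handling of the junk_set is omitted here):

import java.util.List;
import java.util.Set;

public class AveragePrecision {
    // Mean of precision@k over the ranks k at which a relevant item appears,
    // divided by the total number of relevant items.
    public static double averagePrecision(List<String> rankedList, Set<String> posSet) {
        int relevantSeen = 0;
        double sum = 0.0;
        for (int k = 0; k < rankedList.size(); k++) {
            if (posSet.contains(rankedList.get(k))) {
                relevantSeen++;
                sum += (double) relevantSeen / (k + 1);  // precision at rank k + 1
            }
        }
        return posSet.isEmpty() ? 0.0 : sum / posSet.size();
    }
}

The two definitions give slightly different numbers because of the interpolation choice, which is a frequent source of confusion when comparing implementations.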

Lemmatization of non-English words?

Submitted by 旧街凉风 on 2019-12-04 10:52:31
Question: I would like to apply lemmatization to reduce the inflectional forms of words. I know that for English, WordNet provides such functionality, but I am also interested in applying lemmatization to Dutch, French, Spanish and Italian words. Is there any trustworthy and confirmed way to go about this? Thank you! Answer 1: Try the pattern library from CLIPS; it has support for German, English, Spanish, French and Italian. Just what you need: http://www.clips.ua.ac.be/pattern Unfortunately it

What are some alternatives to a bit array?

Submitted by 只谈情不闲聊 on 2019-12-04 09:42:56
Question: I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of "set" bits in an array varies widely, from all clear to all set. Currently, I'm using a straightforward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes. My plan is to look at the cardinality of the first N bits, then make a decision about what data structure to use for the remainder. Clearly some data structures are better for very sparse
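As an illustration of the cardinality-based decision described above (a sketch only: the 1% threshold, the sample size, and the sparse fallback are invented for the example), one can sample the first N bits of a java.util.BitSet and switch to a sorted array of set-bit indices when the data looks sparse:

import java.util.BitSet;

public class BitStorageChooser {
    // Rough density estimate from the first sampleBits bits of the set.
    public static double sampleDensity(BitSet bits, int sampleBits) {
        return (double) bits.get(0, sampleBits).cardinality() / sampleBits;
    }

    public static Object choose(BitSet bits, int totalBits) {
        int sampleBits = Math.min(totalBits, 1 << 20);
        if (sampleDensity(bits, sampleBits) < 0.01) {
            // Sparse: store only the indices of set bits (4 bytes per set bit,
            // versus 1 bit per position for the dense representation).
            return bits.stream().toArray();
        }
        return bits;  // Dense enough: keep the plain BitSet.
    }
}

Compressed bitmap libraries (for example RoaringBitmap, or WAH/EWAH encodings) make this adaptive choice internally, which is usually the simpler route.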

Java Open Source Text Mining Frameworks [closed]

Submitted by 一个人想着一个人 on 2019-12-04 07:46:26
Question: (Closed 8 years ago as not a good fit for the Q&A format.) I want to know what is the best open-source, Java-based framework for text mining, using both machine learning and dictionary methods.