information-retrieval

Information Gain Calculation for a text file?

◇◆丶佛笑我妖孽 submitted on 2019-12-04 05:27:36
Question: I'm working on "text categorization using Information Gain, PCA and Genetic Algorithm". After performing preprocessing (stemming, stopword removal, TF-IDF) on the documents, I'm confused about how to move ahead with the information gain part. My output file contains each word and its TF-IDF value, like:

    WORD - TFIDF VALUE
    together - 0.235
    come - 0.2548

When using Weka for information gain (InfoGainAttributeEval.java), it requires an .arff file as input. Is there any way to…
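
A minimal sketch of the information-gain step, assuming each document has been reduced to a set of tokens and carries a class label; information gain treats each word as a binary present/absent feature, so the TF-IDF values would first be binarized or discretized (Weka's InfoGainAttributeEval handles numeric attributes by discretizing them). Function names below are illustrative, not Weka's:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(term, docs, labels):
        """IG(term) = H(class) - H(class | term present/absent).

        docs   -- list of token sets, one per document
        labels -- parallel list of class labels
        """
        with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
        without   = [lab for doc, lab in zip(docs, labels) if term not in doc]
        n = len(labels)
        conditional = sum(len(part) / n * entropy(part)
                          for part in (with_term, without) if part)
        return entropy(labels) - conditional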

Is there a better way to find set intersection for Search engine code?

怎甘沉沦 submitted on 2019-12-03 17:34:01
I have been coding up a small search engine and need to find out if there is a faster way to find set intersections. Currently, I am using a sorted linked list, as explained in most search engine algorithms: for every word I have a list of documents sorted by id, and I find the intersection among the lists. The performance profiling of the case is here. Any other ideas for a faster set intersection? An efficient way to do it is by "zig-zag". Assume your terms are a list T:

    lastDoc <- 0   // the first doc in the collection
    currTerm <- 0  // the first term in T
    while (lastDoc != infinity):
        …
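
A minimal Python sketch of that zig-zag strategy, using binary search to seek each sorted postings list forward to the current candidate (names are illustrative):

    from bisect import bisect_left

    def zigzag_intersect(postings):
        """Intersect sorted doc-id lists by seeking every list forward
        to a shared candidate doc id (the 'zig-zag' strategy).

        postings -- list of sorted lists of doc ids, one per query term.
        """
        if not postings or any(not p for p in postings):
            return []
        result = []
        candidate = max(p[0] for p in postings)
        while True:
            matched = True
            for plist in postings:
                # jump to the first doc id >= candidate in this list
                i = bisect_left(plist, candidate)
                if i == len(plist):
                    return result            # one list exhausted: done
                if plist[i] != candidate:
                    candidate = plist[i]     # new, larger candidate; retry
                    matched = False
                    break
            if matched:
                result.append(candidate)
                candidate += 1               # move past the match

The win over a plain pairwise merge is that a rare term pushes the candidate far ahead, letting the long lists be skipped through in logarithmic-sized jumps.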

Getting total term frequency throughout entire index (Elasticsearch)

↘锁芯ラ submitted on 2019-12-03 11:40:47
Question: I am trying to calculate the total number of times a particular term occurs throughout an entire index (the term's collection frequency). I have attempted to do so through the use of term vectors; however, this is restricted to a single document. Even in the case of terms that exist within a specified document, the response seems to max out at a certain doc_count (within field_statistics), which makes me doubtful of its accuracy. Request: http://myip:9200/clinicaltrials/trial/AVmk-ky6XMskTDwIwpih/…
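
The term vectors API can return index-wide statistics when asked for them: with "term_statistics": true, each term in the response carries ttf (total term frequency across all documents) and doc_freq. Note that these statistics come from the single shard holding the document, which would explain a doc_count that looks capped on a multi-shard index. A hedged sketch (the field name text is an assumption; the URL follows the question's index/type/id):

    import requests

    url = ("http://myip:9200/clinicaltrials/trial/"
           "AVmk-ky6XMskTDwIwpih/_termvectors")
    body = {
        "fields": ["text"],        # assumed field name
        "term_statistics": True,   # adds ttf and doc_freq per term
        "field_statistics": True,
    }
    resp = requests.post(url, json=body).json()
    for term, stats in resp["term_vectors"]["text"]["terms"].items():
        # ttf = occurrences of the term in ALL docs on that shard
        print(term, stats.get("ttf"), stats.get("doc_freq"))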

Wikipedia text download

大城市里の小女人 submitted on 2019-12-03 09:44:24
I am looking to download the full Wikipedia text for my college project. Do I have to write my own spider to download this, or is there a public dataset of Wikipedia available online? To give you some overview of my project: I want to find the interesting words of a few articles I am interested in. To find these interesting words, I am planning to apply tf-idf, calculating a score for each word and picking the ones with high scores. But to calculate the idf part, I need to know each word's total occurrences across the whole of Wikipedia. How can this be done? From Wikipedia: http://en.wikipedia.org…
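
No spider is needed: Wikipedia publishes complete database dumps at https://dumps.wikimedia.org (for English, the pages-articles XML dump). A minimal sketch of the corpus-wide counting, assuming the dump has already been converted into one plain-text file per article by some extraction tool:

    import math
    import os
    import re
    from collections import Counter

    def document_frequencies(text_dir):
        """For each word, count how many articles contain it.

        text_dir -- directory of plain-text files, one article per file.
        """
        df, n_docs = Counter(), 0
        for name in os.listdir(text_dir):
            with open(os.path.join(text_dir, name), encoding="utf-8") as f:
                words = set(re.findall(r"[a-z]+", f.read().lower()))
            df.update(words)   # each word counted once per article
            n_docs += 1
        return df, n_docs

    def idf(word, df, n_docs):
        return math.log(n_docs / (1 + df[word]))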

TFIDF calculating confusion

两盒软妹~` submitted on 2019-12-03 08:17:30
I found the following code on the internet for calculating TF-IDF: https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py I added "1 +" in the function def idf(word, documentList) so I won't get a division-by-zero error:

    return math.log(len(documentList) / (1 + float(numDocsContaining(word, documentList))))

But I am confused about two things: I get negative values in some cases; is this correct? And I am confused by lines 62, 63 and 64. Code:

    documentNumber = 0
    for word in documentList[documentNumber].split(None):
        words[word] = tfidf(word, documentList[documentNumber], documentList)

Should TF-IDF be…
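
On the first point: log(N / (1 + df)) goes negative whenever a word appears in every document, since N / (N + 1) < 1, so the negative values are a side effect of where the smoothing was added, not of the rest of the code. Lines 62-64 simply score every word of the first document (documentNumber = 0). A sketch of a variant that stays non-negative, using scikit-learn-style smoothing and the numDocsContaining helper from the linked file:

    import math

    def idf(word, documentList):
        """Smoothed idf: log((1 + N) / (1 + df)) + 1, never negative.

        Even a word in every document (df == N) scores exactly 1.
        """
        n = len(documentList)
        df = numDocsContaining(word, documentList)  # helper from tf-idf.py
        return math.log((1.0 + n) / (1.0 + df)) + 1.0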

What are some good methods to find the “relatedness” of two bodies of text?

∥☆過路亽.° submitted on 2019-12-03 07:12:32
Question: Here's the problem: I have a few thousand small text snippets, anywhere from a few words to a few sentences; the largest snippet is about 2 KB on disk. I want to be able to compare each snippet to every other and calculate a relatedness factor so that I can show users related information. What are some good ways to do this? Are there known algorithms for doing this that are any good? Are there any GPL'd solutions, etc.? I don't need this to run in real time, as I can precalculate everything. I'm more…
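
A common baseline is cosine similarity between tf-idf vectors, which fits the precalculation constraint well. A minimal sketch with scikit-learn (assumed available; the snippet texts are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    snippets = ["first snippet ...", "second snippet ..."]  # your texts

    # Vectorize every snippet once; all pairwise comparisons are then
    # just sparse dot products.
    matrix = TfidfVectorizer(stop_words="english").fit_transform(snippets)
    related = cosine_similarity(matrix)  # related[i][j] in [0, 1]

For a few thousand snippets the full similarity matrix is small enough to precompute and store outright.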

Python or Java for text processing (text mining, information retrieval, natural language processing) [closed]

两盒软妹~` submitted on 2019-12-03 05:17:31
Question: Closed. This question is opinion-based and is not currently accepting answers. Closed 6 years ago. I'm soon to start on a new project where I am going to do lots of text-processing tasks like searching, categorization/classification, clustering, and so on. There's going to be a huge number of documents that need to be processed, probably millions of documents. After the…

What are some alternatives to a bit array?

不打扰是莪最后的温柔 submitted on 2019-12-03 04:04:08
I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of "set" bits in the array varies widely, from all clear to all set. Currently, I'm using a straightforward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes. My plan is to look at the cardinality of the first N bits, then make a decision about which data structure to use for the remainder. Clearly some data structures are better for very sparse bit arrays, and others when roughly half the bits are set (when most bits are set, I can use negation to…
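
A sketch of that cardinality-based dispatch, with purely illustrative thresholds (the representation names are mine): very sparse arrays become sorted lists of set positions, mostly-set arrays store the complement, and everything in between stays a dense bitset:

    def choose_representation(bits, sample=100_000):
        """Pick a representation from the density of the first `sample` bits.

        bits -- an indexable sequence of 0/1 values.
        """
        head = bits[:sample]
        density = sum(head) / len(head)
        if density < 0.01:
            return "sorted-int-list"   # store positions of the SET bits
        if density > 0.99:
            return "negated-bitset"    # store positions of the CLEAR bits
        return "plain-bitset"          # dense java.util.BitSet-style array

Compressed bitmap formats take the same idea further by choosing a representation per fixed-size chunk; RoaringBitmap, for example, picks a container type per block of 2^16 bits.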

How can I extract only the main textual content from an HTML page?

血红的双手。 submitted on 2019-12-03 03:16:03
Question: Update: Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article but only links with short descriptions of the full texts (this is common in news portals), and I don't want to discard these short texts. So if an API can do this (extract the different textual parts/blocks, keeping each one separate rather than merged into a single text, which is not useful), please report it. The Question: I download…
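
For the block-by-block requirement, a rough sketch of boilerpipe's link-density idea that returns the blocks as a list instead of one merged text (the thresholds and tag choices are guesses):

    from bs4 import BeautifulSoup

    def text_blocks(html, max_link_density=0.5, min_words=5):
        """Return candidate content blocks as separate strings.

        A block is kept if it has enough words and a low enough share
        of linked text (link-heavy blocks are usually navigation).
        """
        soup = BeautifulSoup(html, "html.parser")
        blocks = []
        for node in soup.find_all(["p", "div", "li"]):
            text = node.get_text(" ", strip=True)
            if len(text.split()) < min_words:
                continue
            link_text = " ".join(a.get_text(" ", strip=True)
                                 for a in node.find_all("a"))
            if len(link_text) / max(len(text), 1) <= max_link_density:
                blocks.append(text)
        return blocks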

How to correct the user input (kind of Google “did you mean?”)

不想你离开。 submitted on 2019-12-03 02:50:12
Question: I have the following requirement: I have many (say 1 million) values (names). The user will type a search string. I don't expect the user to spell the names correctly, so I want to make a kind of Google "Did you mean". This will list all the possible values from my datastore. There is a similar but not identical question here; it did not answer my question. My question: 1) I think it is not advisable to store this data in an RDBMS, because then I won't be able to filter with SQL queries. And I have…
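
For the "did you mean" part, one well-known in-memory approach (Norvig's spelling corrector) generates every string within one edit of the query and intersects that set with the known names, sidestepping the RDBMS entirely. A minimal sketch:

    def edits1(word):
        """All strings exactly one edit away from `word`."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces   = [l + c + r[1:] for l, r in splits if r for c in letters]
        inserts    = [l + c + r for l, r in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def suggest(query, names):
        """names -- the ~1 million known values, held in a Python set."""
        if query in names:
            return [query]
        return sorted(edits1(query) & names)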