information-retrieval

How to build a simple inverted index?

孤人 submitted on 2019-12-03 02:12:49
Question: I want to build a simple indexing function for a search engine without using any API such as Lucene. In the inverted index, I just need to record basic information about each word, e.g. docID, position, and frequency. Now, I have several questions: What kind of data structure is typically used to build an inverted index? A multidimensional list? After building the index, how do I write it to files? In what file format? Like a table? Like drawing an index table on paper? Answer 1: You can see a very simple implementation of an inverted index and search in TinySearchEngine. For your first question, if you want to
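A minimal sketch of the data structure in Python (an assumption for illustration; the original answer points to TinySearchEngine instead): map each term to a postings dict of docID to positions, so the per-document frequency is just the number of recorded positions, and the on-disk format can be one line per term.

```python
from collections import defaultdict

# term -> {docID: [positions]}; term frequency in a doc is len(positions)
index = defaultdict(dict)

def add_document(doc_id, text):
    for position, word in enumerate(text.lower().split()):
        index[word].setdefault(doc_id, []).append(position)

add_document(1, "the quick brown fox")
add_document(2, "the lazy dog and the quick cat")

print(index["quick"])        # {1: [1], 2: [5]}
print(len(index["the"][2]))  # frequency of "the" in doc 2 -> 2

# One simple file format, one line per term: term<TAB>docID:pos1,pos2<TAB>...
with open("index.txt", "w") as f:
    for term, postings in sorted(index.items()):
        cells = "\t".join(f"{d}:{','.join(map(str, p))}"
                          for d, p in postings.items())
        f.write(f"{term}\t{cells}\n")
```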

Getting total term frequency throughout entire index (Elasticsearch)

假装没事ソ submitted on 2019-12-03 02:09:29
I am trying to calculate the total number of times a particular term occurs throughout an entire index (term collection frequency). I have attempted to do so through the use of term vectors; however, these are restricted to a single document. Even in the case of terms that exist within a specified document, the response seems to max out at a certain doc_count (within field_statistics), which makes me doubtful of its accuracy. Request: http://myip:9200/clinicaltrials/trial/AVmk-ky6XMskTDwIwpih/_termvectors?term_statistics=true The document id being used here is "AVmk-ky6XMskTDwIwpih", although the
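For reference, the asker's request issued from Python (a sketch using the requests library; "field_name" is a placeholder, since the question does not name the indexed field). In the term-statistics response, "ttf" is the total term frequency; note that Elasticsearch computes these statistics only from the shard holding the document, which is one likely source of the inaccuracy observed.

```python
import requests

# URL taken verbatim from the question (host, index, type, doc id are the asker's)
url = ("http://myip:9200/clinicaltrials/trial/"
       "AVmk-ky6XMskTDwIwpih/_termvectors")

resp = requests.get(url, params={"term_statistics": "true"}).json()

# "field_name" is a placeholder for the actual indexed text field
for term, stats in resp["term_vectors"]["field_name"]["terms"].items():
    print(term, stats.get("ttf"))  # ttf = total term frequency (shard-local)
```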

How to select stop words using tf-idf? (non english corpus)

别等时光非礼了梦想. submitted on 2019-12-02 22:53:54
I have managed to evaluate the tf-idf function for a given corpus. How can I find the stop words and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document. Stop words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter out those that appear in more than 50% of them, or the top 500, or some type of threshold that you will have to tune. The best (as in
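A minimal sketch of that document-frequency filter in plain Python (the corpus and the 50% threshold here are illustrative assumptions):

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased a ball",
    "a bird flew over the house",
    "the fish swam in the tank",
]

# document frequency: how many documents each distinct term appears in
docs = [set(doc.split()) for doc in corpus]
df = Counter(term for doc in docs for term in doc)

# stop-word candidates: terms appearing in more than 50% of documents
threshold = 0.5 * len(docs)
stopword_candidates = {t for t, n in df.items() if n > threshold}
print(stopword_candidates)  # {'the'} -- it appears in all four documents
```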

What are some good methods to find the “relatedness” of two bodies of text?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-02 19:45:58
Here's the problem: I have a few thousand small text snippets, anywhere from a few words to a few sentences; the largest snippet is about 2k on disk. I want to be able to compare each snippet to every other and calculate a relatedness factor so that I can show users related information. What are some good ways to do this? Are there known algorithms for doing this that are any good? Are there any GPL'd solutions, etc.? I don't need this to run in real time, as I can precalculate everything. I'm more concerned with getting good results than with runtime. I just thought I would ask the Stack Overflow community
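One standard precompute-everything approach, sketched with scikit-learn (an assumption; the source does not prescribe a library): vectorize every snippet with TF-IDF and take pairwise cosine similarity as the relatedness factor.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

snippets = [
    "Python is great for text processing.",
    "Text processing in Python is straightforward.",
    "The weather today is sunny and warm.",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(snippets)

# N x N matrix of relatedness scores; precompute once, then look up pairs
scores = cosine_similarity(matrix)
print(scores[0, 1])  # relatively high: both snippets share python/text terms
print(scores[0, 2])  # near zero: no shared vocabulary
```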

Python or Java for text processing (text mining, information retrieval, natural language processing) [closed]

放肆的年华 submitted on 2019-12-02 18:33:11
I'm soon to start on a new project where I am going to do lots of text-processing tasks like searching, categorization/classification, clustering, and so on. There's going to be a huge number of documents to process, probably millions, and after the initial processing it also has to be updatable daily with multiple new documents. Can I use Python to do this, or is Python too slow? Is it best to use Java? If possible, I would prefer Python, since that's what I have been using lately; plus, I would finish the coding part much faster. But it all depends on Python's

How to calculate TF*IDF for a single new document to be classified?

折月煮酒 submitted on 2019-12-02 18:13:11
I am using document-term vectors to represent a collection of documents, with TF*IDF as the term weight in each document vector. I can then use this matrix to train a model for document classification, and later I will need to classify new documents. But in order to classify a new document, I need to turn it into a document-term vector first, and that vector should be composed of TF*IDF values, too. My question is, how can I calculate TF*IDF for just a single document? As far as I understand, TF can be calculated from the single document itself, but the IDF can only
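The usual resolution, sketched with scikit-learn as one concrete option (an assumption; any framework that stores the corpus IDF works): learn the vocabulary and IDF once from the training collection, then only transform the new document, so its TF comes from the document itself while the IDF is reused from training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

training_corpus = [
    "machine learning with python",
    "deep learning for text classification",
    "cooking recipes for beginners",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(training_corpus)  # learns vocabulary + IDF

# Classification time: TF from the new document, IDF reused from training.
# Terms unseen during training are dropped -- the model has no weight
# for them anyway.
new_doc = "python text classification"
x_new = vectorizer.transform([new_doc])
print(x_new.toarray())
```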

How can I extract only the main textual content from an HTML page?

China☆狼群 submitted on 2019-12-02 16:46:45
Update: Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages have no article, just links with short descriptions of the full texts (this is common on news portals), and I don't want to discard those short texts. So if there is an API that does this, i.e. returns the different textual parts/blocks of a page, splitting them up rather than merging everything into a single text (one merged text is not useful), please report it. The Question: I download some pages from random sites, and now I want to analyze the textual content of the page. The problem is
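A crude block-splitting sketch in the spirit of what the update asks for (using BeautifulSoup; the tag list and length threshold are arbitrary assumptions, not what Boilerpipe does): extract each block-level element's text as a separate string instead of one merged blob.

```python
from bs4 import BeautifulSoup

def text_blocks(html, min_chars=40):
    """Return candidate content blocks, one string per block-level element."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # drop obvious non-content elements
    blocks = []
    # leaf-ish elements only, to avoid repeating text from nested containers
    for el in soup.find_all(["p", "li", "h1", "h2", "h3"]):
        text = el.get_text(" ", strip=True)
        if len(text) >= min_chars:
            blocks.append(text)
    return blocks

print(text_blocks("<p>A headline paragraph long enough to keep around here.</p>"))
```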

Java Open Source Text Mining Frameworks [closed]

感情迁移 submitted on 2019-12-02 16:25:20
I want to know what the best open-source Java-based framework for text mining is, one that supports both machine learning and dictionary methods. I'm using Mallet, but there is not much documentation and I do not know if it will fit all my requirements. I honestly think that the several answers presented here are very good. However, to fulfill my requirements I have chosen to use Apache UIMA with ClearTK. It supports several ML methods and I do not have any licensing problems. Plus, I can write wrappers for other ML methodologies, and I get the advantage of the UIMA framework, which is very well

Information Gain Calculation for a text file?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-02 06:57:31
I'm working on "text categorization using information gain, PCA, and a genetic algorithm". But after performing preprocessing (stemming, stop-word removal, TF-IDF) on the documents, I'm confused about how to move ahead with the information gain part. My output file contains each word and its TF-IDF value, like: WORD - TF-IDF VALUE; together (word) - 0.235 (tf-idf value); come (word) - 0.2548 (tf-idf value). When using Weka for information gain ("InfoGainAttributeEval.java"), it requires the .arff file format as input. Is there any way to convert a text file into .arff format, or any other way to perform information gain other than Weka? Is
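ARFF is plain text, so the conversion can be done by hand; a minimal sketch (the vocabulary, values, and class labels are invented placeholders shaped like the asker's word/TF-IDF pairs):

```python
# One document per row: ({word: tfidf}, class_label)
docs = [
    ({"together": 0.235, "come": 0.2548}, "sports"),
    ({"together": 0.120, "come": 0.0}, "politics"),
]

vocabulary = sorted({w for features, _ in docs for w in features})
labels = sorted({label for _, label in docs})

with open("tfidf.arff", "w") as f:
    f.write("@relation tfidf_documents\n\n")
    for word in vocabulary:
        f.write(f"@attribute {word} numeric\n")
    f.write(f"@attribute class {{{','.join(labels)}}}\n\n@data\n")
    for features, label in docs:
        row = ",".join(str(features.get(w, 0.0)) for w in vocabulary)
        f.write(f"{row},{label}\n")
```

The resulting file can be fed straight to Weka's InfoGainAttributeEval.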