information-retrieval

Java NLP: Extracting Indices When Tokenizing Text

Submitted by 六眼飞鱼酱① on 2021-02-20 04:54:46
Question: When tokenizing a string of text, I need to extract the indices of the tokenized words. For example, given: "Mary didn't kiss John" I would need something like: [(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)] where 0, 5, 8, 12 and 17 correspond to the index (in the original string) where each token began. I cannot rely on whitespace alone, since some words become two tokens. Nor can I simply search for each token in the string, since a word may well appear multiple times. …
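
One minimal, library-free sketch (not the poster's code, and the regex is an assumption): a pattern-based tokenizer that records each token's start offset via Matcher.start() and splits contractions such as "didn't" into "did" and "n't". A real NLP tokenizer (Stanford CoreNLP, for instance, which I believe exposes character offsets on its tokens) would cover many more cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetTokenizer {
    // Match "did" before "n't", the clitic "n't" itself, plain words, or single punctuation marks.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z]+(?=n't\\b)|n't\\b|\\w+|[^\\w\\s]");

    public static void main(String[] args) {
        String text = "Mary didn't kiss John";
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // m.start() is the token's index in the original string.
            System.out.println("(" + m.group() + ", " + m.start() + ")");
        }
        // Prints: (Mary, 0) (did, 5) (n't, 8) (kiss, 12) (John, 17)
    }
}
```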

ElasticSearch: obtaining individual scores from each query inside of a bool query

Submitted by 喜你入骨 on 2021-02-11 14:48:01
Question: Assume I have a compound bool query with various "must" and "should" clauses that may each include different leaf queries, including "multi-match" and "match_phrase" queries, such as below. How can I get the score from the individual queries packed into a single query? I know one way could be to break it down into multiple queries, execute each, and then aggregate the results at the code level (not the query level). However, I suppose that is less efficient, and I also lose sorting/pagination/… features. …
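
Not from the original post, but one hedged sketch of a workaround, assuming a local cluster at localhost:9200 and a hypothetical index named articles: tag each clause with "_name" and request "explain": true, so each hit comes back with the named clauses that matched and an "_explanation" tree from which per-clause score contributions can be read. The exact response shape depends on the Elasticsearch version.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BoolQueryScores {
    public static void main(String[] args) throws Exception {
        // "_name" labels each clause; "explain" asks for the score breakdown per hit.
        String body = """
            {
              "explain": true,
              "query": {
                "bool": {
                  "must": [
                    { "multi_match": { "query": "neural search", "fields": ["title", "body"], "_name": "mm_clause" } }
                  ],
                  "should": [
                    { "match_phrase": { "title": { "query": "information retrieval", "_name": "phrase_clause" } } }
                  ]
                }
              }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/articles/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Each hit carries "matched_queries" and an "_explanation" tree with per-clause details.
        System.out.println(response.body());
    }
}
```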

Estimate Dictionary size using Zipf’s Law

Submitted by 倾然丶 夕夏残阳落幕 on 2021-02-05 12:20:47
Question: How would one go about calculating the dictionary size (i.e., the number of unique words) of a collection using Zipf's Law? Answer 1: You will have to tokenize your collection, e.g. by whitespace and punctuation. Then store all the tokens in a hash and count them. Finally, plot the distribution of the counts using a tool like Gnuplot. Source: https://stackoverflow.com/questions/47543798/estimate-dictionary-size-using-zipf-s-law
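
A small sketch along the lines of that answer (the file name and the tokenization rule are assumptions): tokenize on non-letters, count tokens in a HashMap, report the vocabulary size, and dump rank–frequency pairs that can be plotted on log–log axes with Gnuplot to check the Zipf fit.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class VocabularyCounter {
    public static void main(String[] args) throws Exception {
        String text = Files.readString(Paths.get("collection.txt")).toLowerCase();
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter (crude whitespace/punctuation tokenization).
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        System.out.println("# vocabulary size (unique words): " + counts.size());
        // rank  frequency  -- suitable for a log-log plot in Gnuplot
        int rank = 1;
        for (int freq : counts.values().stream().sorted(Comparator.reverseOrder()).toList()) {
            System.out.println(rank++ + " " + freq);
        }
    }
}
```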

Understanding Recall and Precision

Submitted by 陌路散爱 on 2020-08-02 04:39:48
Question: I am currently learning information retrieval and I am rather stuck on an example about recall and precision. A searcher uses a search engine to look for information. There are 10 documents on the first screen of results and 10 on the second. Assume there are known to be 10 relevant documents in the search engine's index. So there are 20 retrieved documents altogether, of which 10 are relevant. Can anyone help me make sense of this? Thanks. Answer 1: Recall and precision measure the quality of your result. …
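
To make the definitions concrete, a small sketch (the document IDs are invented): precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. If all 10 relevant documents happen to appear among the 20 retrieved ones, precision would be 10/20 = 0.5 and recall 10/10 = 1.0.

```java
import java.util.HashSet;
import java.util.Set;

public class PrecisionRecall {
    public static void main(String[] args) {
        // 20 retrieved documents (two result screens of 10), 10 known relevant documents.
        Set<Integer> retrieved = new HashSet<>();
        for (int id = 1; id <= 20; id++) retrieved.add(id);
        Set<Integer> relevant = new HashSet<>();
        for (int id = 1; id <= 10; id++) relevant.add(id);

        Set<Integer> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);                                     // relevant AND retrieved

        double precision = (double) hits.size() / retrieved.size();  // 10 / 20 = 0.5
        double recall = (double) hits.size() / relevant.size();      // 10 / 10 = 1.0
        System.out.println("precision = " + precision + ", recall = " + recall);
    }
}
```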

PHP: Taking Array (CSV) And Intelligently Returning Information

Submitted by 北慕城南 on 2020-01-17 00:53:30
Question: Hey everyone. I'm a first-time poster, but I've browsed this site a number of times. I have a coding issue that I'm not sure exactly how to solve. First I'll explain what I need to do and what information I have, and I hope somebody can give me a nudge in the right direction. What I have is a spreadsheet (CSV) with the following info: Zone Name, Zip Code, City Name. One zone should have many cities that fall under it, and every city most likely has many zip codes that fall under it. For …
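
The question trails off here and is about PHP; as a language-agnostic illustration of the lookup structure described (the file name, column order, and zone name are assumptions), here is a sketch that reads Zone, Zip, City rows into a zone → city → zip-codes mapping.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ZoneIndex {
    public static void main(String[] args) throws Exception {
        // zone -> city -> list of zip codes
        Map<String, Map<String, List<String>>> zones = new LinkedHashMap<>();
        for (String line : Files.readAllLines(Paths.get("zones.csv"))) {
            String[] cols = line.split(",");   // assumed column order: Zone Name, Zip Code, City Name
            if (cols.length < 3) continue;
            String zone = cols[0].trim(), zip = cols[1].trim(), city = cols[2].trim();
            zones.computeIfAbsent(zone, z -> new LinkedHashMap<>())
                 .computeIfAbsent(city, c -> new ArrayList<>())
                 .add(zip);
        }
        // Look up every city and zip code that falls under a given zone.
        System.out.println(zones.get("Zone A"));
    }
}
```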

How to remove OCR artifacts from text?

Submitted by ⅰ亾dé卋堺 on 2020-01-13 11:29:10
Question: OCR-generated texts sometimes come with artifacts, such as this one: Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem N a c h f o l g e r ö f f n e t , ist m i t d e m Messiasgeheimnis gemeint While it is not unusual that spacing between letters is used for emphasis (probably due to early printing-press limitations), it is unfavorable for retrieval tasks. How can one turn the above text into a more, say, canonical form, like: Diese grundsätzliche Verborgenheit Gottes, die sich nur dem Nachfolger öffnet, ist mit dem Messiasgeheimnis gemeint
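
A rough sketch of the first step only (the pattern is an assumption, not the accepted answer): collapse runs of single letters separated by single spaces back into one word. Short words that get fused this way (e.g. "m i t d e m" becomes "mitdem" rather than "mit dem") would still need a dictionary- or language-model-based word segmentation pass afterwards.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DespaceOcr {
    // A run of single letters separated by single spaces, not glued to a longer word on either side.
    private static final Pattern SPACED_RUN =
            Pattern.compile("(?<!\\p{L})\\p{L}( \\p{L})+(?!\\p{L})");

    public static String collapse(String text) {
        Matcher m = SPACED_RUN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Rejoin the spaced-out letters of each matched run.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group().replace(" ", "")));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("Diese grundsätzliche V e r b o r g e n h e i t Gottes"));
        // -> Diese grundsätzliche Verborgenheit Gottes
    }
}
```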

How to select stop words using tf-idf? (non-English corpus)

Submitted by 戏子无情 on 2020-01-11 20:01:10
Question: I have managed to evaluate the tf-idf function for a given corpus. How can I find the stop words and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document. Answer 1: Stop words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter out those that appear in most of them. …
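
A sketch of that heuristic (the toy corpus and the 0.8 cutoff are assumptions): compute each term's document frequency and flag as stop-word candidates the terms occurring in more than a chosen fraction of the documents; among the remaining terms, the highest tf-idf scores per document point to the "best" words for that document.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StopWordsByDf {
    public static void main(String[] args) {
        List<String> documents = List.of(
                "la casa es grande",
                "la casa es pequeña",
                "el perro corre en la calle");
        double threshold = 0.8;   // assumed cutoff: present in more than 80% of documents

        // Document frequency: in how many documents does each term occur at least once?
        Map<String, Integer> df = new HashMap<>();
        for (String doc : documents) {
            Set<String> seen = new HashSet<>(List.of(doc.toLowerCase().split("\\s+")));
            for (String term : seen) df.merge(term, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            double fraction = (double) e.getValue() / documents.size();
            if (fraction > threshold) {
                System.out.println("stop word candidate: " + e.getKey()
                        + " (in " + e.getValue() + "/" + documents.size() + " documents)");
            }
        }
    }
}
```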