n-gram

Python: Reducing memory usage of dictionary

Submitted by 三世轮回 on 2019-11-27 02:51:26
I'm trying to load a couple of files into memory. Each file has one of the following three formats: string TAB int, string TAB float, or int TAB float. They are n-gram statistics files, in case that helps with the solution. For instance:

    i_love TAB 10
    love_you TAB 12

Currently, the pseudocode of what I'm doing right now is:

    def loadData(file):
        data = {}
        for line in file:
            first, second = line.split('\t')
            data[first] = int(second)  # or float(second)
        return data

Much to my surprise, while the total size of the files on disk is about 21 MB, the process takes 120-180 MB of memory once they are loaded.
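
Not from the thread itself, but one common workaround for this kind of dict overhead is to keep the data in parallel sorted lists and look values up with binary search. The sketch below assumes string keys and integer values; the names (load_compact, lookup) are made up for illustration.

    # Hypothetical sketch: trade dict convenience for lower per-entry overhead
    # by storing keys and values in parallel sorted lists and using bisect.
    import bisect

    def load_compact(path):
        keys, values = [], []
        with open(path) as fh:
            for line in fh:
                first, second = line.rstrip('\n').split('\t')
                keys.append(first)
                values.append(int(second))  # or float(second)
        # sort both lists by key so bisect can be used for lookups
        order = sorted(range(len(keys)), key=keys.__getitem__)
        return [keys[i] for i in order], [values[i] for i in order]

    def lookup(keys, values, key):
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i]
        raise KeyError(key)

Whether this saves enough memory depends on the data; it mainly removes the hash-table and per-entry pointer overhead while keeping O(log n) lookups.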

Fast/Optimize N-gram implementations in python

Submitted by 被刻印的时光 ゝ on 2019-11-27 02:05:48
Which n-gram implementation is fastest in Python? I've tried to profile nltk's against Scott's zip-based one ( http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/ ):

    from nltk.util import ngrams as nltkngram
    import this, time

    def zipngram(text, n=2):
        return zip(*[text.split()[i:] for i in range(n)])

    text = this.s

    start = time.time()
    nltkngram(text.split(), n=2)
    print time.time() - start

    start = time.time()
    zipngram(text, n=2)
    print time.time() - start

    [out]
    0.000213146209717
    6.50882720947e-05

Is there any faster implementation for generating n-grams in Python? Some attempts with some …
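
One easy micro-optimisation, offered as my own sketch rather than anything from the thread: zipngram re-splits the text n times inside the list comprehension, so splitting once and slicing the resulting token list should be at least as fast and gives the same output.

    # Sketch: split the text once, then build n-gram tuples from token slices.
    def zipngram_split_once(text, n=2):
        tokens = text.split()
        return zip(*[tokens[i:] for i in range(n)])

    # list(zipngram_split_once("the quick brown fox", 2))
    # -> [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]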

Elasticsearch: Find substring match

Submitted by 早过忘川 on 2019-11-27 00:13:36
Question: I want to perform both exact word match and partial word/substring match. For example, if I search for "men's shaver" then I should be able to find "men's shaver" in the results. But if I search for "en's shaver", I should also be able to find "men's shaver" in the results. I am using the following settings and mappings:

Index settings:

    PUT /my_index
    {
      "settings": {
        "number_of_shards": 1,
        "analysis": {
          "filter": {
            "autocomplete_filter": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": …
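
As intuition only (my own sketch, not from the question or its answers): an edge_ngram filter emits only prefixes of each token, so a fragment like "en's" is never indexed, whereas a plain character n-gram filter emits every substring window, which is what substring search needs. The helper names below are made up.

    # Illustrative only: edge n-grams vs. plain character n-grams of one token.
    def edge_ngrams(token, min_gram=1, max_gram=8):
        # prefixes only, as an edge_ngram filter would produce
        return [token[:i] for i in range(min_gram, min(max_gram, len(token)) + 1)]

    def char_ngrams(token, min_gram=2, max_gram=8):
        # every substring window, as a plain ngram filter would produce
        out = []
        for size in range(min_gram, max_gram + 1):
            out.extend(token[i:i + size] for i in range(len(token) - size + 1))
        return out

    print(edge_ngrams("shaver"))             # ['s', 'sh', 'sha', ...] but never 'haver'
    print("haver" in char_ngrams("shaver"))  # True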

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

Submitted by 自古美人都是妖i on 2019-11-26 23:20:43
I need to compare documents stored in a DB and come up with a similarity score between 0 and 1. The method I use has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), together with simple implementations of tf-idf and cosine similarity. Is there any program that can do this, or should I start writing it from scratch?

Check out the NLTK package: http://www.nltk.org — it has everything you need. For the cosine similarity:

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is equal to u …
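
Since the excerpt stops mid-docstring, here is a from-scratch sketch of the whole pipeline the question describes (word n-grams as terms, tf-idf weighting, cosine similarity). It is my own illustration, not the NLTK code the answer refers to, and every name in it is made up.

    # Sketch: n-gram tf-idf vectors and cosine similarity, standard library only.
    import math
    from collections import Counter

    def word_ngrams(text, n=1):
        tokens = text.lower().split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def tfidf_vectors(docs, n=1):
        tfs = [Counter(word_ngrams(d, n)) for d in docs]
        df = Counter(term for tf in tfs for term in tf)
        N = len(docs)
        return [{t: c * math.log(N / df[t]) for t, c in tf.items()} for tf in tfs]

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    docs = ["the cat sat on the mat", "the cat sat", "dogs chase cats"]
    vecs = tfidf_vectors(docs, n=1)
    print(cosine(vecs[0], vecs[1]))  # in [0, 1] since the weights are non-negative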

Hibernate Search | ngram analyzer with minGramSize 1

Submitted by 强颜欢笑 on 2019-11-26 22:28:32
Question: I have some problems with my Hibernate Search analyzer configuration. One of my indexed entities ("Hospital") has a String field ("name") that can contain values with lengths from 1 to 40. I want to be able to find an entity by searching for just one character (because it is possible for a hospital to have a single-character name).

    @Indexed(index = "HospitalIndex")
    @AnalyzerDef(name = "ngram",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
            @TokenFilterDef …

Computing N Grams using Python

Submitted by 房东的猫 on 2019-11-26 22:02:40
I needed to compute the unigrams, bigrams and trigrams for a text file containing text like:

    "Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started in Python and used the following code:

    #!/usr/bin/env python
    # File: n-gram.py

    def N_Gram(N, text):
        NList = []  # start with an …
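
Because the excerpt cuts the asker's N_Gram function off mid-line, here is a self-contained sketch of my own (not the original code) that computes word unigrams, bigrams and trigrams from a string:

    # Sketch: word-level n-grams of a text, returned as lists of phrase strings.
    def n_grams(text, n):
        words = text.split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    sample = "Cystic fibrosis affects 30,000 children and young adults"
    unigrams = n_grams(sample, 1)
    bigrams = n_grams(sample, 2)
    trigrams = n_grams(sample, 3)
    print(trigrams[0])  # "Cystic fibrosis affects"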

Filename search with ElasticSearch

Submitted by 若如初见. on 2019-11-26 19:30:48
I want to use ElasticSearch to search filenames (not the files' content). Therefore I need to find a part of the filename (exact match, no fuzzy search).

Example: I have files with the following names:

    My_first_file_created_at_2012.01.13.doc
    My_second_file_created_at_2012.01.13.pdf
    Another file.txt
    And_again_another_file.docx
    foo.bar.txt

Now I want to search for 2012.01.13 to get the first two files. A search for file or ile should return all filenames except the last one. How can I accomplish that with ElasticSearch?

This is what I have tested, but it always returns zero results:

    curl -X …
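
To show why a character n-gram analyzer is the usual answer here, this is a toy in-memory sketch of mine (plain Python, not an Elasticsearch API call): index every character trigram of each filename, then answer a query by intersecting the filename sets of the query's trigrams.

    # Toy trigram index that mimics what an ngram analyzer does for filenames.
    from collections import defaultdict

    def trigrams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}

    files = [
        "My_first_file_created_at_2012.01.13.doc",
        "My_second_file_created_at_2012.01.13.pdf",
        "Another file.txt",
        "And_again_another_file.docx",
        "foo.bar.txt",
    ]

    index = defaultdict(set)
    for name in files:
        for gram in trigrams(name):
            index[gram].add(name)

    def search(query):
        hits = set(files)
        for gram in trigrams(query):  # queries shorter than 3 chars are not handled here
            hits &= index.get(gram, set())
        return sorted(hits)

    print(search("2012.01.13"))  # the two files from 2012.01.13
    print(search("ile"))         # every filename except foo.bar.txt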

N-gram generation from a sentence

Submitted by こ雲淡風輕ζ on 2019-11-26 18:53:50
How to generate n-grams of a string like:

    String Input = "This is my car.";

I want to generate n-grams from this input with:

    Input Ngram size = 3

Output should be:

    This
    is
    my
    car
    This is
    is my
    my car
    This is my
    is my car

Give some idea in Java of how to implement that, or whether any library is available for it. I am trying to use this NGramTokenizer, but it gives n-grams of character sequences and I want n-grams of word sequences.

You are looking for ShingleFilter.

Update: The link points to version 3.0.2. This class may be in a different package in newer versions of Lucene.

I believe this would do what you …

n-grams in python, four, five, six grams?

Submitted by 落爺英雄遲暮 on 2019-11-26 16:59:47
I'm looking for a way to split a text into n-grams. Normally I would do something like:

    import nltk
    from nltk import bigrams
    string = "I really like python, it's pretty awesome."
    string_bigrams = bigrams(string)
    print string_bigrams

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams? Thanks!

alvas: Great native Python-based answers given by other users. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library). There is an ngram module that …
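
For completeness, a short sketch of the nltk route the answer is pointing at: nltk.util.ngrams takes any token sequence and any n, so four-grams and five-grams work the same way as bigrams. This assumes nltk is installed; tokenization is a plain split() to keep the example self-contained.

    # Sketch: arbitrary-length word n-grams via nltk.util.ngrams.
    from nltk.util import ngrams

    sentence = "I really like python, it's pretty awesome."
    tokens = sentence.split()

    fourgrams = list(ngrams(tokens, 4))
    fivegrams = list(ngrams(tokens, 5))
    print(fourgrams[0])  # ('I', 'really', 'like', 'python,')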

Fast n-gram calculation

Submitted by 淺唱寂寞╮ on 2019-11-26 15:32:47
Question: I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed that calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?

Answer 1: Since you didn't indicate whether you want word- or character-level n-grams, I'm just going to assume the former, without …
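
As a rough illustration of the "drop NLTK for this one task" option (my own sketch, not part of the answer): counting word n-grams over a token list needs nothing more than zip and collections.Counter.

    # Sketch: counting word n-grams over a corpus with plain zip + Counter.
    from collections import Counter

    def count_ngrams(tokens, n):
        return Counter(zip(*[tokens[i:] for i in range(n)]))

    tokens = "the cat sat on the mat the cat slept".split()
    bigram_counts = count_ngrams(tokens, 2)
    print(bigram_counts[("the", "cat")])  # 2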