n-gram | 易学教程

Quick implementation of character n-grams for word


Question: I wrote the following code for computing character bigrams; the output is shown right below it. How do I get output that excludes the trailing single character (i.e. 't')? And is there a quicker, more efficient method for computing character n-grams?

>>> b = 'student'
>>> y = []
>>> for x in range(len(b)):
...     n = b[x:x+2]
...     y.append(n)
...
>>> y
['st', 'tu', 'ud', 'de', 'en', 'nt', 't']

Here is the result I would like to get: ['st','tu','ud','de','nt']

Thanks in advance for your suggestions.
Answer 1: To generate
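The stray 't' appears because the loop runs over every index, so the last slice is shorter than two characters. A minimal sketch of one way to avoid that, assuming a hypothetical helper name `char_ngrams` (not taken from the original answer), is to stop the range at `len(word) - n + 1`:

```python
def char_ngrams(word, n=2):
    """Return all character n-grams of `word` that are exactly n characters long."""
    # Stopping at len(word) - n + 1 guarantees every slice has n characters,
    # so no short fragment is produced at the end of the word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("student"))     # ['st', 'tu', 'ud', 'de', 'en', 'nt']
print(char_ngrams("student", 3))  # ['stu', 'tud', 'ude', 'den', 'ent']
```

If `n` is larger than the word, the range is empty and the function returns an empty list rather than a partial n-gram.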

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python


Question: I need to compare documents stored in a DB and come up with a similarity score between 0 and 1. The method has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with a simple implementation of tf-idf and cosine similarity. Is there any program that can do this, or should I start writing it from scratch?
Answer 1: Check out the NLTK package: http://www.nltk.org. It has everything you need. For the cosine_similarity
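For readers who want to see the moving parts without a library, here is a rough from-scratch sketch using only the standard library. The function names, the whitespace tokenizer, and the smoothed idf formula are assumptions chosen for this illustration; they are not taken from the NLTK answer.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Split text on whitespace and return its word n-grams as strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n=1):
    """Return one sparse {ngram: weight} dict per document."""
    counts = [Counter(word_ngrams(doc, n)) for doc in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # count each n-gram once per document
    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        # Assumed smoothed idf: log((1 + N) / (1 + df)) + 1, which stays positive.
        vectors.append({
            term: (freq / total) * (math.log((1 + num_docs) / (1 + doc_freq[term])) + 1)
            for term, freq in c.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["the cat sat on the mat", "the cat sat on the sofa", "dogs bark loudly"]
vecs = tfidf_vectors(docs, n=2)
print(cosine_similarity(vecs[0], vecs[1]))
print(cosine_similarity(vecs[0], vecs[2]))
```

Because every weight is non-negative, the score falls between 0 and 1, which matches the range asked for in the question.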

Computing N Grams using Python


Question: I needed to compute the unigrams, bigrams and trigrams for a text file containing text like: "Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine." I started in Python and used
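One possible way to count unigrams, bigrams and trigrams with the standard library is sketched below. The helper name `ngram_counts` and the simple regex tokenizer are assumptions made for this example; note that this tokenizer splits a number such as 30,000 into two tokens.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams (as tuples) in `text`."""
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # zip over n staggered copies of the token list yields each window of n tokens.
    return Counter(zip(*(tokens[i:] for i in range(n))))


text = ("Inhaling the mists of salt water can reduce the pus and infection "
        "that fills the airways of cystic fibrosis sufferers")

for n in (1, 2, 3):
    print(n, ngram_counts(text, n).most_common(3))
```

Reading the input from a file only requires replacing `text` with the result of `open(path).read()`, where `path` is whatever file holds the text.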

Filename search with ElasticSearch


Question: I want to use ElasticSearch to search filenames (not the file's content), so I need to match part of a filename (exact match, no fuzzy search). Example: I have files with the following names:

My_first_file_created_at_2012.01.13.doc
My_second_file_created_at_2012.01.13.pdf
Another file.txt
And_again_another_file.docx
foo.bar.txt

Now I want to search for 2012.01.13 to get the first two files. A search for file or ile should return all filenames except the last one. How can I
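The idea behind n-gram based partial matching can be illustrated in plain Python. This is only a toy model of the concept, not ElasticSearch configuration: each filename is broken into all character n-grams between an assumed minimum and maximum length, and a query matches when it equals one of those n-grams. The function names and the gram sizes are hypothetical.

```python
from collections import defaultdict


def build_ngram_index(names, min_gram=2, max_gram=10):
    """Map every character n-gram (lowercased) to the filenames containing it."""
    index = defaultdict(set)
    for name in names:
        lowered = name.lower()
        for size in range(min_gram, max_gram + 1):
            for start in range(len(lowered) - size + 1):
                index[lowered[start:start + size]].add(name)
    return index


def search(index, query):
    """Exact lookup of the whole query as a single n-gram term."""
    return sorted(index.get(query.lower(), set()))


files = [
    "My_first_file_created_at_2012.01.13.doc",
    "My_second_file_created_at_2012.01.13.pdf",
    "Another file.txt",
    "And_again_another_file.docx",
    "foo.bar.txt",
]
index = build_ngram_index(files)
print(search(index, "2012.01.13"))
print(search(index, "ile"))
```

The trade-off visible even in this toy version is index size: every filename contributes many terms, and queries longer than `max_gram` or shorter than `min_gram` cannot match at all.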

N-gram generation from a sentence


Question: How can I generate the n-grams of a string such as String Input="This is my car."? For this input with n-gram size = 3, the output should be: This, is, my, car, This is, is my, my car, This is my, is my car. Please give some idea of how to implement that in Java, or whether a library is available for it. I am trying to use NGramTokenizer, but it gives n-grams of character sequences and I want n-grams of word sequences.
Answer 1: You are looking for ShingleFilter. Update: The link points to
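To make the expected output concrete, here is a small sketch of word-level n-grams (often called shingles) of every size from 1 up to a maximum. It is written in Python to stay consistent with the other examples on this page rather than in Java, and it is not how ShingleFilter is implemented; the function name and the punctuation-stripping rule are assumptions.

```python
def word_shingles(sentence, max_size):
    """Return word n-grams of sizes 1..max_size, smallest size first."""
    # Assumed tokenizer: split on whitespace and trim common punctuation.
    tokens = [t.strip(".,!?;:") for t in sentence.split()]
    tokens = [t for t in tokens if t]
    result = []
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            result.append(" ".join(tokens[start:start + size]))
    return result


print(word_shingles("This is my car.", 3))
# ['This', 'is', 'my', 'car', 'This is', 'is my', 'my car', 'This is my', 'is my car']
```

The same two nested loops translate directly to Java with a `List<String>` and `String.join`.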

n-grams in python, four, five, six grams?


Question: I'm looking for a way to split a text into n-grams. Normally I would do something like:

import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print(list(string_bigrams))

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams? Thanks!
Answer 1: Great native Python answers have been given by other users, but here's the nltk approach (just in case, the OP
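NLTK does provide a general `nltk.util.ngrams(sequence, n)` helper for arbitrary `n`. For cases where installing NLTK is not an option, the sketch below shows one library-free way to get n-grams of any size from any iterable using a sliding window. The name `sliding_ngrams` is hypothetical, and the whitespace split is an assumed tokenizer.

```python
from collections import deque


def sliding_ngrams(sequence, n):
    """Yield successive n-grams (as tuples) from any iterable."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A deque with maxlen=n drops the oldest item automatically,
    # so it always holds the most recent n items.
    window = deque(maxlen=n)
    for item in sequence:
        window.append(item)
        if len(window) == n:
            yield tuple(window)


words = "I really like python, it's pretty awesome.".split()
for gram in sliding_ngrams(words, 4):
    print(gram)
```

Passing a list of words gives word n-grams, while passing a plain string gives character n-grams, since iterating a string yields one character at a time.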