n-gram

Elasticsearch: Find substring match

风格不统一 submitted on 2019-11-28 04:05:51
I want to perform both exact word match and partial word/substring match. For example, if I search for "men's shaver" I should be able to find "men's shaver" in the result. But if I search for "en's shaver", I should also be able to find "men's shaver" in the result. I am using the following settings and mappings: Index settings: PUT /my_index { "settings": { "number_of_shards": 1, "analysis": { "filter": { "autocomplete_filter": { "type": "edge_ngram", "min_gram": 1, "max_gram": 20 } }, "analyzer": { "autocomplete": { "type": "custom", "tokenizer": "standard", "filter": [
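The excerpt above configures an edge_ngram filter, which only indexes prefixes of each token; a plain ngram filter is one common way to get true substring matches such as "en's shaver". A minimal sketch, assuming Elasticsearch 7+ reachable at localhost:9200 without security; the index, filter and analyzer names here are made up for illustration:

```python
# Sketch only: an "ngram" token filter (not "edge_ngram") emits substrings
# from every position, so a query like "en's shaver" can hit "men's shaver".
import json
import requests  # assumes an Elasticsearch node at localhost:9200, no auth

settings = {
    "settings": {
        "number_of_shards": 1,
        "max_ngram_diff": 18,  # ES 7+ requires this when max_gram - min_gram > 1
        "analysis": {
            "filter": {
                "substring_filter": {"type": "ngram", "min_gram": 2, "max_gram": 20}
            },
            "analyzer": {
                "substring_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "substring_filter"],
                }
            },
        },
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "substring_analyzer",   # index-time ngrams
                "search_analyzer": "standard",      # query terms left whole
            }
        }
    },
}

resp = requests.put("http://localhost:9200/my_index", json=settings)
print(resp.status_code, json.dumps(resp.json(), indent=2))
```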

Find all two-word phrases that appear in more than one row in a dataset

笑着哭i submitted on 2019-11-28 01:54:19
Question: We would like to run a query that returns two-word phrases that appear in more than one row. For example, take the string "Data Ninja": since it appears in more than one row in our dataset, the query should return it. The query should find all such phrases across all the rows in our dataset, by looking for two adjacent word combinations (each forming a phrase) in the rows of the dataset. These two adjacent word combinations should come from the dataset we loaded into BigQuery. How can we
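The question is ultimately about BigQuery SQL, but the logic can be illustrated locally: build every adjacent two-word phrase per row, then keep the phrases seen in more than one row. A small Python sketch with invented rows:

```python
# A local sketch of the logic only (not the BigQuery SQL itself).
from collections import defaultdict

rows = [
    "Data Ninja at a stealth startup",
    "Aspiring Data Ninja and analyst",
    "Software engineer and data plumber",
]

rows_per_phrase = defaultdict(set)
for row_id, text in enumerate(rows):
    words = text.lower().split()
    for first, second in zip(words, words[1:]):     # adjacent word pairs
        rows_per_phrase[f"{first} {second}"].add(row_id)

repeated = [p for p, ids in rows_per_phrase.items() if len(ids) > 1]
print(repeated)  # ['data ninja']
```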

Hibernate Search | ngram analyzer with minGramSize 1

被刻印的时光 ゝ submitted on 2019-11-27 16:26:42
I have some problems with my Hibernate Search analyzer configuration. One of my indexed entities ("Hospital") has a String field ("name") that can contain values with lengths from 1 to 40. I want to be able to find an entity by searching for just one character (because it is possible that a hospital has a single-character name). @Indexed(index = "HospitalIndex") @AnalyzerDef(name = "ngram", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = StandardFilterFactory.class), @TokenFilterDef(factory = LowerCaseFilterFactory.class),
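The Java analyzer definition is cut off above. To see what an n-gram filter with minGramSize 1 would index for a hospital name (so that a one-character query can still match), here is a rough Python illustration of the token output only, not the Lucene/Hibernate Search code itself; maxGramSize 3 is an assumed value for the example:

```python
# Mimics the character n-grams an analyzer with minGramSize=1 would index.
def char_ngrams(text: str, min_gram: int = 1, max_gram: int = 3) -> list[str]:
    text = text.lower()
    grams = []
    for size in range(min_gram, max_gram + 1):
        grams.extend(text[i:i + size] for i in range(len(text) - size + 1))
    return grams

print(char_ngrams("K"))      # ['k'] -> a one-letter name is still searchable
print(char_ngrams("Mayo"))   # ['m', 'a', 'y', 'o', 'ma', 'ay', 'yo', 'may', 'ayo']
```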

n-grams with Naive Bayes classifier

跟風遠走 submitted on 2019-11-27 13:16:21
Question: I'm new to Python and need help! I was practicing with Python NLTK text classification. Here is the code example I am practicing on: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/ I've tried this one: from nltk import bigrams from nltk.probability import ELEProbDist, FreqDist from nltk import NaiveBayesClassifier from collections import defaultdict train_samples = {} with file ('positive.txt', 'rt') as f: for line in f.readlines(): train_samples[line]='pos'
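A minimal sketch of plugging bigram features into NLTK's NaiveBayesClassifier, using a tiny invented training set in place of the tutorial's positive.txt and negative.txt files:

```python
# Bigram features for NLTK's Naive Bayes classifier; the training sentences
# below are made up for illustration.
from nltk import bigrams, NaiveBayesClassifier

train_samples = {
    "i love this phone": "pos",
    "what a great movie": "pos",
    "i hate this phone": "neg",
    "what a terrible movie": "neg",
}

def bigram_features(text):
    # Mark each bigram occurring in the text as a boolean feature.
    return {f"{w1} {w2}": True for w1, w2 in bigrams(text.split())}

train_set = [(bigram_features(text), label) for text, label in train_samples.items()]
classifier = NaiveBayesClassifier.train(train_set)

print(classifier.classify(bigram_features("i love this movie")))  # likely 'pos'
```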

Automatic document summarization evaluation methods: Edmundson, ROUGE

懵懂的女人 submitted on 2019-11-27 12:44:27
Automatic document summarization evaluation methods fall roughly into two categories: (1) Intrinsic methods: a reference summary is provided, and the system summary is evaluated against it; the more closely the system summary matches the reference summary, the higher its quality. (2) Extrinsic methods: no reference summary is provided; instead, the summary is used in place of the original document in some document-related task, such as document retrieval, document clustering, or document classification, and a summary that improves the task's performance is considered a good one. I. Edmundson: The Edmundson method is fairly simple. It can be applied objectively, by evaluating the system summary according to the sentence coselection rate between the machine abstract (the abstract produced by the automatic summarization system) and the target abstract. It can also be applied subjectively, by having experts compare the information contained in the machine abstract and the target abstract and then assign the machine abstract a graded score, for example: completely dissimilar, basically similar, very similar, completely similar. Edmundson's basic unit of comparison is the sentence, i.e. the text units separated by sentence-level punctuation marks ("。", ":", ";", "!", "?"). Experts may only extract sentences from the original text; they may not rewrite sentences based on their own understanding of it. The sentences of both the expert abstract and the machine abstract are listed in the order in which they appear in the original text. The formula is: \[ \text{coselection rate } p = \frac{\text{number of matched sentences}}{\text{number of sentences in the expert abstract}} \times 100\% \] The coselection rate of each machine abstract is, based on the abstracts given by the three experts,
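A small sketch of the coselection-rate calculation described above, with the expert and machine abstracts as invented sentence lists:

```python
# Coselection rate: matched sentences / sentences in the expert abstract x 100%.
def coselection_rate(machine_sentences, expert_sentences):
    matched = len(set(machine_sentences) & set(expert_sentences))
    return matched / len(expert_sentences) * 100

expert = ["s1", "s3", "s7", "s9"]    # sentences picked by one expert
machine = ["s1", "s2", "s3", "s8"]   # sentences picked by the system
print(coselection_rate(machine, expert))  # 2 matches out of 4 -> 50.0
```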

Fast n-gram calculation

断了今生、忘了曾经 submitted on 2019-11-27 11:23:23
I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up? Fred Foo: Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality. I also assume you start with a list of tokens, represented by strings. What you
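The answer is cut off above; a common pure-Python idiom for word-level n-grams over a token list looks like the following sketch (in the spirit of, but not necessarily identical to, the truncated answer):

```python
# Word-level n-grams via zip over staggered views of the token list.
def ngrams(tokens, n):
    # zip stops at the shortest view, so no trailing short grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the quick brown fox jumps".split()
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
```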

Really fast word ngram vectorization in R

拟墨画扇 submitted on 2019-11-27 11:19:28
Question: edit: The new package text2vec is excellent, and solves this problem (and many others) really well. See text2vec on CRAN, text2vec on GitHub, and the vignette that illustrates ngram tokenization. I have a pretty large text dataset in R, which I've imported as a character vector: #Takes about 15 seconds system.time({ set.seed(1) samplefun <- function(n, x, collapse){ paste(sample(x, n, replace=TRUE), collapse=collapse) } words <- sapply(rpois(10000, 3) + 1, samplefun, letters, '') sents1 <- sapply(rpois
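This one is an R question, so the following is only a rough Python analogue of fast word n-gram vectorization, using scikit-learn's CountVectorizer (an assumption of this sketch, not text2vec itself) on a few invented sentences, to show the shape of the result:

```python
# Python analogue only: unigram + bigram document-term matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

sents = [
    "fast ngram vectorization in R",
    "really fast word ngram vectorization",
    "text2vec makes ngram tokenization fast",
]

vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer="word")
dtm = vectorizer.fit_transform(sents)          # sparse document-term matrix

print(dtm.shape)                               # (3 documents, n distinct n-grams)
print(vectorizer.get_feature_names_out()[:5])  # first few n-gram columns
```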

Quick implementation of character n-grams for word

早过忘川 submitted on 2019-11-27 08:38:19
I wrote the following code for computing character bigrams, and the output is right below. My question is: how do I get an output that excludes the last character (i.e. t)? And is there a quicker and more efficient method for computing character n-grams? b='student' >>> y=[] >>> for x in range(len(b)): n=b[x:x+2] y.append(n) >>> y ['st', 'tu', 'ud', 'de', 'en', 'nt', 't'] Here is the result I would like to get: ['st', 'tu', 'ud', 'de', 'en', 'nt'] Thanks in advance for your suggestions. To generate bigrams: In [8]: b='student' In [9]: [b[i:i+2] for i in range(len(b)-1)] Out[9]: ['st', 'tu', 'ud', 'de', 'en'
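Generalizing the one-liner above from bigrams to any n (a small sketch, not part of the original answer):

```python
# Character n-grams of length n; the range stops early enough that no short
# trailing gram like 't' is emitted.
def char_ngrams(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("student", 2))  # ['st', 'tu', 'ud', 'de', 'en', 'nt']
print(char_ngrams("student", 3))  # ['stu', 'tud', 'ude', 'den', 'ent']
```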

How to compute skipgrams in python?

跟風遠走 submitted on 2019-11-27 05:40:06
Question: A k-skipgram is an ngram which is a superset of all ngrams and each (k-i)-skipgram till (k-i)==0 (which includes 0-skip-grams). So how can these skipgrams be computed efficiently in Python? Following is the code I tried, but it is not doing what I expected: input_list = ['all', 'this', 'happened', 'more', 'or', 'less'] def find_skipgrams(input_list, N,K): bigram_list = [] nlist=[] K=1 for k in range(K+1): for i in range(len(input_list)-1): if i+k+1<len(input_list): nlist=[] for j in range(N+1):
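One common way to enumerate k-skip-n-grams is a sliding head position plus itertools.combinations over the window that follows it; the sketch below shows that idea, it is not a fix of the code quoted above. (Recent NLTK releases also provide nltk.util.skipgrams, if your version ships it.)

```python
# k-skip-n-grams: fix a head token, then choose the remaining n-1 tokens from
# the next n-1+k tokens in order, which bounds the total number of skips by k.
from itertools import combinations

def skipgrams(sequence, n, k):
    grams = []
    for i in range(len(sequence) - n + 1):
        head = sequence[i]
        window = sequence[i + 1 : i + n + k]
        for tail in combinations(window, n - 1):
            grams.append((head,) + tail)
    return grams

words = ['all', 'this', 'happened', 'more', 'or', 'less']
print(skipgrams(words, n=2, k=1))
# [('all', 'this'), ('all', 'happened'), ('this', 'happened'), ('this', 'more'), ...]
```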

What algorithm do I need to find n-grams?

邮差的信 submitted on 2019-11-27 04:16:26
Question: What algorithm is used for finding ngrams? Supposing my input data is an array of words and the size of the ngrams I want to find, what algorithm should I use? I'm asking for code, with a preference for R. The data is stored in a database, so it can be a PL/pgSQL function too. Java is a language I know better, so I can "translate" it to another language. I'm not lazy, I'm only asking for code because I don't want to reinvent the wheel trying to implement an algorithm that is already done. Edit: it's
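The underlying algorithm is just a sliding window over the word array; here is a language-agnostic sketch written in Python (the post ultimately wants R or PL/pgSQL, so treat this only as logic to translate):

```python
# Word n-grams by sliding a window of length n over the word array.
def word_ngrams(words, n):
    return [words[i:i + n] for i in range(len(words) - n + 1)]

words = ["to", "be", "or", "not", "to", "be"]
print(word_ngrams(words, 3))
# [['to', 'be', 'or'], ['be', 'or', 'not'], ['or', 'not', 'to'], ['not', 'to', 'be']]
```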