n-gram

n-gram name analysis in non-English languages (CJK, etc)

て烟熏妆下的殇ゞ submitted on 2019-12-05 01:24:37
Question: I'm working on deduping a database of people. For a first pass, I'm following a basic two-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": I iterate over the whole dataset and bin each record based on the n-grams AND initials present in the name. Second, all the records in each bin are compared using Jaro-Winkler to get a measure of the likelihood that they represent the same person. My problem: the names are Unicode. Some (though not
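
A minimal sketch of that blocking-plus-comparison pipeline in Python is shown below. The record layout, the helper names, and the use of the third-party jellyfish library for Jaro-Winkler are my own assumptions, not the poster's code (and the Jaro-Winkler function name varies slightly between jellyfish versions).

from collections import defaultdict
from itertools import combinations

import jellyfish  # assumed dependency; provides jaro_winkler_similarity in recent versions

def name_ngrams(name, n=3):
    # Character n-grams of the lower-cased name with whitespace removed.
    s = "".join(name.lower().split())
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def initials(name):
    return {part[0].lower() for part in name.split() if part}

def block(records):
    # records: iterable of (record_id, name) pairs; bins keyed by n-grams and initials.
    bins = defaultdict(set)
    for rec_id, name in records:
        for key in name_ngrams(name) | initials(name):
            bins[key].add((rec_id, name))
    return bins

def likely_duplicates(records, threshold=0.9):
    # Compare only records that share a bin; a pair may be yielded more than
    # once if it shares several bins.
    for members in block(records).values():
        for (id1, n1), (id2, n2) in combinations(sorted(members), 2):
            if jellyfish.jaro_winkler_similarity(n1, n2) >= threshold:
                yield id1, id2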

Can Drupal's search module search for a substring? (Partial Search)

岁酱吖の submitted on 2019-12-04 17:49:44
Question: Drupal's core search module only searches for whole keywords, e.g. "sandwich". Can I make it search for a substring, e.g. "sandw", and still return my sandwich results? Maybe there is a plugin that does that? Answer 1: The most direct module for this is probably Fuzzy Search. I have not tried it. If you have more advanced search needs on a small to medium sized site, Search Lucene API is a fine solution. For a larger site, or truly advanced needs, Solr is the premier solution. Answer 2: Recently I made a patch for

Compute ngrams for each row of text data in R

六眼飞鱼酱① submitted on 2019-12-04 17:47:59
I have a data column of the following format: Text Hello world Hello How are you today I love stackoverflow blah blah blahdy I would like to compute the 3-grams for each row in this dataset, perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the n-grams for the entire column. How can I apply this function to each observation in my data separately? Is this what you're after?

library("RWeka")
library("tm")
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Using Tyler's method of making the 'Text'
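
The question is about R (tau/RWeka), so the following is only a cross-language illustration of the per-row idea rather than the accepted answer: count n-grams separately for each observation instead of once for the whole column. The sample rows are stand-ins, not necessarily the exact rows from the question.

from collections import Counter

rows = ["Hello world", "Hello How are you today",
        "I love stackoverflow", "blah blah blahdy"]   # stand-in for the Text column

def word_ngrams(text, n=3):
    tokens = text.split()
    return list(zip(*(tokens[i:] for i in range(n))))

# One Counter per row, instead of one vector for the entire column.
per_row_counts = [Counter(word_ngrams(row)) for row in rows]
for row, counts in zip(rows, per_row_counts):
    print(row, dict(counts))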

Effective 1-5 grams extraction with python

本秂侑毒 submitted on 2019-12-04 17:26:21
Question: I have a huge file of 3,000,000 lines, and each line has 20-40 words. I have to extract 1- to 5-grams from the corpus. My input files are tokenized plain text, e.g.:

This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .

Currently, I am doing it as below, but this doesn't seem to be an efficient way to extract the 1-5 grams:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io, os
from collections import Counter
import sys; reload(sys); sys
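
For comparison, a common single-pass way to collect 1- to 5-grams with plain zip and Counter in Python 3 is sketched below; the input path is hypothetical and this is not the poster's script.

from collections import Counter

def ngrams(tokens, n):
    # Lazy n-grams over an already-tokenized list of words.
    return zip(*(tokens[i:] for i in range(n)))

counts = {n: Counter() for n in range(1, 6)}
with open("corpus.tok.txt", encoding="utf-8") as fh:   # hypothetical input file
    for line in fh:
        tokens = line.split()
        for n in range(1, 6):
            counts[n].update(ngrams(tokens, n))

for n in range(1, 6):
    print(n, counts[n].most_common(3))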

Is there a bigram or trigram feature in spaCy?

此生再无相见时 submitted on 2019-12-04 10:08:50
The code below breaks the sentence into individual tokens, and the output is as shown below:

"cloud" "computing" "is" "benefiting" "major" "manufacturing" "companies"

import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)

What I would ideally want is to read 'cloud computing' together, as it is technically one word. Basically, I am looking for a bigram. Is there any feature in spaCy that allows bigrams or trigrams? spaCy allows detection of noun chunks. So to parse your noun phrases as single
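
A minimal sketch of the noun-chunk idea hinted at in the truncated answer, assuming the en_core_web_sm model is installed: spaCy's retokenizer can merge each noun chunk into a single token so that "Cloud computing" comes out as one unit.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Cloud computing is benefiting major manufacturing companies")

# Merge every noun chunk into a single token.
with doc.retokenize() as retokenizer:
    for chunk in doc.noun_chunks:
        retokenizer.merge(chunk)

print([token.text for token in doc])
# e.g. ['Cloud computing', 'is', 'benefiting', 'major manufacturing companies']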

Ngram model and perplexity in NLTK

偶尔善良 submitted on 2019-12-04 08:15:28
Question: To put my question in context, I would like to train and test/compare several (neural) language models. In order to focus on the models rather than on data preparation, I chose to use the Brown corpus from NLTK and train the Ngrams model shipped with NLTK as a baseline (to compare the other LMs against). So my first question is actually about a behaviour of NLTK's Ngram model that I find suspicious. Since the code is rather short, I pasted it here:

import nltk
print "... build"
brown = nltk
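
The pasted code targets the old NgramModel (note the Python 2 print). As a rough equivalent on current NLTK (3.4+), where the n-gram models live in nltk.lm, one might write something like the sketch below; the corpus slice and the Laplace smoothing are my own choices, made to keep the example fast and to avoid infinite perplexity on unseen n-grams, not what the original post did.

import nltk
from nltk.corpus import brown
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

nltk.download("brown", quiet=True)

order = 3
sents = [[w.lower() for w in sent] for sent in brown.sents()[:5000]]  # small slice for speed
train_ngrams, vocab = padded_everygram_pipeline(order, sents)

lm = Laplace(order)
lm.fit(train_ngrams, vocab)

test_sent = ["the", "jury", "said", "it", "was", "pleased"]
test_ngrams = list(nltk.ngrams(test_sent, order))
print(lm.perplexity(test_ngrams))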

grouping all Named entities in a Document

主宰稳场 submitted on 2019-12-04 06:16:51
Question: I would like to group all named entities in a given document. For example: **Barack Hussein Obama** II is the 44th and current President of the United States, and the first African American to hold the office. I do not want to use the OpenNLP APIs, as they might not be able to recognize all named entities. Is there any way to generate such n-grams using other services, or maybe a way to group all noun terms together? Answer 1: If you want to avoid using NER, you could use a sentence chunker or parser.
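
Following the chunker suggestion, a hedged sketch with NLTK's regular-expression NP chunker (the grammar is deliberately simple, and the resource names can vary slightly across NLTK versions):

import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

sentence = ("Barack Hussein Obama II is the 44th and current President of the "
            "United States, and the first African American to hold the office.")

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# Very simple grammar: an NP is an optional determiner, any adjectives, then nouns.
grammar = "NP: {<DT>?<JJ.*>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees() if subtree.label() == "NP"]
print(phrases)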

Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

流过昼夜 submitted on 2019-12-04 05:16:22
When I use an analyzer with edge n-grams (min=3, max=7, front) plus term_vector=with_positions_offsets, with a document whose text is "CouchDB", and I search for "couc", my highlight is on "cou" and not "couc". It seems my highlight covers only the minimum matching token "cou", while I would expect it to cover the exact token (if possible) or at least the longest matching token found. It works fine when the text is analyzed without term_vector=with_positions_offsets. What's the impact on performance of removing term_vector=with_positions_offsets? When you set term_vector=with_positions_offsets for a specific field
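
For reference, an index configured the way the question describes would look roughly like the sketch below, written as a Python dict that could be passed to the Elasticsearch create-index call. The index name, the search_analyzer choice, and the analyzer wiring are my assumptions; the legacy "front"-side setting is expressed with the current edge_ngram token filter.

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "front_edge_ngrams": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 7,
                }
            },
            "analyzer": {
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "front_edge_ngrams"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "edge_ngram_analyzer",
                "search_analyzer": "standard",
                "term_vector": "with_positions_offsets",
            }
        }
    },
}
# e.g. with the official client: es.indices.create(index="my_index", body=index_body)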

“Anagram solver” based on statistics rather than a dictionary/table?

不打扰是莪最后的温柔 submitted on 2019-12-03 19:11:19
Question: My problem is conceptually similar to solving anagrams, except I can't just use a dictionary lookup. I am trying to find plausible words rather than real words. I have created an N-gram model (for now, N=2) based on the letters in a bunch of text. Now, given a random sequence of letters, I would like to permute them into the most likely sequence according to the transition probabilities. I thought I would need the Viterbi algorithm when I started this, but as I look deeper, the Viterbi
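
Before reaching for Viterbi, the brute-force version of the idea is easy to sketch: train letter-bigram counts on some text, then score every permutation of the input letters with add-alpha smoothing and keep the best one. All names below are illustrative, and the permutation loop is only feasible for short letter sequences.

import math
from collections import Counter
from itertools import permutations

def train_bigrams(text):
    # Letter unigram/bigram counts with ^ and $ as word-boundary markers.
    words = "".join(ch if ch.isalpha() or ch == " " else " " for ch in text.lower()).split()
    unigrams, bigrams = Counter(), Counter()
    for word in words:
        padded = "^" + word + "$"
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(word, unigrams, bigrams, alpha=0.5):
    # Sum of smoothed log transition probabilities for the candidate word.
    padded = "^" + word + "$"
    vocab = len(unigrams) + 1
    return sum(math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
               for a, b in zip(padded, padded[1:]))

def best_arrangement(letters, unigrams, bigrams):
    return max(("".join(p) for p in permutations(letters)),
               key=lambda w: log_prob(w, unigrams, bigrams))

unigrams, bigrams = train_bigrams("a reasonably large body of english training text goes here")
print(best_arrangement("lepap", unigrams, bigrams))  # prints whichever ordering the model likes best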

n-gram name analysis in non-English languages (CJK, etc)

不羁岁月 submitted on 2019-12-03 16:33:34
I'm working on deduping a database of people. For a first pass, I'm following a basic two-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": I iterate over the whole dataset and bin each record based on the n-grams AND initials present in the name. Second, all the records in each bin are compared using Jaro-Winkler to get a measure of the likelihood that they represent the same person. My problem: the names are Unicode. Some (though not many) of these names are in CJK (Chinese-Japanese-Korean) languages. I have no idea how to find word
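
One common workaround when there is nothing to split on (as with CJK names written without spaces) is to fall back to character n-grams, since individual CJK characters already carry word-like information. A minimal sketch, with a made-up example name:

def char_ngrams(name, n=2):
    # Character n-grams; no word segmentation required.
    s = "".join(name.split())            # drop whitespace if present
    if len(s) <= n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}

print(char_ngrams("田中太郎"))    # the set {'田中', '中太', '太郎'} (order may vary)
print(char_ngrams("Jane Smith"))  # Latin-script names go through the same code path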