n-gram

n-gram name analysis in non-English languages (CJK, etc)

て烟熏妆下的殇ゞ submitted on 2019-12-05 01:24:37
Question: I'm working on deduping a database of people. For a first pass, I'm following a basic two-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": I iterate over the whole dataset and bin each record based on the n-grams AND initials present in the name. Second, all the records in each bin are compared using Jaro-Winkler to get a measure of the likelihood that they represent the same person. My problem: the names are Unicode. Some (though not
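
A minimal sketch of that blocking-plus-comparison pipeline in Python is shown below. The record layout, the helper names, and the use of the third-party jellyfish library for Jaro-Winkler are my own assumptions, not the poster's code (and the Jaro-Winkler function name varies slightly between jellyfish versions).

from collections import defaultdict
from itertools import combinations

import jellyfish  # assumed dependency; provides jaro_winkler_similarity in recent versions

def name_ngrams(name, n=3):
    # Character n-grams of the lower-cased name with whitespace removed.
    s = "".join(name.lower().split())
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def initials(name):
    return {part[0].lower() for part in name.split() if part}

def block(records):
    # records: iterable of (record_id, name) pairs; bins keyed by n-grams and initials.
    bins = defaultdict(set)
    for rec_id, name in records:
        for key in name_ngrams(name) | initials(name):
            bins[key].add((rec_id, name))
    return bins

def likely_duplicates(records, threshold=0.9):
    # Compare only records that share a bin; a pair may be yielded more than
    # once if it shares several bins.
    for members in block(records).values():
        for (id1, n1), (id2, n2) in combinations(sorted(members), 2):
            if jellyfish.jaro_winkler_similarity(n1, n2) >= threshold:
                yield id1, id2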

Can Drupal's search module search for a substring? (Partial Search)

岁酱吖の submitted on 2019-12-04 17:49:44
Question: Drupal's core search module only searches for whole keywords, e.g. "sandwich". Can I make it search for a substring, e.g. "sandw", and still return my sandwich results? Maybe there is a plugin that does that? Answer 1: The most direct module for this is probably Fuzzy Search. I have not tried it. If you have more advanced search needs on a small to medium sized site, Search Lucene API is a fine solution. For a larger site, or truly advanced needs, Solr is the premier solution. Answer 2: Recently I made a patch for

Compute ngrams for each row of text data in R

六眼飞鱼酱① submitted on 2019-12-04 17:47:59
I have a data column of the following format: Text Hello world Hello How are you today I love stackoverflow blah blah blahdy I would like to compute the 3-grams for each row in this dataset, perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the n-grams for the entire column. How can I apply this function to each observation in my data separately? Is this what you're after?

library("RWeka")
library("tm")
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Using Tyler's method of making the 'Text'
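
The question is about R (tau/RWeka), so the following is only a cross-language illustration of the per-row idea rather than the accepted answer: count n-grams separately for each observation instead of once for the whole column. The sample rows are stand-ins, not necessarily the exact rows from the question.

from collections import Counter

rows = ["Hello world", "Hello How are you today",
        "I love stackoverflow", "blah blah blahdy"]   # stand-in for the Text column

def word_ngrams(text, n=3):
    tokens = text.split()
    return list(zip(*(tokens[i:] for i in range(n))))

# One Counter per row, instead of one vector for the entire column.
per_row_counts = [Counter(word_ngrams(row)) for row in rows]
for row, counts in zip(rows, per_row_counts):
    print(row, dict(counts))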

Effective 1-5 grams extraction with python

本秂侑毒 submitted on 2019-12-04 17:26:21
Question: I have a huge file of 3,000,000 lines, and each line has 20-40 words. I have to extract 1- to 5-grams from the corpus. My input files are tokenized plain text, e.g.:

This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .

Currently, I am doing it as below, but this doesn't seem to be an efficient way to extract the 1-5 grams:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io, os
from collections import Counter
import sys; reload(sys); sys
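
For comparison, a common single-pass way to collect 1- to 5-grams with plain zip and Counter in Python 3 is sketched below; the input path is hypothetical and this is not the poster's script.

from collections import Counter

def ngrams(tokens, n):
    # Lazy n-grams over an already-tokenized list of words.
    return zip(*(tokens[i:] for i in range(n)))

counts = {n: Counter() for n in range(1, 6)}
with open("corpus.tok.txt", encoding="utf-8") as fh:   # hypothetical input file
    for line in fh:
        tokens = line.split()
        for n in range(1, 6):
            counts[n].update(ngrams(tokens, n))

for n in range(1, 6):
    print(n, counts[n].most_common(3))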

Is there a bigram or trigram feature in spaCy?

此生再无相见时 submitted on 2019-12-04 10:08:50
The code below breaks the sentence into individual tokens, and the output is as shown below:

"cloud" "computing" "is" "benefiting" "major" "manufacturing" "companies"

import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)

What I would ideally want is to read 'cloud computing' together, as it is technically one word. Basically, I am looking for a bigram. Is there any feature in spaCy that allows bigrams or trigrams? spaCy allows detection of noun chunks. So to parse your noun phrases as single
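
A minimal sketch of the noun-chunk idea hinted at in the truncated answer, assuming the en_core_web_sm model is installed: spaCy's retokenizer can merge each noun chunk into a single token so that "Cloud computing" comes out as one unit.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Cloud computing is benefiting major manufacturing companies")

# Merge every noun chunk into a single token.
with doc.retokenize() as retokenizer:
    for chunk in doc.noun_chunks:
        retokenizer.merge(chunk)

print([token.text for token in doc])
# e.g. ['Cloud computing', 'is', 'benefiting', 'major manufacturing companies']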

Ngram model and perplexity in NLTK

偶尔善良 submitted on 2019-12-04 08:15:28
Question: To put my question in context, I would like to train and test/compare several (neural) language models. In order to focus on the models rather than on data preparation, I chose to use the Brown corpus from NLTK and train the Ngrams model shipped with NLTK as a baseline (to compare the other LMs against). So my first question is actually about a behaviour of NLTK's Ngram model that I find suspicious. Since the code is rather short, I pasted it here:

import nltk
print "... build"
brown = nltk
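
The pasted code targets the old NgramModel (note the Python 2 print). As a rough equivalent on current NLTK (3.4+), where the n-gram models live in nltk.lm, one might write something like the sketch below; the corpus slice and the Laplace smoothing are my own choices, made to keep the example fast and to avoid infinite perplexity on unseen n-grams, not what the original post did.

import nltk
from nltk.corpus import brown
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

nltk.download("brown", quiet=True)

order = 3
sents = [[w.lower() for w in sent] for sent in brown.sents()[:5000]]  # small slice for speed
train_ngrams, vocab = padded_everygram_pipeline(order, sents)

lm = Laplace(order)
lm.fit(train_ngrams, vocab)

test_sent = ["the", "jury", "said", "it", "was", "pleased"]
test_ngrams = list(nltk.ngrams(test_sent, order))
print(lm.perplexity(test_ngrams))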

grouping all Named entities in a Document

主宰稳场 submitted on 2019-12-04 06:16:51
Question: I would like to group all named entities in a given document. For example: **Barack Hussein Obama** II is the 44th and current President of the United States, and the first African American to hold the office. I do not want to use the OpenNLP APIs, as they might not be able to recognize all named entities. Is there any way to generate such n-grams using other services, or maybe a way to group all noun terms together? Answer 1: If you want to avoid using NER, you could use a sentence chunker or parser.
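
Following the chunker suggestion, a hedged sketch with NLTK's regular-expression NP chunker (the grammar is deliberately simple, and the resource names can vary slightly across NLTK versions):

import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

sentence = ("Barack Hussein Obama II is the 44th and current President of the "
            "United States, and the first African American to hold the office.")

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# Very simple grammar: an NP is an optional determiner, any adjectives, then nouns.
grammar = "NP: {<DT>?<JJ.*>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees() if subtree.label() == "NP"]
print(phrases)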

Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

流过昼夜 submitted on 2019-12-04 05:16:22
When I use an analyzer with edge n-grams (min=3, max=7, front) plus term_vector=with_positions_offsets, with a document whose text is "CouchDB", and I search for "couc", my highlight is on "cou" and not "couc". It seems my highlight covers only the minimum matching token "cou", while I would expect it to cover the exact token (if possible) or at least the longest matching token found. It works fine when the text is analyzed without term_vector=with_positions_offsets. What's the impact on performance of removing term_vector=with_positions_offsets? When you set term_vector=with_positions_offsets for a specific field
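
For reference, an index configured the way the question describes would look roughly like the sketch below, written as a Python dict that could be passed to the Elasticsearch create-index call. The index name, the search_analyzer choice, and the analyzer wiring are my assumptions; the legacy "front"-side setting is expressed with the current edge_ngram token filter.

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "front_edge_ngrams": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 7,
                }
            },
            "analyzer": {
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "front_edge_ngrams"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "edge_ngram_analyzer",
                "search_analyzer": "standard",
                "term_vector": "with_positions_offsets",
            }
        }
    },
}
# e.g. with the official client: es.indices.create(index="my_index", body=index_body)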

“Anagram solver” based on statistics rather than a dictionary/table?

不打扰是莪最后的温柔 submitted on 2019-12-03 19:11:19
Question: My problem is conceptually similar to solving anagrams, except I can't just use a dictionary lookup. I am trying to find plausible words rather than real words. I have created an N-gram model (for now, N=2) based on the letters in a bunch of text. Now, given a random sequence of letters, I would like to permute them into the most likely sequence according to the transition probabilities. I thought I would need the Viterbi algorithm when I started this, but as I look deeper, the Viterbi
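
Before reaching for Viterbi, the brute-force version of the idea is easy to sketch: train letter-bigram counts on some text, then score every permutation of the input letters with add-alpha smoothing and keep the best one. All names below are illustrative, and the permutation loop is only feasible for short letter sequences.

import math
from collections import Counter
from itertools import permutations

def train_bigrams(text):
    # Letter unigram/bigram counts with ^ and $ as word-boundary markers.
    words = "".join(ch if ch.isalpha() or ch == " " else " " for ch in text.lower()).split()
    unigrams, bigrams = Counter(), Counter()
    for word in words:
        padded = "^" + word + "$"
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(word, unigrams, bigrams, alpha=0.5):
    # Sum of smoothed log transition probabilities for the candidate word.
    padded = "^" + word + "$"
    vocab = len(unigrams) + 1
    return sum(math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
               for a, b in zip(padded, padded[1:]))

def best_arrangement(letters, unigrams, bigrams):
    return max(("".join(p) for p in permutations(letters)),
               key=lambda w: log_prob(w, unigrams, bigrams))

unigrams, bigrams = train_bigrams("a reasonably large body of english training text goes here")
print(best_arrangement("lepap", unigrams, bigrams))  # prints whichever ordering the model likes best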

n-gram name analysis in non-English languages (CJK, etc)

不羁岁月 submitted on 2019-12-03 16:33:34
I'm working on deduping a database of people. For a first pass, I'm following a basic two-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": I iterate over the whole dataset and bin each record based on the n-grams AND initials present in the name. Second, all the records in each bin are compared using Jaro-Winkler to get a measure of the likelihood that they represent the same person. My problem: the names are Unicode. Some (though not many) of these names are in CJK (Chinese-Japanese-Korean) languages. I have no idea how to find word
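
One common workaround when there is nothing to split on (as with CJK names written without spaces) is to fall back to character n-grams, since individual CJK characters already carry word-like information. A minimal sketch, with a made-up example name:

def char_ngrams(name, n=2):
    # Character n-grams; no word segmentation required.
    s = "".join(name.split())            # drop whitespace if present
    if len(s) <= n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}

print(char_ngrams("田中太郎"))    # the set {'田中', '中太', '太郎'} (order may vary)
print(char_ngrams("Jane Smith"))  # Latin-script names go through the same code path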