n-gram

String Matching Using TF-IDF, NGrams and Cosine Similarity in Python

Submitted by 倾然丶 夕夏残阳落幕 on 2021-02-17 20:59:49

Question: I am working on my first major data science project. I am attempting to match names between a large list of data from one source and a cleansed dictionary in another, using this string-matching blog as a guide and two different data sets. Unfortunately, I can't seem to get good results, and I think I am not applying the technique appropriately.

Code:

    import pandas as pd, numpy as np, re, sparse_dot_topn.sparse_dot_topn as ct
    from sklearn.feature_extraction.text import …
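The core of that blog's pipeline can be sketched with scikit-learn alone (sparse_dot_topn only speeds up the final sparse matrix multiplication for very large lists). The names below are invented examples, and `char_wb` character n-grams are one common choice for tolerating typos and spacing noise:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

messy = ["Jon Smith", "Jane  Doe", "ACME Corp"]         # hypothetical source list
clean = ["John Smith", "Jane Doe", "Acme Corporation"]  # hypothetical clean dictionary

# Character n-grams within word boundaries make similar spellings share features.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(messy + clean)
sims = cosine_similarity(vec.transform(messy), vec.transform(clean))

best = sims.argmax(axis=1)   # index of the best dictionary match for each name
scores = sims.max(axis=1)    # its similarity; threshold this to reject weak matches
matches = [(name, clean[j]) for name, j in zip(messy, best)]
```

A practical knob when results look poor is the `ngram_range` and a minimum-score cutoff on `scores`: exact and near-exact names score near 1.0, while unrelated pairs fall close to 0.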

How to use an ngram and edge ngram tokenizer together in elasticsearch index?

Submitted by 大兔子大兔子 on 2021-02-11 14:21:38

Question: I have an index containing 3 documents:

    { "firstname": "Anne",   "lastname": "Borg" }
    { "firstname": "Leanne", "lastname": "Ray" }
    { "firstname": "Anne",   "middlename": "M", "lastname": "Stone" }

When I search for "Anne", I would like Elasticsearch to return all 3 of these documents (because they all match the term "Anne" to a degree), BUT I would like "Leanne Ray" to have a lower score (relevance ranking), because the search term "Anne" appears at a later position in that document than the term …
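One common pattern for this is to analyze the field with an edge_ngram analyzer (so prefix matches like "Anne…" score through the main field) and expose a plain ngram sub-field for infix matches like "Le-anne", which can be boosted lower at query time. A minimal index-body sketch as a Python dict (the tokenizer/analyzer/field names here are assumptions, not from the question):

```python
# Hypothetical index body: `firstname` uses an edge_ngram analyzer;
# `firstname.ngram` catches infix matches and is typically given a lower boost.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_tok": {"type": "edge_ngram", "min_gram": 2, "max_gram": 10,
                             "token_chars": ["letter"]},
                "ngram_tok": {"type": "ngram", "min_gram": 3, "max_gram": 3,
                              "token_chars": ["letter"]},
            },
            "analyzer": {
                "edge_analyzer": {"type": "custom", "tokenizer": "edge_tok",
                                  "filter": ["lowercase"]},
                "ngram_analyzer": {"type": "custom", "tokenizer": "ngram_tok",
                                   "filter": ["lowercase"]},
            },
        }
    },
    "mappings": {
        "properties": {
            "firstname": {
                "type": "text",
                "analyzer": "edge_analyzer",
                "search_analyzer": "standard",
                "fields": {
                    "ngram": {"type": "text", "analyzer": "ngram_analyzer"}
                },
            }
        }
    },
}
```

At query time, a `bool`/`should` query combining `match` on `firstname` with a lower-boosted `match` on `firstname.ngram` lets "Leanne Ray" still match "Anne" while ranking below the documents where "Anne" is a prefix.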

Binary Classification using the N-Grams

Submitted by 萝らか妹 on 2021-02-11 06:51:48

Question: I want to extract the n-grams of the tweets from two groups of users (0/1), to build a CSV file like the following for a binary classifier:

    user_tweets, ngram1, ngram2, ngram3, ..., label
    1,           0.0,    0.0,    0.0,    ..., 0
    2,           0.0,    0.0,    0.0,    ..., 1
    ...

My question is whether I should first extract the important n-grams of the two groups, and then score each n-gram that I found against each user's tweets, or is there an easier way to do this?

Source: https://stackoverflow.com/questions/66092089/binary-classification-using-the…

Finding conditional probability of trigram in python nltk

Submitted by 你离开我真会死。 on 2021-02-07 06:26:05

Question: I have started learning NLTK, and I am following a tutorial from here, where conditional probability is found using bigrams like this:

    import nltk
    from nltk.corpus import brown
    cfreq_brown_2gram = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

However, I want to find conditional probability using trigrams. When I try to change nltk.bigrams to nltk.trigrams, I get the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "home/env/local/lib/python2.7…
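The error arises because ConditionalFreqDist expects (condition, event) pairs, which nltk.bigrams happens to yield directly, while nltk.trigrams yields 3-tuples. The usual fix is to make the first two words the condition: `nltk.ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in nltk.trigrams(brown.words()))`. The same bookkeeping in plain Python (so it runs without the Brown corpus):

```python
from collections import defaultdict, Counter

def trigram_cfd(words):
    """Map (w1, w2) -> Counter of following words, mirroring
    nltk.ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in nltk.trigrams(words))."""
    cfd = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        cfd[(w1, w2)][w3] += 1
    return cfd

words = "the dog ran and the dog sat down".split()
cfd = trigram_cfd(words)
# P(ran | "the dog") = count(the dog ran) / count(the dog *)
p = cfd[("the", "dog")]["ran"] / sum(cfd[("the", "dog")].values())
```

Here "the dog" occurs twice, followed once by "ran" and once by "sat", so the conditional probability of "ran" given "the dog" is 0.5.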

Dataframe aggregation of n-gram, their frequency and associate the entries of other columns with it using R

Submitted by 人盡茶涼 on 2021-01-29 15:48:18

Question: I am trying to aggregate a dataframe based on 1-gram frequency (this can be extended to n-grams by changing n in the code below) and to associate the other columns with it. The way I did it is shown below. Are there any shortcuts or alternatives that produce the table shown at the very end of this question for the dataframe given below? The code and the results follow. The first chunk sets up the environment, loads the libraries, and reads the dataframe:

    # Clear variables in the working environment
    …

Merging or reversing n-grams to a single string

Submitted by 眉间皱痕 on 2021-01-27 23:44:54

Question: How do I merge the bigrams below into a single string?

    _bigrams=['the school', 'school boy', 'boy is', 'is reading']
    _split=(' '.join(_bigrams)).split()
    _newstr=[]
    _filter=[_newstr.append(x) for x in _split if x not in _newstr]
    _newstr=' '.join(_newstr)
    print _newstr

    Output: 'the school boy is reading'

…it is the desired output, but the approach is too long and not very efficient given the large size of my data. Secondly, the approach would not support duplicate words in the final string, i.e. …
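Since consecutive bigrams overlap by exactly one word, the merge only needs the first bigram in full plus the last word of each subsequent one; this runs in linear time with no membership tests, and it preserves duplicate words, unlike the de-duplicating approach above. A minimal sketch:

```python
def merge_bigrams(bigrams):
    """Rebuild the original string from consecutive overlapping bigrams:
    keep the first bigram whole, then append only each new trailing word."""
    parts = bigrams[0].split()
    for gram in bigrams[1:]:
        parts.append(gram.split()[-1])
    return " ".join(parts)

print(merge_bigrams(['the school', 'school boy', 'boy is', 'is reading']))
# the school boy is reading
print(merge_bigrams(['the dog', 'dog saw', 'saw the', 'the dog']))
# the dog saw the dog  (duplicates survive)
```

The same idea extends to n-grams of any order, as long as adjacent grams overlap by n-1 words.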

Keyword in context (kwic) for skipgrams?

Submitted by 蓝咒 on 2020-12-12 02:07:06

Question: I do keyword-in-context analysis with quanteda for n-grams and tokens, and it works well. I now want to do it for skipgrams: capture the context of "barriers to entry" but also "barriers to […] [and] entry". The following code produces a kwic object which is empty, but I don't know what I did wrong. dcc.corpus refers to the text document; I also used the tokenized version, but nothing changes. The result is "kwic object with 0 rows".

    x <- tokens("barriers entry")
    ntoken_test <- tokens_ngrams(x, n = 2, …
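Independent of quanteda's API, it helps to pin down exactly which token pairs a skip-gram pattern should match, to sanity-check what the kwic search ought to return. A small Python sketch of n-grams allowing up to k skipped tokens (the function name and window logic are mine, not quanteda's):

```python
from itertools import combinations

def skipgrams(tokens, n, k):
    """All in-order n-grams that skip at most k tokens in total."""
    grams = []
    for i, head in enumerate(tokens):
        # n-1 picks from the next n-1+k tokens => at most k skips overall
        window = tokens[i + 1 : i + n + k]
        for rest in combinations(window, n - 1):
            grams.append((head,) + rest)
    return grams

print(skipgrams("barriers to entry".split(), n=2, k=1))
# [('barriers', 'to'), ('barriers', 'entry'), ('to', 'entry')]
```

The pair ('barriers', 'entry') is the skip-gram that lets a kwic-style search catch "barriers to […] entry" as well as the contiguous phrase; if the kwic result has 0 rows, the pattern being searched likely does not include that skipped pair.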
