nlp

Keyword in context (kwic) for skipgrams?

Submitted by 末鹿安然 on 2020-12-12 02:06:18
Question: I do keyword-in-context analysis with quanteda for ngrams and tokens and it works well. I now want to do it for skipgrams, to capture the context of "barriers to entry" but also "barriers to [...] [and] entry". The following code produces a kwic object which is empty, but I don't know what I did wrong. dcc.corpus refers to the text document. I also used the tokenized version, but nothing changes. The result is "kwic object with 0 rows".

x <- tokens("barriers entry")
ntoken_test <- tokens_ngrams(x, n = 2,
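
As a side note, the skipgram idea itself can be illustrated independently of quanteda. The following Python sketch uses NLTK's skipgrams helper (an illustration only, not the quanteda workflow from the question) to show the token pairs a skipgram pattern such as "barriers ... entry" would have to match:

from nltk.util import skipgrams

tokens = "barriers to entry can be high".split()

# 2-grams that allow up to two skipped words between the members,
# so the output contains ("barriers", "to") as well as ("barriers", "entry").
print(list(skipgrams(tokens, 2, 2)))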

How to rate quality of a (scraped) sentence?

Submitted by 六月ゝ 毕业季﹏ on 2020-12-06 15:09:07
Question: I am running a scrape-and-process routine in Python 3, but some of the sentences I get are garbage. I would like to reject these but can't figure out how to do it. I am using POS tagging and chunking with NLTK, but that doesn't seem to help me identify invalid sentences. The number of NNs, VBs, etc. doesn't seem to be any different in a garbage "sentence" than in a good one. I guess I am just looking for a simple method to score the grammar of a sentence and reject ones with too many "errors".
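
One crude way to turn this into a score, sketched below with NLTK (the features and the threshold are assumptions, not a definitive filter), is to require at least one verb, at least one noun, and a mostly alphabetic token stream:

import nltk

def looks_like_sentence(text, min_alpha_ratio=0.7):
    # Crude filter: a usable sentence should contain a verb and a noun
    # and consist mostly of alphabetic tokens.
    tokens = nltk.word_tokenize(text)
    if not tokens:
        return False
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    has_verb = any(tag.startswith("VB") for tag in tags)
    has_noun = any(tag.startswith("NN") for tag in tags)
    alpha_ratio = sum(tok.isalpha() for tok in tokens) / len(tokens)
    return has_verb and has_noun and alpha_ratio >= min_alpha_ratio

print(looks_like_sentence("The cat sat on the mat."))
print(looks_like_sentence("mat 12 // -- <div> the the"))

Scraped fragments that are mostly markup, numbers, or repeated tokens tend to fail the alphabetic-ratio check even when their POS-tag counts look normal.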

CountVectorizer: Vocabulary wasn't fitted

Submitted by ♀尐吖头ヾ on 2020-12-02 05:57:31
Question: I instantiated a sklearn.feature_extraction.text.CountVectorizer object by passing a vocabulary through the vocabulary argument, but I get the error message sklearn.utils.validation.NotFittedError: CountVectorizer - Vocabulary wasn't fitted. Why? Example:

import sklearn.feature_extraction
import numpy as np
import pickle

# Save the vocabulary
ngram_size = 1
dictionary_filepath = 'my_unigram_dictionary'
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram
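
The usual cause is that vocabulary_ (the fitted attribute, with a trailing underscore) is only populated once fit(), fit_transform(), or transform() has been called, even when a fixed vocabulary is passed to the constructor. A minimal sketch of the working pattern, with made-up example terms:

from sklearn.feature_extraction.text import CountVectorizer

vocab = ["barriers", "entry", "market"]  # hypothetical fixed vocabulary
vectorizer = CountVectorizer(ngram_range=(1, 1), vocabulary=vocab)

# With a fixed vocabulary, transform() works without a prior fit();
# it validates the vocabulary and sets vectorizer.vocabulary_, after
# which get_feature_names_out() (get_feature_names() in older sklearn
# versions) no longer raises NotFittedError.
X = vectorizer.transform(["low barriers to entry in this market"])
print(X.toarray())
print(vectorizer.get_feature_names_out())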

Python: Chunking others than noun phrases (e.g. prepositional) using Spacy, etc

Submitted by 最后都变了- on 2020-12-01 07:24:22
Question: Since I was told spaCy is such a powerful Python module for natural language processing, I am now desperately looking for a way to group words into phrases beyond noun phrases, most importantly prepositional phrases. I doubt there is a spaCy function for this, but that would be the easiest way, I guess (the spaCy import is already implemented in my project). Nevertheless, I'm open to any approach to phrase recognition/chunking. Answer 1: Here's a solution to get PPs. In general you can get
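
spaCy has no built-in prepositional-phrase chunker comparable to doc.noun_chunks, but the dependency parse makes PPs straightforward to collect: every token the parser labels "prep" heads a prepositional phrase, and its subtree is the phrase. A minimal sketch, assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")

def prepositional_phrases(text):
    # Each token with dependency label "prep" heads a prepositional
    # phrase; its subtree yields the phrase's full token span.
    doc = nlp(text)
    return [" ".join(t.text for t in tok.subtree)
            for tok in doc if tok.dep_ == "prep"]

print(prepositional_phrases("A cat sat on the mat in the corner of the room."))

Nested prepositional phrases show up both on their own and inside the larger phrase that contains them, which is usually what you want for PP extraction.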

Doc2Vec Get most similar documents

Submitted by 你。 on 2020-11-30 02:16:47
Question: I am trying to build a document retrieval model that returns documents ordered by their relevance to a query or search string. For this I trained a doc2vec model using the Doc2Vec model in gensim. My dataset is a pandas DataFrame with each document stored as a string on each line. This is the code I have so far:

import gensim, re
import pandas as pd

# TOKENIZER
def tokenizer(input_string):
    return re.findall(r"[\w']+", input_string)

# IMPORT DATA
data = pd
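
For completeness, the usual retrieval pattern once such a model is trained is to infer a vector for the query using the same tokenizer and ask the document-vector store for the nearest tags. A minimal sketch with a made-up toy corpus standing in for the pandas data:

import re
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def tokenizer(input_string):
    return re.findall(r"[\w']+", input_string)

# Toy documents standing in for the DataFrame column of texts.
docs = ["barriers to entry are high",
        "the cat sat on the mat",
        "entry barriers in new markets"]
tagged = [TaggedDocument(words=tokenizer(d.lower()), tags=[i])
          for i, d in enumerate(docs)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for the query and rank documents by cosine similarity.
query_vec = model.infer_vector(tokenizer("market entry barriers"))
print(model.dv.most_similar([query_vec], topn=3))  # model.docvecs in gensim < 4.0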