nlp

Testing text classification ML model with new data fails

Submitted by 旧时模样 on 2020-12-23 18:06:03
Question: I have built a machine learning model to classify emails as spam or not. Now I want to test my own email and see the result, so I wrote the following code to classify the new email:

message = """Subject: Hello this is from google security team we want to recover your password. Please contact us as soon as possible"""
message = pd.Series([message,])
transformed_message = CountVectorizer(analyzer=process_text).fit_transform(message)
proba = model.predict_proba(transformed_message)[0]

Knowing
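The likely failure here is a common one: fitting a fresh CountVectorizer on the new message builds a brand-new vocabulary, so the resulting matrix has different columns than the one the model was trained on. A minimal sketch of the fix, using a toy corpus and classifier in place of the asker's (not shown) training data, process_text analyzer, and model:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the asker's email corpus (hypothetical).
train_texts = ["win money now", "meeting at noon", "free prize claim", "lunch tomorrow"]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Fit the vectorizer ONCE on the training data and keep it around.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = MultinomialNB().fit(X_train, train_labels)

# For a new message, only call transform() on the already-fitted vectorizer.
# Fitting a fresh CountVectorizer would build a different vocabulary, so the
# columns would no longer line up with what the model expects.
new_message = ["claim your free prize money"]
X_new = vectorizer.transform(new_message)
proba = model.predict_proba(X_new)[0]
print(proba)  # probabilities for [ham, spam]
```

The same pattern applies with a custom analyzer: pass it to the one vectorizer that is fitted on the training set, then reuse that fitted object for every new message.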

gensim most_similar with positive and negative, how does it work?

Submitted by 自作多情 on 2020-12-15 06:49:10
Question: I was reading this answer, which says about Gensim's most_similar: it performs vector arithmetic: adding the positive vectors, subtracting the negative, then, from that resulting position, listing the known vectors closest to that angle. But when I tested it, that is not the case. I trained a Word2Vec model on Gensim's "text8" dataset and tested these two:

model.most_similar(positive=['woman', 'king'], negative=['man'])
>>> [('queen', 0.7131118178367615), ('prince', 0.6359186768531799),...]

model.wv
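Part of the discrepancy usually comes from two details of most_similar: it combines the *unit-normalized* input vectors, and it excludes the input words themselves from the results. A NumPy sketch of that behavior, using made-up toy vectors rather than a trained model:

```python
import numpy as np

# Toy word vectors standing in for a trained Word2Vec model (hypothetical values).
vecs = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.8, 0.9, 0.1]),
    "prince": np.array([0.9, 0.7, 0.15]),
    "man":    np.array([0.9, 0.1, 0.2]),
    "woman":  np.array([0.8, 0.2, 0.2]),
}

def unit(v):
    return v / np.linalg.norm(v)

def most_similar(positive, negative, topn=2):
    # Combine the UNIT-NORMALIZED input vectors, not the raw ones...
    target = sum(unit(vecs[w]) for w in positive) - sum(unit(vecs[w]) for w in negative)
    target = unit(target)
    # ...and exclude the input words themselves from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, float(np.dot(unit(v), target)))
              for w, v in vecs.items() if w not in exclude]
    return sorted(scored, key=lambda x: -x[1])[:topn]

result = most_similar(positive=["woman", "king"], negative=["man"])
print(result)  # 'queen' ranks first with these toy vectors
```

Doing the raw arithmetic by hand (adding unnormalized vectors and not excluding the inputs) gives different rankings, which is a frequent source of the "that is not the case" impression.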

Sentence structure analysis

Submitted by 限于喜欢 on 2020-12-15 01:41:40
Question: I am trying to look at the structural similarity of sentences, specifically the positions of verbs, adjectives, and nouns. For instance, I have three (or more) sentences that look as follows: "I ate an apple pie, yesterday." "I ate an orange, yesterday." "I eat a lemon, today." All of them start with a pronoun (I) followed by a verb (ate/eat) and a noun (apple pie, orange, lemon) and, finally, an adverb (yesterday/today). I would like to know if there is a way to identify the structure, i.e.
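One workable approach is to reduce each sentence to its sequence of POS tags and compare the sequences. A standard-library sketch, with the tags hard-coded here as a tagger such as spaCy would produce them (so the example is self-contained):

```python
import difflib

# POS-tag sequences assumed to come from a tagger (e.g. spaCy's en_core_web_sm);
# hard-coded here to keep the sketch self-contained.
sent_tags = {
    "I ate an apple pie, yesterday.": ["PRON", "VERB", "DET", "NOUN", "NOUN", "ADV"],
    "I ate an orange, yesterday.":    ["PRON", "VERB", "DET", "NOUN", "ADV"],
    "I eat a lemon, today.":          ["PRON", "VERB", "DET", "NOUN", "ADV"],
}

def structure_similarity(tags_a, tags_b):
    # Ratio in [0, 1]; 1.0 means identical tag sequences.
    return difflib.SequenceMatcher(None, tags_a, tags_b).ratio()

sents = list(sent_tags)
sim = structure_similarity(sent_tags[sents[1]], sent_tags[sents[2]])
print(sim)  # identical structures -> 1.0
```

Because the comparison runs over tag sequences rather than words, "I ate an orange, yesterday" and "I eat a lemon, today" come out structurally identical even though they share almost no vocabulary.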

How to view the tf-idf score for each word

Submitted by 我是研究僧i on 2020-12-13 05:56:40
Question: I was trying to see the tf-idf score of each word in my document. However, it only returns the values in a matrix, whereas I want the tf-idf score shown against each word. The code works, but I want to change the way the result is presented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
bow_transformer = CountVectorizer(analyzer=text_process).fit(df["comments"].head())

Extract Main and Subordinate Clauses from a German Sentence with spaCy

Submitted by 房东的猫 on 2020-12-12 06:28:18
Question: In German, how can I extract the main and subordinate clauses (aka "dependent clauses") from a sentence with spaCy? I know how to use spaCy's tokenizer, part-of-speech tagging, and dependency parser, but I cannot figure out how to represent the grammatical rules of German using the information spaCy can extract.

Answer 1: The problem can be divided into two tasks: 1. splitting the sentence into its constituent clauses, and 2. identifying which of the clauses is a main clause and which
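For task 2, a rough heuristic (a simplification that ignores separable verb prefixes and other complications) is that German main clauses are verb-second while subordinate clauses are verb-final. A sketch operating on coarse POS-tag lists assumed to come from spaCy's German model, so it runs without the model installed:

```python
# Classify one already-split, already-tagged clause as main or subordinate.
# Heuristic only: German main clauses place the finite verb early (V2),
# subordinate clauses place it last. Real text needs more care (separable
# prefixes, coordination, verb clusters).
def classify_clause(tags):
    """tags: list of coarse POS tags (e.g. spaCy's token.pos_) for one clause."""
    if "VERB" not in tags and "AUX" not in tags:
        return "no finite verb"
    last_verb = max(i for i, t in enumerate(tags) if t in ("VERB", "AUX"))
    return "subordinate" if last_verb == len(tags) - 1 else "main"

# "Ich gehe nach Hause" -> verb in second position -> main clause
print(classify_clause(["PRON", "VERB", "ADP", "NOUN"]))
# "weil es regnet" -> verb in final position -> subordinate clause
print(classify_clause(["SCONJ", "PRON", "VERB"]))
```

With spaCy, the tag lists would come from iterating over the tokens of each clause span (token.pos_), and the clause splitting itself (task 1) can lean on dependency labels such as the subordinating-conjunction attachment.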

Keyword in context (kwic) for skipgrams?

Submitted by 蓝咒 on 2020-12-12 02:07:06
Question: I do keyword-in-context analysis with quanteda for ngrams and tokens, and it works well. I now want to do it for skipgrams, to capture the context of "barriers to entry" but also "barriers to [...] [and] entry". The following code produces a kwic object which is empty, but I don't know what I did wrong. dcc.corpus refers to the text document. I also used the tokenized version, but nothing changes. The result is: "kwic object with 0 rows"

x <- tokens("barriers entry")
ntoken_test <- tokens_ngrams(x, n = 2,
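Independently of the quanteda-specific fix, it can help to see what a skip bigram pattern actually matches. A Python sketch (a hypothetical helper, not quanteda's implementation) of the pairs a call like tokens_ngrams(x, n = 2, skip = 0:2) would generate from a token stream:

```python
from itertools import combinations

def skipgrams(tokens, n=2, max_skip=2):
    """All n-grams allowing up to max_skip skipped tokens between the first
    and last chosen positions (mirrors skip = 0:2 for bigrams)."""
    grams = []
    for idxs in combinations(range(len(tokens)), n):
        # Gap beyond strict adjacency between the chosen positions.
        if idxs[-1] - idxs[0] - (n - 1) <= max_skip:
            grams.append("_".join(tokens[i] for i in idxs))
    return grams

toks = ["barriers", "to", "market", "entry"]
grams = skipgrams(toks)
print(grams)  # includes "barriers_entry", matching "barriers to [...] entry"
```

The point of the sketch: "barriers_entry" only appears among the skip bigrams of the *document* tokens, so the kwic pattern must be matched against the skip-gram-tokenized corpus, not against a two-token phrase tokenized on its own.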