nlp

NLP (natural language processing): how to detect a question, by any method?

Submitted by 流过昼夜 on 2020-04-12 07:34:47
Question: I am looking for a machine learning method to detect questions. For example:

User: Please tell me your name ?
AI: (the AI detects that the user wants to know its name) My name is [AI's name].

My dataset looks like this:

[label], [question]
1 , What's your name?
1 , Tell me your name.
...

But the input may also include things that are not questions. For example:

User: Hello, my name is [User name]
AI: (this is not a question) (hand off to another process) (->) Nice to meet you.

The number of question categories
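Before training a classifier on the labeled data above, a rule-based baseline is worth sketching. The snippet below is a minimal sketch (standard library only, English-specific, not the asker's method): it flags utterances ending in "?" or starting with an interrogative or request cue, which matches the dataset's treatment of imperatives like "Tell me your name." as questions.

```python
# Minimal rule-based question/request detector (sketch, stdlib only).
# A learned classifier (e.g. bag-of-words + logistic regression) can
# replace this, but these surface cues already cover many English cases.
QUESTION_STARTERS = {
    "what", "who", "whom", "whose", "which", "where", "when", "why", "how",
    "is", "are", "am", "was", "were", "do", "does", "did",
    "can", "could", "will", "would", "shall", "should", "may", "might",
    "tell", "please",  # imperative requests like "Tell me your name."
}

def is_question(utterance: str) -> bool:
    text = utterance.strip().lower()
    if not text:
        return False
    if text.endswith("?"):          # explicit question mark
        return True
    first_word = text.split()[0].strip(",")
    return first_word in QUESTION_STARTERS
```

A baseline like this also gives a sanity check for any trained model: the model should at least beat it on held-out data.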

SpaCy — intra-word hyphens. How to treat them as one word?

Submitted by 天涯浪子 on 2020-04-11 06:31:23
Question: The following code was provided as an answer to the question:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
import re

nlp = spacy.load('en')
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = spacy.util.compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)
s1 = "Marketing
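The mechanism at issue is spaCy's infix handling: tokens are split wherever an "infix" pattern matches inside a word span, so whether "intra-word" stays one token depends on whether a hyphen pattern is in the compiled infix set. The snippet below is a stdlib-only sketch of that idea (it is not spaCy's actual implementation; in spaCy you would instead build the infix tuple without the hyphen pattern and pass it to compile_infix_regex).

```python
import re

# Sketch of the "infix" idea behind spaCy's tokenizer (stdlib only):
# each whitespace-delimited span is split on infix patterns, keeping the
# matched separators as tokens. If "-" is an infix, "intra-word" splits
# into three tokens; if it is not, the hyphenated word stays whole.
def tokenize(text, infix_pattern):
    tokens = []
    for span in text.split():
        parts = [p for p in re.split(f"({infix_pattern})", span) if p]
        tokens.extend(parts)
    return tokens

with_hyphen_infix = tokenize("intra-word hyphens", r"[-]")     # splits
without_hyphen_infix = tokenize("intra-word hyphens", r"[/]")  # keeps whole
```

So the usual fix for the question asked is to rebuild the infix regex without any pattern that matches a bare hyphen between letters.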

NLP in Python: Obtain word names from SelectKBest after vectorizing

Submitted by 牧云@^-^@ on 2020-04-11 06:30:10
Question: I can't seem to find an answer to my exact problem. Can anyone help? A simplified description of my dataframe ("df"): it has 2 columns: one is a bunch of text ("Notes"), and the other is a binary variable indicating whether the resolution time was above average or not ("y"). I did bag-of-words on the text:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(df["Notes"])

My matrix is 6290 x
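The usual route back from SelectKBest to word names is a boolean mask: in scikit-learn, selector.get_support() marks which of the vectorizer's features survived, and indexing vectorizer.get_feature_names_out() with that mask yields the selected words. The snippet below sketches that mechanism with plain Python and an illustrative toy vocabulary and scores (not the asker's data):

```python
# How SelectKBest-style selection recovers word names (sketch, stdlib only).
# In scikit-learn the equivalent three pieces are:
#   vocab    = vectorizer.get_feature_names_out()
#   mask     = selector.get_support()       # boolean, len == n_features
#   selected = vocab[mask]
vocab = ["delay", "error", "fixed", "network", "urgent"]  # toy vocabulary
scores = [0.2, 3.1, 0.5, 2.7, 4.0]                        # toy chi2 scores
k = 2

# get_support(): True for the k highest-scoring features.
threshold = sorted(scores, reverse=True)[k - 1]
mask = [s >= threshold for s in scores]

selected = [word for word, keep in zip(vocab, mask) if keep]
```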

Document similarity: Vector embedding versus Tf-Idf performance?

Submitted by 允我心安 on 2020-04-09 18:37:25
Question: I have a collection of documents, where each document is rapidly growing over time. The task is to find similar documents at any fixed time. I have two potential approaches:

A vector embedding (word2vec, GloVe, or fastText), averaging over the word vectors in a document, and using cosine similarity.
Bag-of-words: tf-idf or one of its variants, such as BM25.

Will one of these yield a significantly better result? Has someone done a quantitative comparison of tf-idf versus averaged word2vec for document
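Whichever representation wins, both pipelines end in the same comparison: cosine similarity between document vectors (averaged embeddings in one case, tf-idf weights in the other). A minimal stdlib sketch of that shared final step, over toy bag-of-words vectors:

```python
import math
from collections import Counter

# Cosine similarity: the comparison both approaches share.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy bag-of-words vectors over a shared vocabulary (tf-idf weighting or
# embedding averaging would replace the raw counts here).
def bow(text, vocab):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["document", "similarity", "growing", "cosine"]
d1 = bow("document similarity cosine", vocab)
d2 = bow("document similarity growing", vocab)
```

One practical note for the growing-document setting: tf-idf vectors must be recomputed as the corpus statistics change, while averaged embeddings only need the new words appended to the running average.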

How to visualize attention weights?

Submitted by 淺唱寂寞╮ on 2020-04-08 06:59:06
Question: Using this implementation, I have added attention to my RNN (which classifies input sequences into two classes) as follows:

visible = Input(shape=(250,))
embed = Embedding(vocab_size, 100)(visible)
activations = keras.layers.GRU(250, return_sequences=True)(embed)
attention = TimeDistributed(Dense(1, activation='tanh'))(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(250)(attention)
attention = Permute([2, 1])(attention)
sent
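What gets visualized here is the output of the Activation('softmax') layer: one scalar score per timestep, normalized so the 250 weights sum to 1, typically drawn as a heat map over the input tokens. A stdlib sketch of exactly that quantity (toy scores, 4 timesteps instead of 250):

```python
import math

# The attention branch above computes, per sequence: a scalar score per
# timestep (tanh Dense(1)), then a softmax over timesteps. These softmax
# weights, summing to 1, are what an attention heat map shows.
def softmax(scores):
    m = max(scores)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, 2.0, 0.3, 1.2]               # toy per-timestep scores
weights = softmax(scores)                   # attention weights, sum to 1
```

To read these weights out of the trained network at prediction time, the usual approach (an assumption here, not confirmed by the question) is an intermediate model whose output is the softmax Activation layer, e.g. keras.Model(inputs=visible, outputs=that_layer_output), then predict on the same input and plot the result.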

Using keras tokenizer for new words not in training set

Submitted by 大兔子大兔子 on 2020-04-08 02:03:07
Question: I'm currently using the Keras Tokenizer to create a word index, and then matching that word index to the imported GloVe dictionary to create an embedding matrix. However, the problem I have is that this seems to defeat one of the advantages of using a word vector embedding: when the trained model is used for predictions and runs into a new word that is not in the tokenizer's word index, it removes that word from the sequence.

#fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts
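The Keras Tokenizer has a built-in answer to the dropped-word problem: constructing it as Tokenizer(oov_token="&lt;OOV&gt;") reserves an index for out-of-vocabulary words, so unseen words map to that index instead of vanishing from the sequence. The snippet below sketches the mechanism with plain dicts (a stdlib illustration of the behavior, not Keras's actual implementation):

```python
# Sketch of Tokenizer(oov_token=...) behavior (stdlib only): unseen words
# map to a reserved OOV index instead of being dropped from the sequence.
OOV = "<OOV>"

def fit_word_index(texts):
    word_index = {OOV: 1}                   # index 0 reserved for padding
    for text in texts:
        for word in text.lower().split():
            word_index.setdefault(word, len(word_index) + 1)
    return word_index

def texts_to_sequences(texts, word_index):
    return [[word_index.get(w, word_index[OOV]) for w in t.lower().split()]
            for t in texts]

word_index = fit_word_index(["the cat sat", "the dog ran"])
seqs = texts_to_sequences(["the fox ran"], word_index)   # "fox" -> OOV index
```

In the embedding matrix, the OOV row is typically initialized to zeros or to the mean of the GloVe vectors, so unseen words at prediction time at least keep their position in the sequence.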

On Data Visualization

Submitted by 不问归期 on 2020-03-26 07:42:08
Lately I have wanted to collect some articles on data visualization for model feature selection, feature engineering, and model choice, as well as data visualization within NLP. Because I want to quickly teach myself traditional Chinese medicine, I am very interested in NLP and plan to study it further. Starting this post as a placeholder...

1. The importance of data visualization

Anscombe's Quartet. I recommend everyone look this up. Why didn't my teachers ever show me this example when I was in school?

Source: https://www.cnblogs.com/SSSR/p/10924423.html
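Anscombe's Quartet is easy to verify numerically. Using the standard published values for the first two of the four y-series: their means and standard deviations agree to two decimal places, yet one plots as a noisy linear trend and the other as a clean parabola, which is exactly why summary statistics without a plot mislead.

```python
from statistics import mean, pstdev

# First two y-series of Anscombe's Quartet (standard published values).
# Near-identical summary statistics, completely different shapes when
# plotted against the shared x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5].
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

m1, m2 = mean(y1), mean(y2)       # both approximately 7.50
s1, s2 = pstdev(y1), pstdev(y2)   # nearly identical spread
```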