nlp

NLP (natural language processing): how to detect a question, by any method?

Submitted by 流过昼夜 on 2020-04-12 07:34:47
Question: I am looking for a machine learning method to detect questions. For example:

User: Please tell me your name ?
AI: (the AI detects that the user wants to know its name) My name is [AI's name].

My dataset looks like this:

[label], [question]
1 , What's your name?
1 , Tell me your name.
...

But the input may also include things that are not questions. For example:

User: Hello, my name is [User name]
AI: (this is not a question) (hand off to another process) (->) Nice to meet you.

The number of question categories
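Before training a classifier on the labeled data above, a rule-based baseline is worth sketching. The snippet below is a minimal sketch (standard library only, English-specific, not the asker's method): it flags utterances ending in "?" or starting with an interrogative or request cue, which matches the dataset's treatment of imperatives like "Tell me your name." as questions.

```python
# Minimal rule-based question/request detector (sketch, stdlib only).
# A learned classifier (e.g. bag-of-words + logistic regression) can
# replace this, but these surface cues already cover many English cases.
QUESTION_STARTERS = {
    "what", "who", "whom", "whose", "which", "where", "when", "why", "how",
    "is", "are", "am", "was", "were", "do", "does", "did",
    "can", "could", "will", "would", "shall", "should", "may", "might",
    "tell", "please",  # imperative requests like "Tell me your name."
}

def is_question(utterance: str) -> bool:
    text = utterance.strip().lower()
    if not text:
        return False
    if text.endswith("?"):          # explicit question mark
        return True
    first_word = text.split()[0].strip(",")
    return first_word in QUESTION_STARTERS
```

A baseline like this also gives a sanity check for any trained model: the model should at least beat it on held-out data.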

SpaCy — intra-word hyphens. How to treat them as one word?

Submitted by 天涯浪子 on 2020-04-11 06:31:23
Question: The following code was provided as an answer to the question:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
import re

nlp = spacy.load('en')
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = spacy.util.compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)
s1 = "Marketing
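The mechanism at issue is spaCy's infix handling: tokens are split wherever an "infix" pattern matches inside a word span, so whether "intra-word" stays one token depends on whether a hyphen pattern is in the compiled infix set. The snippet below is a stdlib-only sketch of that idea (it is not spaCy's actual implementation; in spaCy you would instead build the infix tuple without the hyphen pattern and pass it to compile_infix_regex).

```python
import re

# Sketch of the "infix" idea behind spaCy's tokenizer (stdlib only):
# each whitespace-delimited span is split on infix patterns, keeping the
# matched separators as tokens. If "-" is an infix, "intra-word" splits
# into three tokens; if it is not, the hyphenated word stays whole.
def tokenize(text, infix_pattern):
    tokens = []
    for span in text.split():
        parts = [p for p in re.split(f"({infix_pattern})", span) if p]
        tokens.extend(parts)
    return tokens

with_hyphen_infix = tokenize("intra-word hyphens", r"[-]")     # splits
without_hyphen_infix = tokenize("intra-word hyphens", r"[/]")  # keeps whole
```

So the usual fix for the question asked is to rebuild the infix regex without any pattern that matches a bare hyphen between letters.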

NLP in Python: Obtain word names from SelectKBest after vectorizing

Submitted by 牧云@^-^@ on 2020-04-11 06:30:10
Question: I can't seem to find an answer to my exact problem. Can anyone help? A simplified description of my dataframe ("df"): it has 2 columns: one is a bunch of text ("Notes"), and the other is a binary variable indicating whether the resolution time was above average or not ("y"). I did bag-of-words on the text:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(df["Notes"])

My matrix is 6290 x
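The usual route back from SelectKBest to word names is a boolean mask: in scikit-learn, selector.get_support() marks which of the vectorizer's features survived, and indexing vectorizer.get_feature_names_out() with that mask yields the selected words. The snippet below sketches that mechanism with plain Python and an illustrative toy vocabulary and scores (not the asker's data):

```python
# How SelectKBest-style selection recovers word names (sketch, stdlib only).
# In scikit-learn the equivalent three pieces are:
#   vocab    = vectorizer.get_feature_names_out()
#   mask     = selector.get_support()       # boolean, len == n_features
#   selected = vocab[mask]
vocab = ["delay", "error", "fixed", "network", "urgent"]  # toy vocabulary
scores = [0.2, 3.1, 0.5, 2.7, 4.0]                        # toy chi2 scores
k = 2

# get_support(): True for the k highest-scoring features.
threshold = sorted(scores, reverse=True)[k - 1]
mask = [s >= threshold for s in scores]

selected = [word for word, keep in zip(vocab, mask) if keep]
```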

Document similarity: Vector embedding versus Tf-Idf performance?

Submitted by 允我心安 on 2020-04-09 18:37:25
Question: I have a collection of documents, where each document is rapidly growing over time. The task is to find similar documents at any fixed time. I have two potential approaches:

A vector embedding (word2vec, GloVe, or fastText), averaging over the word vectors in a document, and using cosine similarity.
Bag-of-words: tf-idf or one of its variants, such as BM25.

Will one of these yield a significantly better result? Has someone done a quantitative comparison of tf-idf versus averaged word2vec for document
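Whichever representation wins, both pipelines end in the same comparison: cosine similarity between document vectors (averaged embeddings in one case, tf-idf weights in the other). A minimal stdlib sketch of that shared final step, over toy bag-of-words vectors:

```python
import math
from collections import Counter

# Cosine similarity: the comparison both approaches share.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy bag-of-words vectors over a shared vocabulary (tf-idf weighting or
# embedding averaging would replace the raw counts here).
def bow(text, vocab):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["document", "similarity", "growing", "cosine"]
d1 = bow("document similarity cosine", vocab)
d2 = bow("document similarity growing", vocab)
```

One practical note for the growing-document setting: tf-idf vectors must be recomputed as the corpus statistics change, while averaged embeddings only need the new words appended to the running average.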

How to visualize attention weights?

Submitted by 淺唱寂寞╮ on 2020-04-08 06:59:06
Question: Using this implementation, I have added attention to my RNN (which classifies input sequences into two classes) as follows:

visible = Input(shape=(250,))
embed = Embedding(vocab_size, 100)(visible)
activations = keras.layers.GRU(250, return_sequences=True)(embed)
attention = TimeDistributed(Dense(1, activation='tanh'))(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(250)(attention)
attention = Permute([2, 1])(attention)
sent
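What gets visualized here is the output of the Activation('softmax') layer: one scalar score per timestep, normalized so the 250 weights sum to 1, typically drawn as a heat map over the input tokens. A stdlib sketch of exactly that quantity (toy scores, 4 timesteps instead of 250):

```python
import math

# The attention branch above computes, per sequence: a scalar score per
# timestep (tanh Dense(1)), then a softmax over timesteps. These softmax
# weights, summing to 1, are what an attention heat map shows.
def softmax(scores):
    m = max(scores)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, 2.0, 0.3, 1.2]               # toy per-timestep scores
weights = softmax(scores)                   # attention weights, sum to 1
```

To read these weights out of the trained network at prediction time, the usual approach (an assumption here, not confirmed by the question) is an intermediate model whose output is the softmax Activation layer, e.g. keras.Model(inputs=visible, outputs=that_layer_output), then predict on the same input and plot the result.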

Using keras tokenizer for new words not in training set

Submitted by 大兔子大兔子 on 2020-04-08 02:03:07
Question: I'm currently using the Keras Tokenizer to create a word index, and then matching that word index to the imported GloVe dictionary to create an embedding matrix. However, the problem I have is that this seems to defeat one of the advantages of using a word vector embedding: when the trained model is used for predictions and runs into a new word that is not in the tokenizer's word index, it removes that word from the sequence.

#fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts
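The Keras Tokenizer has a built-in answer to the dropped-word problem: constructing it as Tokenizer(oov_token="&lt;OOV&gt;") reserves an index for out-of-vocabulary words, so unseen words map to that index instead of vanishing from the sequence. The snippet below sketches the mechanism with plain dicts (a stdlib illustration of the behavior, not Keras's actual implementation):

```python
# Sketch of Tokenizer(oov_token=...) behavior (stdlib only): unseen words
# map to a reserved OOV index instead of being dropped from the sequence.
OOV = "<OOV>"

def fit_word_index(texts):
    word_index = {OOV: 1}                   # index 0 reserved for padding
    for text in texts:
        for word in text.lower().split():
            word_index.setdefault(word, len(word_index) + 1)
    return word_index

def texts_to_sequences(texts, word_index):
    return [[word_index.get(w, word_index[OOV]) for w in t.lower().split()]
            for t in texts]

word_index = fit_word_index(["the cat sat", "the dog ran"])
seqs = texts_to_sequences(["the fox ran"], word_index)   # "fox" -> OOV index
```

In the embedding matrix, the OOV row is typically initialized to zeros or to the mean of the GloVe vectors, so unseen words at prediction time at least keep their position in the sequence.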

On Data Visualization

Submitted by 不问归期 on 2020-03-26 07:42:08
Lately I have wanted to collect some articles on data visualization for model feature selection, feature engineering, and model choice, as well as data visualization within NLP. Because I want to quickly teach myself traditional Chinese medicine, I am very interested in NLP and plan to study it further. Starting this post as a placeholder...

1. The importance of data visualization

Anscombe's Quartet. I recommend everyone look this up. Why didn't my teachers ever show me this example when I was in school?

Source: https://www.cnblogs.com/SSSR/p/10924423.html
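Anscombe's Quartet is easy to verify numerically. Using the standard published values for the first two of the four y-series: their means and standard deviations agree to two decimal places, yet one plots as a noisy linear trend and the other as a clean parabola, which is exactly why summary statistics without a plot mislead.

```python
from statistics import mean, pstdev

# First two y-series of Anscombe's Quartet (standard published values).
# Near-identical summary statistics, completely different shapes when
# plotted against the shared x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5].
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

m1, m2 = mean(y1), mean(y2)       # both approximately 7.50
s1, s2 = pstdev(y1), pstdev(y2)   # nearly identical spread
```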