nlp

Approaches to improve Microsoft ChatBot with each user conversation by learning from it?

浪子不回头ぞ Submitted on 2019-12-24 20:46:23
Question: I am building a Microsoft ChatBot using LUIS for natural language processing. I would like LUIS to improve by learning new utterances for the intents it identifies. For example, if my 'Greeting' intent has the utterances 'Hi', 'Hello' and 'Hello, how are you?', the next time it encounters 'How are you?' it may predict the intent as 'Greeting' with low confidence. If that utterance is learnt as part of the intent, then in future this utterance will be predicted with better accuracy and also help us in…
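One possible direction (a hedged sketch, not an official LUIS feature walkthrough): when the runtime returns a low-confidence prediction and a human confirms the intent, the utterance can be added back to the app through the LUIS Authoring REST API and the app retrained. The `/example` and `/train` routes, the region, key, IDs and threshold below are assumptions about the v2.0 authoring API; verify them against the current LUIS documentation.

```python
# Hedged sketch: feed a low-confidence utterance back into its confirmed intent via the
# LUIS v2.0 Authoring REST API, then retrain. Route and payload shape are assumptions.
import requests

AUTHORING_KEY = "<your-authoring-key>"   # assumption: replace with a real authoring key
REGION = "westus"                        # assumption: your authoring region
APP_ID = "<app-id>"
VERSION = "0.1"

BASE = f"https://{REGION}.api.cognitive.microsoft.com/luis/api/v2.0/apps/{APP_ID}/versions/{VERSION}"
HEADERS = {"Ocp-Apim-Subscription-Key": AUTHORING_KEY, "Content-Type": "application/json"}

def add_utterance_if_low_confidence(text, confirmed_intent, score, threshold=0.6):
    """If the runtime prediction was low-confidence, label the utterance and retrain."""
    if score >= threshold:
        return  # the model is already confident; nothing to learn
    # Label the new utterance under the intent a human confirmed.
    requests.post(f"{BASE}/example", headers=HEADERS,
                  json={"text": text, "intentName": confirmed_intent}).raise_for_status()
    # Kick off training so the new example is used by the next published version.
    requests.post(f"{BASE}/train", headers=HEADERS).raise_for_status()

add_utterance_if_low_confidence("How are you?", "Greeting", score=0.42)
```

LUIS also surfaces low-confidence endpoint utterances for review in the portal (active learning), which gives the same feedback loop without writing code.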

Splitting and grouping plain text (grouping text by chapter in dataframe)?

强颜欢笑 Submitted on 2019-12-24 20:45:28
Question: I have a data frame/tibble into which I've imported a file of plain text (txt). The text is very consistent and is grouped by chapter. Sometimes the chapter text is only one row, sometimes it spans multiple rows. The data is in one column, like this:
# A tibble: 10,708 x 1
   x <chr>
 1 "Chapter 1 "
 2 "Chapter text. "
 3 "Chapter 2 "
 4 "Chapter text. "
 5 "Chapter 3 "
 6 "Chapter text. "
 7 "Chapter text. "
 8 "Chapter 4 "
I'm trying to clean the data to have a new column for Chapter and the text from each chapter in…
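The question is about R/tidyverse, but the grouping logic is language-independent; here is a minimal sketch of the same idea in Python/pandas (the `Chapter \d+` pattern and column names are assumptions based on the sample above): flag the header rows, fill the chapter label downwards, then collapse the remaining rows per chapter.

```python
# Sketch: rows that start a chapter open a new group; every other row is text that
# belongs to the most recent chapter header above it.
import pandas as pd

df = pd.DataFrame({"x": [
    "Chapter 1 ", "Chapter text. ",
    "Chapter 2 ", "Chapter text. ",
    "Chapter 3 ", "Chapter text. ", "Chapter text. ",
    "Chapter 4 ",
]})

is_header = df["x"].str.match(r"^Chapter \d+\s*$")            # lines that begin a chapter
df["chapter"] = df["x"].where(is_header).ffill().str.strip()  # carry the header downwards
body = df[~is_header]                                          # keep only the text rows

result = (body.groupby("chapter", sort=False)["x"]
              .apply(lambda s: " ".join(s.str.strip()))        # join multi-row chapters
              .reset_index(name="text"))
print(result)
```

Chapters with no text rows (like "Chapter 4" in the sample) simply drop out of the grouped result; in R the same approach is a cumulative fill of the header column followed by a grouped paste.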

Using nlp.pipe() with pre-segmented and pre-tokenized text with spaCy

倾然丶 夕夏残阳落幕 Submitted on 2019-12-24 19:39:01
Question: I am trying to tag and parse text that has already been split into sentences and has already been tokenized. As an example: sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']] The fastest way to process batches of text is .pipe(), but it is not clear to me how I can use it with pre-tokenized and pre-segmented text. Performance is key here. I tried the following, but it threw an error: docs = [nlp.tokenizer.tokens_from_list(sentence) for sentence in sents] nlp.tagger…
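A minimal sketch of the usual workaround for the spaCy 2.x era this question comes from: build `Doc` objects directly from the pre-split token lists (so the tokenizer and sentence segmentation are skipped entirely) and run the remaining pipeline components on them. The model name is an assumption.

```python
# Sketch: construct Docs from pre-tokenized sentences, then apply the pipeline
# components (tagger, parser, ner, ...) without re-tokenizing.
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

docs = [Doc(nlp.vocab, words=words) for words in sents]  # one Doc per pre-split sentence

for name, proc in nlp.pipeline:          # run each component in pipeline order
    docs = [proc(doc) for doc in docs]

for doc in docs:
    print([(t.text, t.pos_, t.dep_) for t in doc])
```

Because each pre-segmented sentence becomes its own `Doc`, sentence boundaries are preserved exactly as given.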

Extract Graph from DBpedia, by number of HOPS, Direction

我的梦境 Submitted on 2019-12-24 18:58:15
Question: In the graph above (which is in DBpedia), I want to extract information about TIM COOK by number of hops. If the hop count is 1, I need all the first-level information about TIM COOK, like Masters, APPLE, Car. If the hop count is 2, I need Masters, APPLE, Car, United States. Is there any way I can extract such a graph? I would also like to pass a direction (incoming, outgoing) when extracting it. Could you please help me with a SPARQL query? Source: https://stackoverflow.com/questions/54774104/extract-graph-from
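A hedged sketch of one way to do this from Python with SPARQLWrapper against the public DBpedia endpoint: hop 1 is a single triple pattern around `dbr:Tim_Cook`, hop 2 chains two patterns, and direction is simply which side of the triple the resource sits on. The `LIMIT`, the fixed 2-hop depth and the outgoing/incoming split are assumptions to keep the query small.

```python
# Sketch: pull a 1-hop and 2-hop neighbourhood of dbr:Tim_Cook from DBpedia.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")

QUERY = """
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?s ?p ?o WHERE {
  {   # hop 1, outgoing edges
      dbr:Tim_Cook ?p ?o .
      BIND(dbr:Tim_Cook AS ?s)
  } UNION {
      # hop 2, outgoing: edges of the 1-hop neighbours
      dbr:Tim_Cook ?p1 ?s .
      ?s ?p ?o .
  } UNION {
      # incoming edges (direction reversed)
      ?s ?p dbr:Tim_Cook .
      BIND(dbr:Tim_Cook AS ?o)
  }
}
LIMIT 500
"""

sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```

For deeper hop counts the same pattern can be generated programmatically, adding one chained triple pattern per extra hop.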

How to standardize the bag of words for train and test?

此生再无相见时 Submitted on 2019-12-24 18:57:00
Question: I am trying to classify text based on the bag-of-words model from NLP. I pre-processed the training data using NLTK (punctuation and stop-word removal, lowercasing, stemming, etc.) and created a tf-idf matrix for the training set. I then pre-processed the test data and created a tf-idf matrix for it. The train and test data have different bags of words, so the number of features differs and we cannot use a classification algorithm like kNN. I merged the train and test data together and created a single tf-idf matrix. This solved the…
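The standard fix is to fit the vectorizer on the training texts only and reuse that fitted vocabulary to transform the test texts, so both matrices share exactly the same columns; merging train and test before fitting leaks test vocabulary into training. A minimal scikit-learn sketch (toy data, kNN as in the question):

```python
# Sketch: one vocabulary, learned from train only, applied to both train and test.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["the cat sat on the mat", "dogs are great pets"]
train_labels = [0, 1]
test_texts = ["a cat and a dog"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)   # learns the vocabulary from train only
X_test = vectorizer.transform(test_texts)         # words unseen in training are ignored

assert X_train.shape[1] == X_test.shape[1]        # identical feature space

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)
print(clf.predict(X_test))
```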

Keras: Input layer and passing input data correctly

大兔子大兔子 Submitted on 2019-12-24 18:46:57
Question: I am learning to use the Keras functional API and I have managed to build and compile a model. But when I call model.fit, passing the data X and labels y, I get an error. It seems I still haven't got the idea of how it works. The task is classifying sentences into 6 types, and the code goes: X_ = ... # shape: (2787, 100), each row a sentence and each column a feature y_ = ... # shape: (2787,) word_matrix_weights = ... # code to initiate a lookup matrix for vocabulary embeddings, shape: (9825, 300)…
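A hedged sketch of a functional-API model consistent with the shapes quoted above ((2787, 100) integer inputs, (2787,) integer labels in 0..5, a (9825, 300) embedding matrix). The pooling layer, optimizer and random stand-in data are placeholders; the usual pitfall is pairing a 6-way softmax with integer labels, which needs either one-hot labels or `sparse_categorical_crossentropy` as below.

```python
# Sketch of an Input -> Embedding -> pooling -> Dense(6, softmax) model in tf.keras.
import numpy as np
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.initializers import Constant
from tensorflow.keras.models import Model

X_ = np.random.randint(0, 9825, size=(2787, 100))       # stand-in word-index data
y_ = np.random.randint(0, 6, size=(2787,))               # stand-in integer class ids
word_matrix_weights = np.random.rand(9825, 300).astype("float32")

inputs = Input(shape=(100,), dtype="int32")               # one row of X_ per sample
x = Embedding(9825, 300,
              embeddings_initializer=Constant(word_matrix_weights),
              trainable=False)(inputs)                    # pretrained lookup matrix
x = GlobalAveragePooling1D()(x)                           # simple sentence encoding
outputs = Dense(6, activation="softmax")(x)               # 6 sentence types

model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",     # integer labels, no one-hot needed
              metrics=["accuracy"])
model.fit(X_, y_, batch_size=32, epochs=2, validation_split=0.1)
```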

Gensim doc2vec file stream training worse performance

大兔子大兔子 Submitted on 2019-12-24 18:38:55
Question: Recently I switched to gensim 3.6, mainly for the optimized training process, which streams the training data directly from a file and thus avoids the GIL performance penalties. This is how I used to train my doc2vec: training_iterations = 20 d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0) d2v.build_vocab(corpus) for epoch in range(training_iterations): d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.iter) d2v.alpha -= 0.0002 d2v…
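A hedged sketch of the file-streaming path in gensim >= 3.6: write the corpus in LineSentence format (one pre-tokenized document per line, tokens separated by spaces), pass it as `corpus_file`, and let a single constructor call with `epochs=20` manage the learning-rate schedule. The manual train loop with hand-decayed `alpha` shown above is a common source of degraded quality after the switch. The file name is a placeholder.

```python
# Sketch: one-shot file-streamed Doc2Vec training instead of a manual epoch loop.
from multiprocessing import cpu_count
from gensim.models.doc2vec import Doc2Vec

d2v = Doc2Vec(
    corpus_file="corpus.txt",   # assumption: space-separated tokens, one document per line
    vector_size=200,
    dm=0,
    epochs=20,                  # gensim runs all epochs and decays alpha internally
    alpha=0.025,
    min_alpha=0.0001,
    workers=cpu_count(),
)
d2v.save("d2v_filestream.model")
```

Note that in `corpus_file` mode the document tags are simply line numbers, so keep an external mapping if the documents need named tags.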

Using Word2Vec for polysemy solving problems

旧街凉风 Submitted on 2019-12-24 17:50:15
Question: I have some questions about Word2Vec: (1) What determines the dimension of the resulting model vectors? (2) What are the elements of these vectors? (3) Can I use Word2Vec for polysemy problems (state = administrative unit vs. state = condition), if I already have texts for every meaning of the words? Answer 1: (1) You pick the desired dimensionality as a meta-parameter of the model. Rigorous projects with enough time may try different sizes to see what works best for their qualitative evaluations. (2) Individual…
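For question (3), a common workaround (a sketch, not a built-in Word2Vec feature) is to exploit the fact that the texts are already separated by meaning: rename the ambiguous token per sense before training, so each sense gets its own vector. This assumes gensim 4.x (`vector_size`, `epochs`; older releases use `size`/`iter`), and the sense suffixes and toy corpus are made up for illustration.

```python
# Sketch: pre-tag word senses in the corpus so Word2Vec learns one vector per sense.
from gensim.models import Word2Vec

admin_texts = [["the", "state_GOV", "of", "california", "passed", "a", "law"]]
condition_texts = [["the", "patient", "was", "in", "a", "critical", "state_COND"]]

model = Word2Vec(
    sentences=admin_texts * 50 + condition_texts * 50,  # toy corpus, repeated for the demo
    vector_size=100,   # (1) you choose the dimensionality as a hyperparameter
    window=5,
    min_count=1,
    sg=1,
    epochs=20,
)

# (2) the vector elements are just learned coordinates with no individual meaning
print(model.wv["state_GOV"][:5])
print(model.wv.similarity("state_GOV", "state_COND"))    # the two senses are separate tokens
```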

Inference with tensorflow checkpoints

梦想的初衷 Submitted on 2019-12-24 17:48:00
Question: I am feeding characters (x_train) to the RNN model defined in example 13 of this link. Here is the code corresponding to the model definition, input pre-processing and training: def char_rnn_model(features, target): """Character level recurrent neural network model to predict classes.""" target = tf.one_hot(target, 15, 1, 0) #byte_list = tf.one_hot(features, 256, 1, 0) byte_list = tf.cast(tf.one_hot(features, 256, 1, 0), dtype=tf.float32) byte_list = tf.unstack(byte_list, axis=1) cell = tf…
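A generic TF 1.x sketch of checkpoint inference, not the exact wiring of the linked example: rebuild a graph whose architecture and variable names mirror `char_rnn_model`, restore the latest checkpoint from the training directory, and run the prediction op on freshly encoded characters. `model_dir`, the hidden size and `MAX_DOCUMENT_LENGTH` are placeholders that must match whatever the training script used.

```python
# Sketch (TensorFlow 1.x): rebuild the model graph, restore trained weights, predict.
import numpy as np
import tensorflow as tf

MAX_DOCUMENT_LENGTH = 100   # assumption: same sequence length as at training time

def char_rnn_inference(features, n_classes=15, hidden_size=64):
    """Must mirror char_rnn_model so the checkpoint variables line up by name/shape."""
    byte_list = tf.cast(tf.one_hot(features, 256, 1, 0), tf.float32)
    byte_list = tf.unstack(byte_list, axis=1)          # one tensor per character position
    cell = tf.nn.rnn_cell.GRUCell(hidden_size)
    _, state = tf.nn.static_rnn(cell, byte_list, dtype=tf.float32)
    return tf.layers.dense(state, n_classes)           # class logits

features = tf.placeholder(tf.int32, [None, MAX_DOCUMENT_LENGTH])
logits = char_rnn_inference(features)
predicted_class = tf.argmax(logits, axis=1)

saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint("model_dir"))   # training output dir
    x_new = np.zeros((4, MAX_DOCUMENT_LENGTH), dtype=np.int32)     # stand-in encoded chars
    print(sess.run(predicted_class, feed_dict={features: x_new}))
```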

Stanford Entity Recognizer (caseless) in Python Nltk

断了今生、忘了曾经 Submitted on 2019-12-24 17:43:19
Question: I am trying to figure out how to use the caseless version of the Stanford entity recognizer from NLTK. I downloaded http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip and placed it in the site-packages folder of Python. Then I downloaded http://nlp.stanford.edu/software/stanford-corenlp-caseless-2015-04-20-models.jar and placed it in the same folder. Then I ran this code in NLTK: from nltk.tag.stanford import NERTagger english_nertagger = NERTagger('/home/anaconda/lib/python2.7/site-packages…
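A hedged sketch using the newer `StanfordNERTagger` wrapper rather than the deprecated `NERTagger`: point `model_filename` at the caseless CRF classifier (it can be extracted from the caseless models jar, which is an ordinary zip archive, under `edu/stanford/nlp/models/ner/`) and `path_to_jar` at `stanford-ner.jar` from the NER download. All paths below are placeholders for wherever the downloads were unpacked.

```python
# Sketch: tag lower-cased text with the extracted caseless 3-class classifier.
from nltk.tag.stanford import StanfordNERTagger

st = StanfordNERTagger(
    model_filename="/path/to/english.all.3class.caseless.distsim.crf.ser.gz",
    path_to_jar="/path/to/stanford-ner-2015-04-20/stanford-ner.jar",
)

tokens = "barack obama visited new york last week".split()
print(st.tag(tokens))   # e.g. person/location tags even though the input is lower-cased
```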