nlp

Approaches to improve Microsoft ChatBot with each user conversation by learning from it?

浪子不回头ぞ Submitted on 2019-12-24 20:46:23
Question: I am building a Microsoft ChatBot using LUIS for natural language processing. I would like LUIS to improve by learning new utterances for the intents it identifies. For example, if my 'Greeting' intent has the utterances 'Hi', 'Hello' and 'Hello, how are you?', the next time it encounters 'How are you?' it may predict the intent as 'Greeting' with low confidence. If that utterance is learnt as part of the intent, then in future this utterance will be predicted with better accuracy and also help us in…
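One possible direction (a hedged sketch, not an official LUIS feature walkthrough): when the runtime returns a low-confidence prediction and a human confirms the intent, the utterance can be added back to the app through the LUIS Authoring REST API and the app retrained. The `/example` and `/train` routes, the region, key, IDs and threshold below are assumptions about the v2.0 authoring API; verify them against the current LUIS documentation.

```python
# Hedged sketch: feed a low-confidence utterance back into its confirmed intent via the
# LUIS v2.0 Authoring REST API, then retrain. Route and payload shape are assumptions.
import requests

AUTHORING_KEY = "<your-authoring-key>"   # assumption: replace with a real authoring key
REGION = "westus"                        # assumption: your authoring region
APP_ID = "<app-id>"
VERSION = "0.1"

BASE = f"https://{REGION}.api.cognitive.microsoft.com/luis/api/v2.0/apps/{APP_ID}/versions/{VERSION}"
HEADERS = {"Ocp-Apim-Subscription-Key": AUTHORING_KEY, "Content-Type": "application/json"}

def add_utterance_if_low_confidence(text, confirmed_intent, score, threshold=0.6):
    """If the runtime prediction was low-confidence, label the utterance and retrain."""
    if score >= threshold:
        return  # the model is already confident; nothing to learn
    # Label the new utterance under the intent a human confirmed.
    requests.post(f"{BASE}/example", headers=HEADERS,
                  json={"text": text, "intentName": confirmed_intent}).raise_for_status()
    # Kick off training so the new example is used by the next published version.
    requests.post(f"{BASE}/train", headers=HEADERS).raise_for_status()

add_utterance_if_low_confidence("How are you?", "Greeting", score=0.42)
```

LUIS also surfaces low-confidence endpoint utterances for review in the portal (active learning), which gives the same feedback loop without writing code.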

Splitting and grouping plain text (grouping text by chapter in dataframe)?

强颜欢笑 Submitted on 2019-12-24 20:45:28
Question: I have a data frame/tibble into which I've imported a file of plain text (txt). The text is very consistent and is grouped by chapter. Sometimes the chapter text is only one row, sometimes it spans multiple rows. The data is in one column, like this:
# A tibble: 10,708 x 1
   x <chr>
 1 "Chapter 1 "
 2 "Chapter text. "
 3 "Chapter 2 "
 4 "Chapter text. "
 5 "Chapter 3 "
 6 "Chapter text. "
 7 "Chapter text. "
 8 "Chapter 4 "
I'm trying to clean the data to have a new column for Chapter and the text from each chapter in…
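The question is about R/tidyverse, but the grouping logic is language-independent; here is a minimal sketch of the same idea in Python/pandas (the `Chapter \d+` pattern and column names are assumptions based on the sample above): flag the header rows, fill the chapter label downwards, then collapse the remaining rows per chapter.

```python
# Sketch: rows that start a chapter open a new group; every other row is text that
# belongs to the most recent chapter header above it.
import pandas as pd

df = pd.DataFrame({"x": [
    "Chapter 1 ", "Chapter text. ",
    "Chapter 2 ", "Chapter text. ",
    "Chapter 3 ", "Chapter text. ", "Chapter text. ",
    "Chapter 4 ",
]})

is_header = df["x"].str.match(r"^Chapter \d+\s*$")            # lines that begin a chapter
df["chapter"] = df["x"].where(is_header).ffill().str.strip()  # carry the header downwards
body = df[~is_header]                                          # keep only the text rows

result = (body.groupby("chapter", sort=False)["x"]
              .apply(lambda s: " ".join(s.str.strip()))        # join multi-row chapters
              .reset_index(name="text"))
print(result)
```

Chapters with no text rows (like "Chapter 4" in the sample) simply drop out of the grouped result; in R the same approach is a cumulative fill of the header column followed by a grouped paste.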

Using nlp.pipe() with pre-segmented and pre-tokenized text with spaCy

倾然丶 夕夏残阳落幕 Submitted on 2019-12-24 19:39:01
Question: I am trying to tag and parse text that has already been split into sentences and has already been tokenized. As an example: sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']] The fastest way to process batches of text is .pipe(), but it is not clear to me how I can use it with pre-tokenized and pre-segmented text. Performance is key here. I tried the following, but it threw an error: docs = [nlp.tokenizer.tokens_from_list(sentence) for sentence in sents] nlp.tagger…
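A minimal sketch of the usual workaround for the spaCy 2.x era this question comes from: build `Doc` objects directly from the pre-split token lists (so the tokenizer and sentence segmentation are skipped entirely) and run the remaining pipeline components on them. The model name is an assumption.

```python
# Sketch: construct Docs from pre-tokenized sentences, then apply the pipeline
# components (tagger, parser, ner, ...) without re-tokenizing.
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

docs = [Doc(nlp.vocab, words=words) for words in sents]  # one Doc per pre-split sentence

for name, proc in nlp.pipeline:          # run each component in pipeline order
    docs = [proc(doc) for doc in docs]

for doc in docs:
    print([(t.text, t.pos_, t.dep_) for t in doc])
```

Because each pre-segmented sentence becomes its own `Doc`, sentence boundaries are preserved exactly as given.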

Extract Graph from DBpedia, by number of HOPS, Direction

我的梦境 Submitted on 2019-12-24 18:58:15
Question: In the graph above (which is in DBpedia), I want to extract information about TIM COOK by number of hops. If the hop count is 1, I need all the first-level information about TIM COOK, like Masters, APPLE, Car. If the hop count is 2, I need Masters, APPLE, Car, United States. Is there any way I can extract such a graph? I would also like to pass a direction (incoming, outgoing) when extracting it. Could you please help me with a SPARQL query? Source: https://stackoverflow.com/questions/54774104/extract-graph-from
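A hedged sketch of one way to do this from Python with SPARQLWrapper against the public DBpedia endpoint: hop 1 is a single triple pattern around `dbr:Tim_Cook`, hop 2 chains two patterns, and direction is simply which side of the triple the resource sits on. The `LIMIT`, the fixed 2-hop depth and the outgoing/incoming split are assumptions to keep the query small.

```python
# Sketch: pull a 1-hop and 2-hop neighbourhood of dbr:Tim_Cook from DBpedia.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")

QUERY = """
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?s ?p ?o WHERE {
  {   # hop 1, outgoing edges
      dbr:Tim_Cook ?p ?o .
      BIND(dbr:Tim_Cook AS ?s)
  } UNION {
      # hop 2, outgoing: edges of the 1-hop neighbours
      dbr:Tim_Cook ?p1 ?s .
      ?s ?p ?o .
  } UNION {
      # incoming edges (direction reversed)
      ?s ?p dbr:Tim_Cook .
      BIND(dbr:Tim_Cook AS ?o)
  }
}
LIMIT 500
"""

sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```

For deeper hop counts the same pattern can be generated programmatically, adding one chained triple pattern per extra hop.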

How to standardize the bag of words for train and test?

此生再无相见时 Submitted on 2019-12-24 18:57:00
Question: I am trying to classify text based on the bag-of-words model from NLP. I pre-processed the training data using NLTK (punctuation and stop-word removal, lowercasing, stemming, etc.) and created a tf-idf matrix for the training set. I then pre-processed the test data and created a tf-idf matrix for it. The train and test data have different bags of words, so the number of features differs and we cannot use a classification algorithm like kNN. I merged the train and test data together and created a single tf-idf matrix. This solved the…
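The standard fix is to fit the vectorizer on the training texts only and reuse that fitted vocabulary to transform the test texts, so both matrices share exactly the same columns; merging train and test before fitting leaks test vocabulary into training. A minimal scikit-learn sketch (toy data, kNN as in the question):

```python
# Sketch: one vocabulary, learned from train only, applied to both train and test.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["the cat sat on the mat", "dogs are great pets"]
train_labels = [0, 1]
test_texts = ["a cat and a dog"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)   # learns the vocabulary from train only
X_test = vectorizer.transform(test_texts)         # words unseen in training are ignored

assert X_train.shape[1] == X_test.shape[1]        # identical feature space

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)
print(clf.predict(X_test))
```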

Keras: Input layer and passing input data correctly

大兔子大兔子 Submitted on 2019-12-24 18:46:57
Question: I am learning to use the Keras functional API and I have managed to build and compile a model. But when I call model.fit, passing the data X and labels y, I get an error. It seems I still haven't got the idea of how it works. The task is classifying sentences into 6 types, and the code goes: X_ = ... # shape: (2787, 100), each row a sentence and each column a feature y_ = ... # shape: (2787,) word_matrix_weights = ... # code to initiate a lookup matrix for vocabulary embeddings, shape: (9825, 300)…
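A hedged sketch of a functional-API model consistent with the shapes quoted above ((2787, 100) integer inputs, (2787,) integer labels in 0..5, a (9825, 300) embedding matrix). The pooling layer, optimizer and random stand-in data are placeholders; the usual pitfall is pairing a 6-way softmax with integer labels, which needs either one-hot labels or `sparse_categorical_crossentropy` as below.

```python
# Sketch of an Input -> Embedding -> pooling -> Dense(6, softmax) model in tf.keras.
import numpy as np
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.initializers import Constant
from tensorflow.keras.models import Model

X_ = np.random.randint(0, 9825, size=(2787, 100))       # stand-in word-index data
y_ = np.random.randint(0, 6, size=(2787,))               # stand-in integer class ids
word_matrix_weights = np.random.rand(9825, 300).astype("float32")

inputs = Input(shape=(100,), dtype="int32")               # one row of X_ per sample
x = Embedding(9825, 300,
              embeddings_initializer=Constant(word_matrix_weights),
              trainable=False)(inputs)                    # pretrained lookup matrix
x = GlobalAveragePooling1D()(x)                           # simple sentence encoding
outputs = Dense(6, activation="softmax")(x)               # 6 sentence types

model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",     # integer labels, no one-hot needed
              metrics=["accuracy"])
model.fit(X_, y_, batch_size=32, epochs=2, validation_split=0.1)
```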

Gensim doc2vec file stream training worse performance

大兔子大兔子 Submitted on 2019-12-24 18:38:55
Question: Recently I switched to gensim 3.6, mainly for the optimized training process, which streams the training data directly from a file and thus avoids the GIL performance penalties. This is how I used to train my doc2vec: training_iterations = 20 d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0) d2v.build_vocab(corpus) for epoch in range(training_iterations): d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.iter) d2v.alpha -= 0.0002 d2v…
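A hedged sketch of the file-streaming path in gensim >= 3.6: write the corpus in LineSentence format (one pre-tokenized document per line, tokens separated by spaces), pass it as `corpus_file`, and let a single constructor call with `epochs=20` manage the learning-rate schedule. The manual train loop with hand-decayed `alpha` shown above is a common source of degraded quality after the switch. The file name is a placeholder.

```python
# Sketch: one-shot file-streamed Doc2Vec training instead of a manual epoch loop.
from multiprocessing import cpu_count
from gensim.models.doc2vec import Doc2Vec

d2v = Doc2Vec(
    corpus_file="corpus.txt",   # assumption: space-separated tokens, one document per line
    vector_size=200,
    dm=0,
    epochs=20,                  # gensim runs all epochs and decays alpha internally
    alpha=0.025,
    min_alpha=0.0001,
    workers=cpu_count(),
)
d2v.save("d2v_filestream.model")
```

Note that in `corpus_file` mode the document tags are simply line numbers, so keep an external mapping if the documents need named tags.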

Using Word2Vec for polysemy solving problems

旧街凉风 Submitted on 2019-12-24 17:50:15
Question: I have some questions about Word2Vec: (1) What determines the dimension of the resulting model vectors? (2) What are the elements of these vectors? (3) Can I use Word2Vec for polysemy problems (state = administrative unit vs. state = condition), if I already have texts for every meaning of the words? Answer 1: (1) You pick the desired dimensionality as a meta-parameter of the model. Rigorous projects with enough time may try different sizes to see what works best for their qualitative evaluations. (2) Individual…
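For question (3), a common workaround (a sketch, not a built-in Word2Vec feature) is to exploit the fact that the texts are already separated by meaning: rename the ambiguous token per sense before training, so each sense gets its own vector. This assumes gensim 4.x (`vector_size`, `epochs`; older releases use `size`/`iter`), and the sense suffixes and toy corpus are made up for illustration.

```python
# Sketch: pre-tag word senses in the corpus so Word2Vec learns one vector per sense.
from gensim.models import Word2Vec

admin_texts = [["the", "state_GOV", "of", "california", "passed", "a", "law"]]
condition_texts = [["the", "patient", "was", "in", "a", "critical", "state_COND"]]

model = Word2Vec(
    sentences=admin_texts * 50 + condition_texts * 50,  # toy corpus, repeated for the demo
    vector_size=100,   # (1) you choose the dimensionality as a hyperparameter
    window=5,
    min_count=1,
    sg=1,
    epochs=20,
)

# (2) the vector elements are just learned coordinates with no individual meaning
print(model.wv["state_GOV"][:5])
print(model.wv.similarity("state_GOV", "state_COND"))    # the two senses are separate tokens
```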

Inference with tensorflow checkpoints

梦想的初衷 Submitted on 2019-12-24 17:48:00
Question: I am feeding characters (x_train) to the RNN model defined in example 13 of this link. Here is the code corresponding to the model definition, input pre-processing and training: def char_rnn_model(features, target): """Character level recurrent neural network model to predict classes.""" target = tf.one_hot(target, 15, 1, 0) #byte_list = tf.one_hot(features, 256, 1, 0) byte_list = tf.cast(tf.one_hot(features, 256, 1, 0), dtype=tf.float32) byte_list = tf.unstack(byte_list, axis=1) cell = tf…
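A generic TF 1.x sketch of checkpoint inference, not the exact wiring of the linked example: rebuild a graph whose architecture and variable names mirror `char_rnn_model`, restore the latest checkpoint from the training directory, and run the prediction op on freshly encoded characters. `model_dir`, the hidden size and `MAX_DOCUMENT_LENGTH` are placeholders that must match whatever the training script used.

```python
# Sketch (TensorFlow 1.x): rebuild the model graph, restore trained weights, predict.
import numpy as np
import tensorflow as tf

MAX_DOCUMENT_LENGTH = 100   # assumption: same sequence length as at training time

def char_rnn_inference(features, n_classes=15, hidden_size=64):
    """Must mirror char_rnn_model so the checkpoint variables line up by name/shape."""
    byte_list = tf.cast(tf.one_hot(features, 256, 1, 0), tf.float32)
    byte_list = tf.unstack(byte_list, axis=1)          # one tensor per character position
    cell = tf.nn.rnn_cell.GRUCell(hidden_size)
    _, state = tf.nn.static_rnn(cell, byte_list, dtype=tf.float32)
    return tf.layers.dense(state, n_classes)           # class logits

features = tf.placeholder(tf.int32, [None, MAX_DOCUMENT_LENGTH])
logits = char_rnn_inference(features)
predicted_class = tf.argmax(logits, axis=1)

saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint("model_dir"))   # training output dir
    x_new = np.zeros((4, MAX_DOCUMENT_LENGTH), dtype=np.int32)     # stand-in encoded chars
    print(sess.run(predicted_class, feed_dict={features: x_new}))
```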

Stanford Entity Recognizer (caseless) in Python Nltk

断了今生、忘了曾经 Submitted on 2019-12-24 17:43:19
Question: I am trying to figure out how to use the caseless version of the Stanford entity recognizer from NLTK. I downloaded http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip and placed it in the site-packages folder of Python. Then I downloaded http://nlp.stanford.edu/software/stanford-corenlp-caseless-2015-04-20-models.jar and placed it in the same folder. Then I ran this code in NLTK: from nltk.tag.stanford import NERTagger english_nertagger = NERTagger('/home/anaconda/lib/python2.7/site-packages…
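A hedged sketch using the newer `StanfordNERTagger` wrapper rather than the deprecated `NERTagger`: point `model_filename` at the caseless CRF classifier (it can be extracted from the caseless models jar, which is an ordinary zip archive, under `edu/stanford/nlp/models/ner/`) and `path_to_jar` at `stanford-ner.jar` from the NER download. All paths below are placeholders for wherever the downloads were unpacked.

```python
# Sketch: tag lower-cased text with the extracted caseless 3-class classifier.
from nltk.tag.stanford import StanfordNERTagger

st = StanfordNERTagger(
    model_filename="/path/to/english.all.3class.caseless.distsim.crf.ser.gz",
    path_to_jar="/path/to/stanford-ner-2015-04-20/stanford-ner.jar",
)

tokens = "barack obama visited new york last week".split()
print(st.tag(tokens))   # e.g. person/location tags even though the input is lower-cased
```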