nlp

Google NLP AutoML Java client: "The provided location ID is not valid"

独自空忆成欢 submitted on 2019-12-11 08:04:06
Question: I checked out the Java samples for GoogleCloudPlatform from GitHub. I am trying to run the AutoML NLP prediction example after successfully training my language model. I am able to perform predictions in the Google Cloud Console. Now I am trying to perform predictions from the Java client with this example: https://github.com/GoogleCloudPlatform/java-docs-samples/blob/master/language/automl/src/main/java/com/google/cloud/language/samples/PredictionApi.java I created a service account for my project, …
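This error almost always means the location passed to the client is not "us-central1", the only region where AutoML Natural Language models are hosted; the linked Java sample takes the region as an argument, and the same rule applies in any client. Below is a minimal sketch in Python (assuming the google-cloud-automl package; the project and model IDs are hypothetical) showing the required location segment:

    # A minimal sketch, not the sample's own code: google-cloud-automl (Python)
    # with hypothetical project/model IDs. AutoML NL models are served from
    # us-central1, so any other location ID raises this error.
    from google.cloud import automl_v1beta1 as automl

    client = automl.PredictionServiceClient()
    # The location segment must be "us-central1".
    name = "projects/{}/locations/us-central1/models/{}".format(
        "my-project", "TCN1234567890")  # hypothetical IDs
    payload = {"text_snippet": {"content": "text to classify",
                                "mime_type": "text/plain"}}
    print(client.predict(name, payload))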

Applying TfidfVectorizer to a list of POS tags gives ValueError

不想你离开。 submitted on 2019-12-11 07:52:05
Question: After preprocessing, I have lists of POS tags in a pandas column, as below. I want to vectorize these tags and generate a matrix using TfidfVectorizer or any other vectorizer.

dataset['text_posTagged']
['VBP', 'JJ', 'NNS', 'VBP', 'JJ', 'IN', 'PRP', 'VBP', 'TO', 'VB', 'PRP', 'RB', 'VBZ', 'DT', 'JJ', 'PRP$', 'NN', 'NN', 'NN', 'NN', 'VBZ', 'JJ']
['UH', 'DT', 'VB', 'VB', 'PRP$', 'NN', 'TO', 'JJ', 'IN', 'PRP', 'MD', 'VB', 'DT', 'VBZ', 'DT', 'NN', 'NN']
['NN', 'VBD', 'NN', 'NN', 'NN', 'DT', 'IN', 'IN…
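The ValueError comes from handing TfidfVectorizer lists of tokens when it expects an iterable of strings. A minimal sketch (with a two-row stand-in for the column above) showing the two usual fixes, joining the tags into strings or passing the lists through an identity analyzer:

    # A minimal sketch, assuming dataset['text_posTagged'] holds lists of tag
    # strings as shown above.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    dataset = pd.DataFrame({"text_posTagged": [
        ["VBP", "JJ", "NNS"],
        ["UH", "DT", "VB"],
    ]})

    # Option 1: join each tag list into one space-separated string.
    docs = dataset["text_posTagged"].apply(" ".join)
    tfidf = TfidfVectorizer(lowercase=False)   # keep tags like 'DT' intact
    matrix = tfidf.fit_transform(docs)

    # Option 2: feed the token lists directly via an identity analyzer
    # (also preserves tags like 'PRP$' that the default tokenizer would split).
    tfidf2 = TfidfVectorizer(analyzer=lambda tags: tags)
    matrix2 = tfidf2.fit_transform(dataset["text_posTagged"])
    print(matrix.shape, matrix2.shape)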

Keras: addition layer for embeddings/vectors?

瘦欲@ submitted on 2019-12-11 07:42:26
Question: I have 3 word embeddings:

embedding#1: [w11, w12, w13, w14]
embedding#2: [w21, w22, w23, w24]
embedding#3: [w31, w32, w33, w34]

Is there a way to get a fourth embedding by adding all three vectors, with the trainable weights from all of them, like:

embedding#4: [w11 + w21 + w31, w12 + w22 + w32, w13 + w23 + w33, w14 + w24 + w34]?

Is there a way to do this in a Keras layer? Problem: I want to learn word embeddings for the Indonesian language. I plan to do this by training a sequence…
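Keras has an Add layer that sums tensors elementwise, and gradients flow back into all three embedding tables, so all of them stay trainable. A minimal sketch with made-up sizes:

    # A minimal sketch with hypothetical sizes: three trainable Embedding
    # layers whose outputs are summed elementwise by keras.layers.Add.
    from keras.layers import Input, Embedding, Add
    from keras.models import Model

    vocab_size, dim = 10000, 4          # hypothetical values
    inp = Input(shape=(None,), dtype="int32")

    e1 = Embedding(vocab_size, dim)(inp)
    e2 = Embedding(vocab_size, dim)(inp)
    e3 = Embedding(vocab_size, dim)(inp)

    summed = Add()([e1, e2, e3])        # [w1i + w2i + w3i] per dimension
    model = Model(inp, summed)
    model.summary()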

Natural language processing with Elasticsearch

雨燕双飞 submitted on 2019-12-11 07:30:04
Question: I want to integrate search functionality into my website and am using Elasticsearch for it. If a user searches "Maruti Suzuki under 2 lac", it should find cars with the brand name "Maruti Suzuki" and a price under 2 lac. How can I achieve this? Types of searches:

maruti suzuki under 2 lac
maruti suzuki 20000km driven
cars 2015 year model
etc.

ES version: 5.4

Source: https://stackoverflow.com/questions/44187481/natural-language-processing-with-elastic-search
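Elasticsearch will not parse "under 2 lac" by itself; the usual pattern is a small query-understanding step that extracts constraints (price, km, year) with regexes and builds a bool query from them. A minimal sketch for the price case; the field names ("brand_model", "price") are hypothetical:

    # A minimal sketch of the two-step approach: a regex pass extracts the
    # price constraint, and the remaining text becomes a match query.
    import re

    def build_query(text):
        query = {"bool": {"must": []}}
        m = re.search(r"under\s+(\d+)\s*lac", text, re.I)
        if m:
            rupees = int(m.group(1)) * 100000          # 1 lac = 100,000
            query["bool"]["filter"] = [{"range": {"price": {"lte": rupees}}}]
            text = text[:m.start()] + text[m.end():]   # strip the price phrase
        query["bool"]["must"].append({"match": {"brand_model": text.strip()}})
        return {"query": query}

    print(build_query("Maruti Suzuki under 2 lac"))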

How to save TensorFlow's word2vec model to a text/binary file for later kNN use?

别等时光非礼了梦想. submitted on 2019-12-11 07:25:33
Question: I have trained a word2vec model in TensorFlow, but when I save the session it only outputs model.ckpt.data / .index / .meta files. I am thinking of implementing a kNN method for retrieving the nearest words. I saw answers that use gensim, but how can I save my TensorFlow word2vec model to .txt first?

Answer 1: Simply evaluate the embeddings matrix into a numpy array and write it to the file along with the resolved words. Sample code:

    vocabulary_size = 50000
    embedding_size = 128
    # Assume your word to…
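Continuing the answer, a minimal sketch of the write-out step. It assumes the variable names from the TensorFlow word2vec tutorial: embeddings is the embedding Variable, reverse_dictionary maps integer ids back to words, and session is the open session; the output uses the word2vec text format gensim can load:

    # A minimal sketch, assuming `session`, `embeddings` and
    # `reverse_dictionary` exist as in the TensorFlow word2vec tutorial.
    import numpy as np

    final_embeddings = session.run(embeddings)  # (vocabulary_size, embedding_size)

    with open("word2vec.txt", "w") as f:
        # gensim's word2vec text format starts with "<vocab_size> <dim>".
        f.write("%d %d\n" % final_embeddings.shape)
        for i, vec in enumerate(final_embeddings):
            word = reverse_dictionary[i]
            f.write(word + " " + " ".join("%.6f" % x for x in vec) + "\n")

    # Later: gensim.models.KeyedVectors.load_word2vec_format("word2vec.txt")
    # gives .most_similar() for nearest-word (kNN) queries.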

What does the embedding layer for a network look like?

99封情书 submitted on 2019-12-11 07:19:29
Question: I am just starting with text classification, and I am stuck at the embedding layer. If I have a batch of sequences encoded as integers corresponding to each word, what does the embedding layer look like? Are there neurons like in a normal neural layer? I've seen keras.layers.Embedding, but after reading the documentation I'm really confused about how it works. I can understand input_dim, but why is output_dim a 2D matrix? How many weights do I have in this embedding layer? I'm sorry if my…
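An Embedding layer is just a trainable lookup table of shape (input_dim, output_dim): each integer index selects one row, and there are no activations or biases, so the parameter count is input_dim × output_dim. A minimal sketch with made-up sizes:

    # A minimal sketch: the layer's only parameters are a lookup table of
    # shape (input_dim, output_dim), i.e. 5000 * 64 = 320,000 weights here.
    from keras.layers import Embedding, Input
    from keras.models import Model

    inp = Input(shape=(10,), dtype="int32")     # batch of sequences, length 10
    emb = Embedding(input_dim=5000, output_dim=64)(inp)
    model = Model(inp, emb)
    model.summary()   # output shape (None, 10, 64): one 64-dim row per token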

Gensim equivalent of training steps

江枫思渺然 submitted on 2019-12-11 07:01:35
Question: Does gensim Word2Vec have an option that is the equivalent of "training steps" in the TensorFlow word2vec example here: Word2Vec Basic? If not, what default value does gensim use? Is the gensim parameter iter related to training steps? The TensorFlow script includes this section:

    with tf.Session(graph=graph) as session:
        # We must initialize all variables before we use them.
        init.run()
        print('Initialized')
        average_loss = 0
        for step in xrange(num_steps):
            batch_inputs, batch_labels = generate…
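The closest gensim analogue is iter, which counts full passes (epochs) over the corpus rather than individual batches like num_steps; in gensim versions of this period the default is iter=5 (the parameter was later renamed epochs). A minimal sketch with a toy corpus:

    # A minimal sketch: `iter` is epochs over the corpus, not batch steps.
    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox"],
                 ["jumps", "over", "the", "lazy", "dog"]]

    model = Word2Vec(sentences, size=128, window=5, min_count=1, iter=5)
    print(model.wv.most_similar("fox", topn=2))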

What is the difference between TfidfVectorizer.fit_transform and tfidf.transform?

感情迁移 submitted on 2019-12-11 06:47:25
Question: In Tfidf.fit_transform we only use the parameter X and do not use y when fitting the data set. Is this right? We generate the tf-idf matrix from the training set only; we do not use y_train in fitting the model. Then how do we make predictions for the test data set?

Answer 1: https://datascience.stackexchange.com/a/12346/122 has a good explanation of why it's called fit(), transform() and fit_transform(). In short, fit(): fit the vectorizer/model to the training data and…
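tf-idf is unsupervised, so y is simply ignored when fitting the vectorizer; predictions come from a separate classifier fitted on the training matrix plus y_train, while the test text only gets transform() with the already-learned vocabulary and idf weights. A minimal sketch of the full pipeline:

    # A minimal sketch: fit_transform on train text (y ignored), a classifier
    # on (train matrix, y_train), and transform only on the test text.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    X_train = ["good movie", "bad movie", "great film"]
    y_train = [1, 0, 1]
    X_test = ["bad film"]

    tfidf = TfidfVectorizer()
    Xtr = tfidf.fit_transform(X_train)   # learns vocabulary + idf from train only
    clf = LogisticRegression().fit(Xtr, y_train)

    Xte = tfidf.transform(X_test)        # reuse the fitted vocabulary/idf
    print(clf.predict(Xte))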

Given a huge set of street names, what is the most efficient way to test whether a text contains one of the street names from the set?

限于喜欢 submitted on 2019-12-11 06:29:34
Question: I have an interesting problem that I need help with. I am currently working on a feature of my program and stumbled into these issues:

I have a huge list of street names in Indonesia (> 100k rows) stored in a database. Each street name may have more than one word; for example, "Sudirman", "Gatot Subroto", and "Jalan Asia Afrika" are all legitimate street names.

I have a bunch of texts (> 1 million rows) in databases, which I split into sentences.

Now, the feature (function, to be exact) that I need to…
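For this scale, the standard answer is the Aho-Corasick algorithm: build one automaton from all 100k names up front, then scan each sentence in a single pass regardless of how many names are in the set. A minimal sketch, assuming the pyahocorasick package is available:

    # A minimal sketch, assuming the pyahocorasick package: build the
    # automaton once, then each sentence is matched in one linear pass.
    import ahocorasick

    street_names = ["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"]  # ~100k in practice

    A = ahocorasick.Automaton()
    for name in street_names:
        A.add_word(name.lower(), name)   # lowercase key, original as payload
    A.make_automaton()

    sentence = "kemarin saya lewat jalan asia afrika"
    hits = [found for _, found in A.iter(sentence.lower())]
    print(hits)   # ['Jalan Asia Afrika']
    # Note: Aho-Corasick matches substrings; add a word-boundary check around
    # each hit if partial-word matches would be a problem.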

Java CFG parser that supports ambiguities

橙三吉。 submitted on 2019-12-11 06:16:14
Question: I'm looking for a CFG parser implemented in Java. The thing is, I'm trying to parse a natural language, and I need all possible parse trees (ambiguity), not only one of them. I have already researched many NLP parsers such as the Stanford parser, but they mostly require statistical data (a treebank, which I don't have), and adapting them to a new language is difficult and poorly documented. I found some parser generators such as ANTLR or JFlex, but I'm not sure that they can handle…
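What this calls for is a chart parser (Earley or CYK), which enumerates every tree of an ambiguous grammar instead of rejecting the conflict the way ANTLR-style generators do. For illustration, a minimal Python sketch with NLTK's ChartParser on the classic PP-attachment ambiguity; the concept carries over directly to any Java chart-parser implementation:

    # A minimal sketch: a chart parser returns EVERY parse of an ambiguous
    # sentence, here both attachments of "with a telescope".
    import nltk

    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> 'I' | Det N | NP PP
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'a' | 'the'
    N -> 'man' | 'telescope'
    V -> 'saw'
    P -> 'with'
    """)

    parser = nltk.ChartParser(grammar)
    sentence = "I saw the man with a telescope".split()
    for tree in parser.parse(sentence):   # yields both parse trees
        print(tree)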