nlp

Google NLP AutoML Java client: "The provided location ID is not valid"

独自空忆成欢 submitted on 2019-12-11 08:04:06
Question: I checked out the Java samples for GoogleCloudPlatform from GitHub. I am trying to run the AutoML NLP prediction example after successfully training my language model. I am able to perform predictions in the Google Cloud Console. Now I am trying to perform predictions from the Java client with this example: https://github.com/GoogleCloudPlatform/java-docs-samples/blob/master/language/automl/src/main/java/com/google/cloud/language/samples/PredictionApi.java I created a service account for my project, …
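This error almost always means the location passed to the client is not "us-central1", the only region where AutoML Natural Language models are hosted; the linked Java sample takes the region as an argument, and the same rule applies in any client. Below is a minimal sketch in Python (assuming the google-cloud-automl package; the project and model IDs are hypothetical) showing the required location segment:

    # A minimal sketch, not the sample's own code: google-cloud-automl (Python)
    # with hypothetical project/model IDs. AutoML NL models are served from
    # us-central1, so any other location ID raises this error.
    from google.cloud import automl_v1beta1 as automl

    client = automl.PredictionServiceClient()
    # The location segment must be "us-central1".
    name = "projects/{}/locations/us-central1/models/{}".format(
        "my-project", "TCN1234567890")  # hypothetical IDs
    payload = {"text_snippet": {"content": "text to classify",
                                "mime_type": "text/plain"}}
    print(client.predict(name, payload))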

Applying TfidfVectorizer to a list of POS tags gives ValueError

不想你离开。 submitted on 2019-12-11 07:52:05
Question: After preprocessing, I have lists of POS tags in a pandas column, as below. I want to vectorize these tags and generate a matrix using TfidfVectorizer or any other vectorizer.

dataset['text_posTagged']
['VBP', 'JJ', 'NNS', 'VBP', 'JJ', 'IN', 'PRP', 'VBP', 'TO', 'VB', 'PRP', 'RB', 'VBZ', 'DT', 'JJ', 'PRP$', 'NN', 'NN', 'NN', 'NN', 'VBZ', 'JJ']
['UH', 'DT', 'VB', 'VB', 'PRP$', 'NN', 'TO', 'JJ', 'IN', 'PRP', 'MD', 'VB', 'DT', 'VBZ', 'DT', 'NN', 'NN']
['NN', 'VBD', 'NN', 'NN', 'NN', 'DT', 'IN', 'IN…
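The ValueError comes from handing TfidfVectorizer lists of tokens when it expects an iterable of strings. A minimal sketch (with a two-row stand-in for the column above) showing the two usual fixes, joining the tags into strings or passing the lists through an identity analyzer:

    # A minimal sketch, assuming dataset['text_posTagged'] holds lists of tag
    # strings as shown above.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    dataset = pd.DataFrame({"text_posTagged": [
        ["VBP", "JJ", "NNS"],
        ["UH", "DT", "VB"],
    ]})

    # Option 1: join each tag list into one space-separated string.
    docs = dataset["text_posTagged"].apply(" ".join)
    tfidf = TfidfVectorizer(lowercase=False)   # keep tags like 'DT' intact
    matrix = tfidf.fit_transform(docs)

    # Option 2: feed the token lists directly via an identity analyzer
    # (also preserves tags like 'PRP$' that the default tokenizer would split).
    tfidf2 = TfidfVectorizer(analyzer=lambda tags: tags)
    matrix2 = tfidf2.fit_transform(dataset["text_posTagged"])
    print(matrix.shape, matrix2.shape)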

Keras: addition layer for embeddings/vectors?

瘦欲@ submitted on 2019-12-11 07:42:26
Question: I have 3 word embeddings:

embedding#1: [w11, w12, w13, w14]
embedding#2: [w21, w22, w23, w24]
embedding#3: [w31, w32, w33, w34]

Is there a way to get a fourth embedding by adding all three vectors, with the trainable weights from all of them, like:

embedding#4: [w11 + w21 + w31, w12 + w22 + w32, w13 + w23 + w33, w14 + w24 + w34]?

Is there a way to do this in a Keras layer? Problem: I want to learn word embeddings for the Indonesian language. I plan to do this by training a sequence…
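Keras has an Add layer that sums tensors elementwise, and gradients flow back into all three embedding tables, so all of them stay trainable. A minimal sketch with made-up sizes:

    # A minimal sketch with hypothetical sizes: three trainable Embedding
    # layers whose outputs are summed elementwise by keras.layers.Add.
    from keras.layers import Input, Embedding, Add
    from keras.models import Model

    vocab_size, dim = 10000, 4          # hypothetical values
    inp = Input(shape=(None,), dtype="int32")

    e1 = Embedding(vocab_size, dim)(inp)
    e2 = Embedding(vocab_size, dim)(inp)
    e3 = Embedding(vocab_size, dim)(inp)

    summed = Add()([e1, e2, e3])        # [w1i + w2i + w3i] per dimension
    model = Model(inp, summed)
    model.summary()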

Natural language processing with Elasticsearch

雨燕双飞 submitted on 2019-12-11 07:30:04
Question: I want to integrate search functionality into my website and am using Elasticsearch for it. If a user searches "Maruti Suzuki under 2 lac", it should find cars with the brand name "Maruti Suzuki" and a price under 2 lac. How can I achieve this? Types of searches:

maruti suzuki under 2 lac
maruti suzuki 20000km driven
cars 2015 year model
etc.

ES version: 5.4

Source: https://stackoverflow.com/questions/44187481/natural-language-processing-with-elastic-search
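Elasticsearch will not parse "under 2 lac" by itself; the usual pattern is a small query-understanding step that extracts constraints (price, km, year) with regexes and builds a bool query from them. A minimal sketch for the price case; the field names ("brand_model", "price") are hypothetical:

    # A minimal sketch of the two-step approach: a regex pass extracts the
    # price constraint, and the remaining text becomes a match query.
    import re

    def build_query(text):
        query = {"bool": {"must": []}}
        m = re.search(r"under\s+(\d+)\s*lac", text, re.I)
        if m:
            rupees = int(m.group(1)) * 100000          # 1 lac = 100,000
            query["bool"]["filter"] = [{"range": {"price": {"lte": rupees}}}]
            text = text[:m.start()] + text[m.end():]   # strip the price phrase
        query["bool"]["must"].append({"match": {"brand_model": text.strip()}})
        return {"query": query}

    print(build_query("Maruti Suzuki under 2 lac"))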

How to save TensorFlow's word2vec model to a text/binary file for later kNN use?

别等时光非礼了梦想. submitted on 2019-12-11 07:25:33
Question: I have trained a word2vec model in TensorFlow, but when I save the session it only outputs model.ckpt.data / .index / .meta files. I am thinking of implementing a kNN method for retrieving the nearest words. I saw answers that use gensim, but how can I save my TensorFlow word2vec model to .txt first?

Answer 1: Simply evaluate the embeddings matrix into a numpy array and write it to the file along with the resolved words. Sample code:

    vocabulary_size = 50000
    embedding_size = 128
    # Assume your word to…
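Continuing the answer, a minimal sketch of the write-out step. It assumes the variable names from the TensorFlow word2vec tutorial: embeddings is the embedding Variable, reverse_dictionary maps integer ids back to words, and session is the open session; the output uses the word2vec text format gensim can load:

    # A minimal sketch, assuming `session`, `embeddings` and
    # `reverse_dictionary` exist as in the TensorFlow word2vec tutorial.
    import numpy as np

    final_embeddings = session.run(embeddings)  # (vocabulary_size, embedding_size)

    with open("word2vec.txt", "w") as f:
        # gensim's word2vec text format starts with "<vocab_size> <dim>".
        f.write("%d %d\n" % final_embeddings.shape)
        for i, vec in enumerate(final_embeddings):
            word = reverse_dictionary[i]
            f.write(word + " " + " ".join("%.6f" % x for x in vec) + "\n")

    # Later: gensim.models.KeyedVectors.load_word2vec_format("word2vec.txt")
    # gives .most_similar() for nearest-word (kNN) queries.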

What does the embedding layer for a network look like?

99封情书 submitted on 2019-12-11 07:19:29
Question: I am just starting with text classification, and I am stuck at the embedding layer. If I have a batch of sequences encoded as integers corresponding to each word, what does the embedding layer look like? Are there neurons like in a normal neural layer? I've seen keras.layers.Embedding, but after reading the documentation I'm really confused about how it works. I can understand input_dim, but why is output_dim a 2D matrix? How many weights do I have in this embedding layer? I'm sorry if my…
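An Embedding layer is just a trainable lookup table of shape (input_dim, output_dim): each integer index selects one row, and there are no activations or biases, so the parameter count is input_dim × output_dim. A minimal sketch with made-up sizes:

    # A minimal sketch: the layer's only parameters are a lookup table of
    # shape (input_dim, output_dim), i.e. 5000 * 64 = 320,000 weights here.
    from keras.layers import Embedding, Input
    from keras.models import Model

    inp = Input(shape=(10,), dtype="int32")     # batch of sequences, length 10
    emb = Embedding(input_dim=5000, output_dim=64)(inp)
    model = Model(inp, emb)
    model.summary()   # output shape (None, 10, 64): one 64-dim row per token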

Gensim equivalent of training steps

江枫思渺然 submitted on 2019-12-11 07:01:35
Question: Does gensim Word2Vec have an option that is the equivalent of "training steps" in the TensorFlow word2vec example here: Word2Vec Basic? If not, what default value does gensim use? Is the gensim parameter iter related to training steps? The TensorFlow script includes this section:

    with tf.Session(graph=graph) as session:
        # We must initialize all variables before we use them.
        init.run()
        print('Initialized')
        average_loss = 0
        for step in xrange(num_steps):
            batch_inputs, batch_labels = generate…
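The closest gensim analogue is iter, which counts full passes (epochs) over the corpus rather than individual batches like num_steps; in gensim versions of this period the default is iter=5 (the parameter was later renamed epochs). A minimal sketch with a toy corpus:

    # A minimal sketch: `iter` is epochs over the corpus, not batch steps.
    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox"],
                 ["jumps", "over", "the", "lazy", "dog"]]

    model = Word2Vec(sentences, size=128, window=5, min_count=1, iter=5)
    print(model.wv.most_similar("fox", topn=2))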

What is the difference between TfidfVectorizer.fit_transform and tfidf.transform?

感情迁移 submitted on 2019-12-11 06:47:25
Question: In Tfidf.fit_transform we only use the parameter X and do not use y when fitting the data set. Is this right? We generate the tf-idf matrix from the training set only; we do not use y_train in fitting the model. Then how do we make predictions for the test data set?

Answer 1: https://datascience.stackexchange.com/a/12346/122 has a good explanation of why it's called fit(), transform() and fit_transform(). In short, fit(): fit the vectorizer/model to the training data and…
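tf-idf is unsupervised, so y is simply ignored when fitting the vectorizer; predictions come from a separate classifier fitted on the training matrix plus y_train, while the test text only gets transform() with the already-learned vocabulary and idf weights. A minimal sketch of the full pipeline:

    # A minimal sketch: fit_transform on train text (y ignored), a classifier
    # on (train matrix, y_train), and transform only on the test text.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    X_train = ["good movie", "bad movie", "great film"]
    y_train = [1, 0, 1]
    X_test = ["bad film"]

    tfidf = TfidfVectorizer()
    Xtr = tfidf.fit_transform(X_train)   # learns vocabulary + idf from train only
    clf = LogisticRegression().fit(Xtr, y_train)

    Xte = tfidf.transform(X_test)        # reuse the fitted vocabulary/idf
    print(clf.predict(Xte))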

Given a huge set of street names, what is the most efficient way to test whether a text contains one of the street names from the set?

限于喜欢 submitted on 2019-12-11 06:29:34
Question: I have an interesting problem that I need help with. I am currently working on a feature of my program and stumbled into these issues:

I have a huge list of street names in Indonesia (> 100k rows) stored in a database. Each street name may have more than one word; for example, "Sudirman", "Gatot Subroto", and "Jalan Asia Afrika" are all legitimate street names.

I have a bunch of texts (> 1 million rows) in databases, which I split into sentences.

Now, the feature (function, to be exact) that I need to…
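For this scale, the standard answer is the Aho-Corasick algorithm: build one automaton from all 100k names up front, then scan each sentence in a single pass regardless of how many names are in the set. A minimal sketch, assuming the pyahocorasick package is available:

    # A minimal sketch, assuming the pyahocorasick package: build the
    # automaton once, then each sentence is matched in one linear pass.
    import ahocorasick

    street_names = ["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"]  # ~100k in practice

    A = ahocorasick.Automaton()
    for name in street_names:
        A.add_word(name.lower(), name)   # lowercase key, original as payload
    A.make_automaton()

    sentence = "kemarin saya lewat jalan asia afrika"
    hits = [found for _, found in A.iter(sentence.lower())]
    print(hits)   # ['Jalan Asia Afrika']
    # Note: Aho-Corasick matches substrings; add a word-boundary check around
    # each hit if partial-word matches would be a problem.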

Java CFG parser that supports ambiguities

橙三吉。 submitted on 2019-12-11 06:16:14
Question: I'm looking for a CFG parser implemented in Java. The thing is, I'm trying to parse a natural language, and I need all possible parse trees (ambiguity), not only one of them. I have already researched many NLP parsers such as the Stanford parser, but they mostly require statistical data (a treebank, which I don't have), and adapting them to a new language is difficult and poorly documented. I found some parser generators such as ANTLR or JFlex, but I'm not sure that they can handle…
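What this calls for is a chart parser (Earley or CYK), which enumerates every tree of an ambiguous grammar instead of rejecting the conflict the way ANTLR-style generators do. For illustration, a minimal Python sketch with NLTK's ChartParser on the classic PP-attachment ambiguity; the concept carries over directly to any Java chart-parser implementation:

    # A minimal sketch: a chart parser returns EVERY parse of an ambiguous
    # sentence, here both attachments of "with a telescope".
    import nltk

    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> 'I' | Det N | NP PP
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'a' | 'the'
    N -> 'man' | 'telescope'
    V -> 'saw'
    P -> 'with'
    """)

    parser = nltk.ChartParser(grammar)
    sentence = "I saw the man with a telescope".split()
    for tree in parser.parse(sentence):   # yields both parse trees
        print(tree)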