nlp

Clustering Strings Based on Similar Word Sequences

大兔子大兔子 submitted on 2019-12-31 04:42:31

Question: I am looking for an efficient way to cluster about 10 million strings based on the appearance of similar word sequences. Consider a list of strings like:

    the fruit hut number one
    the ice cre am shop number one
    jim's taco
    ice cream shop in the corner
    the ice cream shop
    the fruit hut
    jim's taco outlet number one
    jim's t aco in the corner
    the fruit hut in the corner

After the algorithm runs on them, I want them clustered as follows:

    the ice cre am shop number one
    ice cream shop in …
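No algorithm is named in the excerpt; as one hedged illustration, strings can be grouped by word-overlap similarity (Jaccard over unigrams plus adjacent-word bigrams) in a single greedy pass. The helper names and the threshold value are my own assumptions, not from the question:

```python
# Hypothetical sketch: greedy clustering by word-overlap (Jaccard) similarity.
# The threshold and helper names are illustrative assumptions.

def word_features(s):
    """Unigrams plus adjacent-word bigrams of a string."""
    words = s.split()
    return set(words) | set(zip(words, words[1:]))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(strings, threshold=0.3):
    """Assign each string to the first cluster whose seed string is similar enough."""
    clusters = []  # each cluster is a list of strings; its first element is the seed
    for s in strings:
        feats = word_features(s)
        for c in clusters:
            if jaccard(feats, word_features(c[0])) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

For 10 million strings this O(n²)-worst-case pass is infeasible; in practice one would first hash each string's bigrams (e.g. MinHash/LSH) to generate candidate pairs and only compare within candidates.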

How to search for specific n-grams in a corpus using R

独自空忆成欢 submitted on 2019-12-31 03:57:08

Question: I'm looking for specific n-grams in a corpus. Let's say I want to find 'asset management' and 'historical yield' in a collection of documents. This is how I loaded the corpus:

    my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
                         readerControl = list(reader = readPDF))

I cleaned the corpus and did some basic calculations using document-term matrices. Now I want to look for particular expressions and put them in a data frame. This is what I use (thanks to phiver):

    ngrams <- c('asset …
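The excerpt cuts off before phiver's answer. As a language-agnostic illustration of the idea (count each phrase per document, then tabulate), here is a minimal Python sketch; the function and variable names are my own, and in R the same thing is usually done with a DocumentTermMatrix built from an n-gram tokenizer:

```python
import re
from collections import Counter

def phrase_counts(docs, phrases):
    """Count whole-phrase occurrences of each phrase in each document.

    docs: mapping of document name -> raw text; phrases: list of n-grams.
    """
    table = {p: Counter() for p in phrases}
    for name, text in docs.items():
        lowered = text.lower()
        for p in phrases:
            # \b anchors keep 'asset management' from matching inside longer tokens
            table[p][name] = len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
    return table

# Toy stand-ins for the PDF corpus (file names are assumptions):
docs = {
    "report.pdf": "Asset management improves asset management outcomes.",
    "notes.pdf": "The historical yield was flat.",
}
counts = phrase_counts(docs, ["asset management", "historical yield"])
```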

Python regex: tokenizing English contractions

谁都会走 submitted on 2019-12-31 01:59:06

Question: I am trying to parse strings so as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would be ["should", "n't"]. The nltk module does not seem to be up to the task, however, as

    "I wouldn't've done that."

tokenizes as

    ['I', "wouldn't", "'ve", 'done', 'that', '.']

whereas the desired tokenization of "wouldn't've" was

    ['would', "n't", "'ve"]

After examining common English contractions, I am trying to write a regex to …
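The asker's regex itself is cut off; one way such a pattern can look is below. The alternation order matters (the lookahead split must be tried before the plain-word branch), and the clitic list is an assumption that covers only common cases:

```python
import re

# Order matters: try the word-before-"n't" split first, then "n't" itself,
# then other clitics, then plain words, then punctuation.
CONTRACTION = re.compile(
    r"\w+(?=n't)"            # 'would' in "wouldn't"
    r"|n't"                  # the negation clitic
    r"|'(?:ve|ll|re|d|s|m)"  # other common clitics (assumed, incomplete list)
    r"|\w+"                  # ordinary words
    r"|[^\w\s]"              # punctuation
)

tokens = CONTRACTION.findall("I wouldn't've done that.")
```

Note this follows the NLTK convention for irregular cases too, e.g. "can't" splits as ['ca', "n't"].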

How to get phrase tags in Stanford CoreNLP?

假如想象 submitted on 2019-12-31 01:50:10

Question: How do I get the phrase tag corresponding to each word? For example, for the sentence "My dog also likes eating sausage." I can get a parse tree in Stanford NLP such as

    (ROOT
      (S
        (NP (PRP$ My) (NN dog))
        (ADVP (RB also))
        (VP (VBZ likes)
          (NP (JJ eating) (NN sausage)))
        (. .)))

From this I want the phrase tag corresponding to each word: (My - NP), (dog - NP), (also - ADVP), (likes - VP), ... Is there any method for this simple extraction of phrase tags? Please …
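The usual approach is to walk the parse tree and, for each leaf, take the label of its nearest phrase-level ancestor. The sketch below assumes the tree is available as nested (label, children) tuples, a stand-in for CoreNLP's actual Tree API, which is not reproduced here:

```python
# Phrase-level labels reported for leaves; "S" catches sentence-level punctuation.
PHRASE_LABELS = {"S", "NP", "VP", "PP", "ADVP", "ADJP"}

def phrase_tags(node, current="ROOT", out=None):
    """Pair each word with the label of its innermost phrase-level ancestor."""
    if out is None:
        out = []
    label, rest = node
    if isinstance(rest, str):              # POS leaf: (tag, word)
        out.append((rest, current))
    else:
        nxt = label if label in PHRASE_LABELS else current
        for child in rest:
            phrase_tags(child, nxt, out)
    return out

# The question's parse tree, hand-encoded as nested tuples:
tree = ("S", [
    ("NP", [("PRP$", "My"), ("NN", "dog")]),
    ("ADVP", [("RB", "also")]),
    ("VP", [("VBZ", "likes"),
            ("NP", [("JJ", "eating"), ("NN", "sausage")])]),
    (".", "."),
])
pairs = phrase_tags(tree)
```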

How to output NLTK chunks to file?

泪湿孤枕 submitted on 2019-12-31 00:03:49

Question: I have a Python script where I use the nltk library to parse, tokenize, tag, and chunk some, let's say, random text from the web. I need to format and write to a file the output of chunked1, chunked2, and chunked3, which have type class 'nltk.tree.Tree'. More specifically, I need to write only the lines that match the regular expressions chunkGram1, chunkGram2, and chunkGram3. How can I do that?

    #!/usr/bin/python2.7
    import nltk
    import re
    import codecs

    xstring = ["An electronic library (also …
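The script is cut off, but the general pattern is: iterate over the chunked tree, keep the subtrees whose label matches the chunk grammar's name, and write their words out. The sketch below uses plain (label, tokens) tuples in place of nltk.tree.Tree (with nltk the test would be isinstance(node, nltk.Tree) with node.label() and node.leaves()); the file name and labels are assumptions:

```python
import os
import tempfile

def write_chunks(chunked, wanted_label, path):
    """Write one line per matching chunk: the chunk's words joined by spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for node in chunked:
            # With nltk this test would be:
            #   isinstance(node, nltk.Tree) and node.label() == wanted_label
            if isinstance(node[1], list) and node[0] == wanted_label:
                words = " ".join(word for word, tag in node[1])
                f.write(words + "\n")

# Toy chunked sentence: chunks carry a token list, loose tokens are (word, tag).
chunked1 = [
    ("NP", [("An", "DT"), ("electronic", "JJ"), ("library", "NN")]),
    ("is", "VBZ"),
    ("NP", [("a", "DT"), ("collection", "NN")]),
]
path = os.path.join(tempfile.gettempdir(), "chunks_demo.txt")
write_chunks(chunked1, "NP", path)
```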

Perl - find and save in an associative array word and word context

此生再无相见时 submitted on 2019-12-30 11:15:16

Question: I have an array like this (this is just a small sample; the real data has 2000+ lines like it):

    @list = (
        "affaire,chose,question",
        "cause,chose,matière",
    );

I'd like to have this output, mapping each word to the other words it appears with:

    %te = (
        affaire  => ["chose", "question"],
        chose    => ["affaire", "question", "cause", "matière"],
        question => ["affaire", "chose"],
        cause    => ["chose", "matière"],
        matière  => ["cause", "chose"],
    );

I've created this script, but it doesn't work very well and I think it is too complicated.

    use Data::Dumper;
    @list = ( …
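The desired structure is a map from each word to its co-occurring words per line. Since the thread's Perl script is cut off, here is the same logic as a short Python sketch for illustration (a Perl version would build a hash of array references the same way):

```python
from collections import defaultdict

lines = ["affaire,chose,question", "cause,chose,matière"]

# word -> every other word seen on the same comma-separated line
context = defaultdict(set)
for line in lines:
    words = line.split(",")
    for w in words:
        context[w].update(x for x in words if x != w)
```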

Running .exe on Azure

久未见 submitted on 2019-12-30 07:58:08

Question: I have a Flask web app published on Azure. My project includes a 'senna-win32.exe' that takes input and produces output. My code for calling this .exe looks like this:

    senna_path = 'senna-win32.exe'
    p = subprocess.Popen(senna_path, stdout=subprocess.PIPE,
                         stdin=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout = p.communicate(input=bytes(userInput, 'utf-8'))[0]
    inList = stdout.decode()

It works on my local PC, but on Azure it raises no errors and simply does nothing.

Keras - how to get unnormalized logits instead of probabilities

时光怂恿深爱的人放手 submitted on 2019-12-30 07:06:12

Question: I am creating a model in Keras and want to compute my own metric (perplexity). This requires using the unnormalized probabilities (logits). However, the Keras model only returns the softmax probabilities:

    model = Sequential()
    model.add(embedding_layer)
    model.add(LSTM(n_hidden, return_sequences=False))
    model.add(Dropout(dropout_keep_prob))
    model.add(Dense(vocab_size))
    model.add(Activation('softmax'))

    optimizer = RMSprop(lr=self.lr)
    model.compile(optimizer=optimizer, loss='sparse_categorical …
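One common fix (in later Keras/TensorFlow versions) is to drop the final Activation('softmax') and pass from_logits=True to the loss, so the model's outputs stay unnormalized and the metric applies log-softmax itself. Whatever the framework plumbing, the arithmetic the metric needs is just this, sketched in plain Python:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def perplexity(logits_per_step, target_ids):
    """exp(mean negative log-likelihood of the target token at each step)."""
    nll = -sum(log_softmax(step)[t]
               for step, t in zip(logits_per_step, target_ids))
    return math.exp(nll / len(target_ids))
```

A uniform distribution over a vocabulary of size V gives perplexity exactly V, which makes a handy sanity check.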

Installing rasa on Windows

不羁岁月 submitted on 2019-12-30 05:28:06

Question: I am trying to install rasa on Windows 10. I have finished installing Python 3.6 and pip. When I run pip install rasa_nlu, I get the following error:

    c:\program files (x86)\python36-32\include\pyconfig.h(222): fatal error C1083: Cannot open include file: 'basetsd.h': No such file or directory
    error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\cl.exe' failed with exit status 2

I have tried most of the solutions, like reinstalling Microsoft …

Clustering from the cosine similarity values

…衆ロ難τιáo~ submitted on 2019-12-30 05:25:08

Question: I have extracted words from a set of URLs and calculated the cosine similarity between each URL's contents. I have also normalized the values to between 0 and 1 (using min-max). Now I need to cluster the URLs based on the cosine similarity values to find similar URLs. Which clustering algorithm will be most suitable? Please suggest a dynamic clustering method, since I may increase the number of URLs on demand, and that would be more natural. Please correct me if you feel I'm …
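For a growing collection, an incremental "leader" (sequential/threshold) clustering is one natural fit: each new URL joins the first existing cluster whose leader vector is similar enough, otherwise it starts a new cluster. A stdlib-only sketch, with the threshold and the term-count representation assumed for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity of two term-count dicts (in [0, 1] for non-negative counts)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def add_url(clusters, url, vector, threshold=0.5):
    """Incrementally place one URL; clusters is a list of (leader_vector, urls)."""
    for leader, urls in clusters:
        if cosine(vector, leader) >= threshold:
            urls.append(url)
            return clusters
    clusters.append((vector, [url]))
    return clusters

clusters = []
for url, vec in [
    ("a.com", {"ice": 2, "cream": 2}),
    ("b.com", {"ice": 1, "cream": 1, "shop": 1}),
    ("c.com", {"taco": 3}),
]:
    add_url(clusters, url, vec)
```

Results depend on insertion order and threshold, which is the usual trade-off of leader clustering against batch methods such as hierarchical clustering on the full similarity matrix.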