nlp

Clustering Strings Based on Similar Word Sequences

大兔子大兔子 submitted on 2019-12-31 04:42:31

Question: I am looking for an efficient way to cluster about 10 million strings based on the appearance of similar word sequences. Consider a list of strings like:

    the fruit hut number one
    the ice cre am shop number one
    jim's taco
    ice cream shop in the corner
    the ice cream shop
    the fruit hut
    jim's taco outlet number one
    jim's t aco in the corner
    the fruit hut in the corner

After the algorithm runs on them, I want them clustered as follows:

    the ice cre am shop number one
    ice cream shop in …
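No algorithm is named in the excerpt; as one hedged illustration, strings can be grouped by word-overlap similarity (Jaccard over unigrams plus adjacent-word bigrams) in a single greedy pass. The helper names and the threshold value are my own assumptions, not from the question:

```python
# Hypothetical sketch: greedy clustering by word-overlap (Jaccard) similarity.
# The threshold and helper names are illustrative assumptions.

def word_features(s):
    """Unigrams plus adjacent-word bigrams of a string."""
    words = s.split()
    return set(words) | set(zip(words, words[1:]))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(strings, threshold=0.3):
    """Assign each string to the first cluster whose seed string is similar enough."""
    clusters = []  # each cluster is a list of strings; its first element is the seed
    for s in strings:
        feats = word_features(s)
        for c in clusters:
            if jaccard(feats, word_features(c[0])) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

For 10 million strings this O(n²)-worst-case pass is infeasible; in practice one would first hash each string's bigrams (e.g. MinHash/LSH) to generate candidate pairs and only compare within candidates.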

How to search for specific n-grams in a corpus using R

独自空忆成欢 submitted on 2019-12-31 03:57:08

Question: I'm looking for specific n-grams in a corpus. Let's say I want to find 'asset management' and 'historical yield' in a collection of documents. This is how I loaded the corpus:

    my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
                         readerControl = list(reader = readPDF))

I cleaned the corpus and did some basic calculations using document-term matrices. Now I want to look for particular expressions and put them in a data frame. This is what I use (thanks to phiver):

    ngrams <- c('asset …
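The excerpt cuts off before phiver's answer. As a language-agnostic illustration of the idea (count each phrase per document, then tabulate), here is a minimal Python sketch; the function and variable names are my own, and in R the same thing is usually done with a DocumentTermMatrix built from an n-gram tokenizer:

```python
import re
from collections import Counter

def phrase_counts(docs, phrases):
    """Count whole-phrase occurrences of each phrase in each document.

    docs: mapping of document name -> raw text; phrases: list of n-grams.
    """
    table = {p: Counter() for p in phrases}
    for name, text in docs.items():
        lowered = text.lower()
        for p in phrases:
            # \b anchors keep 'asset management' from matching inside longer tokens
            table[p][name] = len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
    return table

# Toy stand-ins for the PDF corpus (file names are assumptions):
docs = {
    "report.pdf": "Asset management improves asset management outcomes.",
    "notes.pdf": "The historical yield was flat.",
}
counts = phrase_counts(docs, ["asset management", "historical yield"])
```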

Python regex: tokenizing English contractions

谁都会走 submitted on 2019-12-31 01:59:06

Question: I am trying to parse strings so as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would be ["should", "n't"]. The nltk module does not seem to be up to the task, however, as

    "I wouldn't've done that."

tokenizes as

    ['I', "wouldn't", "'ve", 'done', 'that', '.']

whereas the desired tokenization of "wouldn't've" was

    ['would', "n't", "'ve"]

After examining common English contractions, I am trying to write a regex to …
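The asker's regex itself is cut off; one way such a pattern can look is below. The alternation order matters (the lookahead split must be tried before the plain-word branch), and the clitic list is an assumption that covers only common cases:

```python
import re

# Order matters: try the word-before-"n't" split first, then "n't" itself,
# then other clitics, then plain words, then punctuation.
CONTRACTION = re.compile(
    r"\w+(?=n't)"            # 'would' in "wouldn't"
    r"|n't"                  # the negation clitic
    r"|'(?:ve|ll|re|d|s|m)"  # other common clitics (assumed, incomplete list)
    r"|\w+"                  # ordinary words
    r"|[^\w\s]"              # punctuation
)

tokens = CONTRACTION.findall("I wouldn't've done that.")
```

Note this follows the NLTK convention for irregular cases too, e.g. "can't" splits as ['ca', "n't"].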

How to get phrase tags in Stanford CoreNLP?

假如想象 submitted on 2019-12-31 01:50:10

Question: How do I get the phrase tag corresponding to each word? For example, for the sentence "My dog also likes eating sausage." I can get a parse tree in Stanford NLP such as

    (ROOT
      (S
        (NP (PRP$ My) (NN dog))
        (ADVP (RB also))
        (VP (VBZ likes)
          (NP (JJ eating) (NN sausage)))
        (. .)))

From this I want the phrase tag corresponding to each word: (My - NP), (dog - NP), (also - ADVP), (likes - VP), ... Is there any method for this simple extraction of phrase tags? Please …
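The usual approach is to walk the parse tree and, for each leaf, take the label of its nearest phrase-level ancestor. The sketch below assumes the tree is available as nested (label, children) tuples, a stand-in for CoreNLP's actual Tree API, which is not reproduced here:

```python
# Phrase-level labels reported for leaves; "S" catches sentence-level punctuation.
PHRASE_LABELS = {"S", "NP", "VP", "PP", "ADVP", "ADJP"}

def phrase_tags(node, current="ROOT", out=None):
    """Pair each word with the label of its innermost phrase-level ancestor."""
    if out is None:
        out = []
    label, rest = node
    if isinstance(rest, str):              # POS leaf: (tag, word)
        out.append((rest, current))
    else:
        nxt = label if label in PHRASE_LABELS else current
        for child in rest:
            phrase_tags(child, nxt, out)
    return out

# The question's parse tree, hand-encoded as nested tuples:
tree = ("S", [
    ("NP", [("PRP$", "My"), ("NN", "dog")]),
    ("ADVP", [("RB", "also")]),
    ("VP", [("VBZ", "likes"),
            ("NP", [("JJ", "eating"), ("NN", "sausage")])]),
    (".", "."),
])
pairs = phrase_tags(tree)
```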

How to output NLTK chunks to file?

泪湿孤枕 submitted on 2019-12-31 00:03:49

Question: I have a Python script where I use the nltk library to parse, tokenize, tag, and chunk some, let's say, random text from the web. I need to format and write to a file the output of chunked1, chunked2, and chunked3, which have type class 'nltk.tree.Tree'. More specifically, I need to write only the lines that match the regular expressions chunkGram1, chunkGram2, and chunkGram3. How can I do that?

    #!/usr/bin/python2.7
    import nltk
    import re
    import codecs

    xstring = ["An electronic library (also …
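The script is cut off, but the general pattern is: iterate over the chunked tree, keep the subtrees whose label matches the chunk grammar's name, and write their words out. The sketch below uses plain (label, tokens) tuples in place of nltk.tree.Tree (with nltk the test would be isinstance(node, nltk.Tree) with node.label() and node.leaves()); the file name and labels are assumptions:

```python
import os
import tempfile

def write_chunks(chunked, wanted_label, path):
    """Write one line per matching chunk: the chunk's words joined by spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for node in chunked:
            # With nltk this test would be:
            #   isinstance(node, nltk.Tree) and node.label() == wanted_label
            if isinstance(node[1], list) and node[0] == wanted_label:
                words = " ".join(word for word, tag in node[1])
                f.write(words + "\n")

# Toy chunked sentence: chunks carry a token list, loose tokens are (word, tag).
chunked1 = [
    ("NP", [("An", "DT"), ("electronic", "JJ"), ("library", "NN")]),
    ("is", "VBZ"),
    ("NP", [("a", "DT"), ("collection", "NN")]),
]
path = os.path.join(tempfile.gettempdir(), "chunks_demo.txt")
write_chunks(chunked1, "NP", path)
```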

Perl - find and save in an associative array word and word context

此生再无相见时 submitted on 2019-12-30 11:15:16

Question: I have an array like this (this is just a small sample; the real data has 2000+ lines like it):

    @list = (
        "affaire,chose,question",
        "cause,chose,matière",
    );

I'd like to have this output, mapping each word to the other words it appears with:

    %te = (
        affaire  => ["chose", "question"],
        chose    => ["affaire", "question", "cause", "matière"],
        question => ["affaire", "chose"],
        cause    => ["chose", "matière"],
        matière  => ["cause", "chose"],
    );

I've created this script, but it doesn't work very well and I think it is too complicated.

    use Data::Dumper;
    @list = ( …
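The desired structure is a map from each word to its co-occurring words per line. Since the thread's Perl script is cut off, here is the same logic as a short Python sketch for illustration (a Perl version would build a hash of array references the same way):

```python
from collections import defaultdict

lines = ["affaire,chose,question", "cause,chose,matière"]

# word -> every other word seen on the same comma-separated line
context = defaultdict(set)
for line in lines:
    words = line.split(",")
    for w in words:
        context[w].update(x for x in words if x != w)
```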

Running .exe on Azure

久未见 submitted on 2019-12-30 07:58:08

Question: I have a Flask web app published on Azure. My project includes a 'senna-win32.exe' that takes input and produces output. My code for calling this .exe looks like this:

    senna_path = 'senna-win32.exe'
    p = subprocess.Popen(senna_path, stdout=subprocess.PIPE,
                         stdin=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout = p.communicate(input=bytes(userInput, 'utf-8'))[0]
    inList = stdout.decode()

It works on my local PC, but on Azure it raises no errors and simply does nothing.

Keras - how to get unnormalized logits instead of probabilities

时光怂恿深爱的人放手 submitted on 2019-12-30 07:06:12

Question: I am creating a model in Keras and want to compute my own metric (perplexity). This requires using the unnormalized probabilities (logits). However, the Keras model only returns the softmax probabilities:

    model = Sequential()
    model.add(embedding_layer)
    model.add(LSTM(n_hidden, return_sequences=False))
    model.add(Dropout(dropout_keep_prob))
    model.add(Dense(vocab_size))
    model.add(Activation('softmax'))

    optimizer = RMSprop(lr=self.lr)
    model.compile(optimizer=optimizer, loss='sparse_categorical …
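One common fix (in later Keras/TensorFlow versions) is to drop the final Activation('softmax') and pass from_logits=True to the loss, so the model's outputs stay unnormalized and the metric applies log-softmax itself. Whatever the framework plumbing, the arithmetic the metric needs is just this, sketched in plain Python:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def perplexity(logits_per_step, target_ids):
    """exp(mean negative log-likelihood of the target token at each step)."""
    nll = -sum(log_softmax(step)[t]
               for step, t in zip(logits_per_step, target_ids))
    return math.exp(nll / len(target_ids))
```

A uniform distribution over a vocabulary of size V gives perplexity exactly V, which makes a handy sanity check.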

Installing rasa on Windows

不羁岁月 submitted on 2019-12-30 05:28:06

Question: I am trying to install rasa on Windows 10. I have finished installing Python 3.6 and pip. When I run pip install rasa_nlu, I get the following error:

    c:\program files (x86)\python36-32\include\pyconfig.h(222): fatal error C1083: Cannot open include file: 'basetsd.h': No such file or directory
    error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\cl.exe' failed with exit status 2

I have tried most of the solutions, like reinstalling Microsoft …

Clustering from the cosine similarity values

…衆ロ難τιáo~ submitted on 2019-12-30 05:25:08

Question: I have extracted words from a set of URLs and calculated the cosine similarity between each URL's contents. I have also normalized the values to between 0 and 1 (using min-max). Now I need to cluster the URLs based on the cosine similarity values to find similar URLs. Which clustering algorithm will be most suitable? Please suggest a dynamic clustering method, since I may increase the number of URLs on demand, and that would be more natural. Please correct me if you feel I'm …
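For a growing collection, an incremental "leader" (sequential/threshold) clustering is one natural fit: each new URL joins the first existing cluster whose leader vector is similar enough, otherwise it starts a new cluster. A stdlib-only sketch, with the threshold and the term-count representation assumed for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity of two term-count dicts (in [0, 1] for non-negative counts)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def add_url(clusters, url, vector, threshold=0.5):
    """Incrementally place one URL; clusters is a list of (leader_vector, urls)."""
    for leader, urls in clusters:
        if cosine(vector, leader) >= threshold:
            urls.append(url)
            return clusters
    clusters.append((vector, [url]))
    return clusters

clusters = []
for url, vec in [
    ("a.com", {"ice": 2, "cream": 2}),
    ("b.com", {"ice": 1, "cream": 1, "shop": 1}),
    ("c.com", {"taco": 3}),
]:
    add_url(clusters, url, vec)
```

Results depend on insertion order and threshold, which is the usual trade-off of leader clustering against batch methods such as hierarchical clustering on the full similarity matrix.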