nlp

Chunking English words into graphemes corresponding to distinct sounds

旧时模样 submitted on 2019-12-13 02:46:53
Question: How do I convert an English input word into a combination of graphemes? Is there a library or function that does the job? What I'm looking for is an algorithm/implementation that splits orthographic words into segments which map to phonemes; that is, the sequence of letters in a word should be broken up between distinct sounds. To my mind, this would look something like the following: physically --> ph-y-s-i-c-a-ll-y, psychology --> ps-y-ch-o-l-o-g-y, thrush --> th-r-u-sh, bought --> b-ough-t, chew --> …
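
A minimal rule-based sketch of what such a splitter could look like, assuming a small hand-made list of multi-letter graphemes. The list below is only an illustrative subset, and a purely orthographic matcher cannot resolve genuinely ambiguous cases (e.g. "th" in "hothouse"); tools backed by a pronouncing dictionary or grapheme-to-phoneme alignment do better.

```python
# Greedy longest-match chunker over a small, hand-picked grapheme inventory.
GRAPHEMES = ["ough", "tch", "igh", "ch", "ck", "gh", "ll", "ng",
             "ph", "ps", "qu", "sh", "th", "wh", "ew"]

def chunk(word):
    chunks, i = [], 0
    while i < len(word):
        # try multi-letter graphemes first, longest match wins
        for g in sorted(GRAPHEMES, key=len, reverse=True):
            if word.startswith(g, i):
                chunks.append(g)
                i += len(g)
                break
        else:
            # no known multi-letter grapheme: emit a single letter
            chunks.append(word[i])
            i += 1
    return "-".join(chunks)

for w in ["physically", "psychology", "thrush", "bought", "chew"]:
    print(w, "->", chunk(w))
# physically -> ph-y-s-i-c-a-ll-y, psychology -> ps-y-ch-o-l-o-g-y,
# thrush -> th-r-u-sh, bought -> b-ough-t, chew -> ch-ew
```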

Training a CNN with pre-trained word embeddings is very slow (TensorFlow)

梦想与她 submitted on 2019-12-13 01:51:05
Question: I'm using TensorFlow (0.6) to train a CNN on text data. I'm using a method similar to the second option specified in this SO thread (with the exception that the embeddings are trainable). My dataset is pretty small and the vocabulary is around 12,000 words. When I train with random word embeddings everything works nicely. However, when I switch to the pre-trained embeddings from the word2vec site, the vocabulary grows to over 3,000,000 words and training iterations become over 100 times …
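
One common workaround, sketched below, is to keep the ~12,000-word dataset vocabulary and copy over only the pre-trained vectors for those words, instead of loading all ~3M word2vec vectors into the graph. This assumes gensim is available, that the Google News binary file has been downloaded, and that `vocab` is the word-to-index mapping from your own preprocessing.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumes the word2vec-site binary has been downloaded locally.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

def build_embedding_matrix(vocab, dim=300):
    """Embedding rows only for words in the dataset vocabulary."""
    # random init for out-of-vocabulary words
    matrix = np.random.uniform(-0.25, 0.25, (len(vocab), dim)).astype("float32")
    for word, idx in vocab.items():
        if word in w2v:
            matrix[idx] = w2v[word]
    return matrix

# e.g. feed the matrix into the (trainable) embedding variable:
# embedding = tf.Variable(build_embedding_matrix(vocab), name="embedding")
```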

How can I deal with sparse, high-dimensional features in an SVR task?

妖精的绣舞 submitted on 2019-12-13 01:23:00
Question: I have a Twitter-like (another microblog) data set with 1.6 million data points and am trying to predict each post's retweet count from its content. I extracted keywords and used them as bag-of-words features, which gives a 1.2-million-dimensional feature space. The feature vectors are very sparse, usually with only about ten non-zero dimensions per data point. I use SVR to do the regression, and it has now been running for 2 days; I think the training might take quite a long time. I don't know if I am doing this task like …
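
If a linear model is acceptable, one option is scikit-learn's LinearSVR (or SGDRegressor), which accepts scipy.sparse input and scales roughly linearly with the number of samples, unlike kernel SVR at 1.6M rows. Below is a sketch on toy stand-in data with a similar sparsity pattern; the shapes, density, and hyperparameters are illustrative only.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVR

# Toy stand-in for the real data: sparse CSR matrix, ~12 non-zeros per row.
X = sparse_random(10_000, 1_200_000, density=1e-5, format="csr", random_state=0)
y = np.random.rand(10_000)   # stand-in for retweet counts

# Linear SVR trains directly on the sparse matrix, no densification needed.
model = LinearSVR(C=1.0, epsilon=0.1, max_iter=2000)
model.fit(X, y)
print(model.predict(X[:3]))
```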

Normalize ranking score with weights

冷暖自知 submitted on 2019-12-13 01:16:29
Question: I am working on a document-search problem: given a set of documents and a search query, I want to find the document closest to the query. The model I am using is based on TfidfVectorizer in scikit-learn. I created 4 different tf-idf vectors for all the documents by using 4 different types of tokenizers. Each tokenizer splits the string into n-grams, where n is in the range 1...4. For example: doc_1 = "Singularity is still a confusing phenomenon in physics", doc_2 = "Quantum theory still …
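
One way to get a normalized, weighted ranking is to compute a cosine score per n-gram range and combine them with weights that sum to 1, so the final score stays in [0, 1]. The sketch below uses toy documents and illustrative weights; it is not the asker's exact setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in documents and query.
docs = ["Singularity is still a confusing phenomenon in physics",
        "Quantum theory still lacks a full interpretation"]
query = "confusing physics phenomenon"

weights = [0.4, 0.3, 0.2, 0.1]     # one weight per n-gram range, sums to 1

scores = np.zeros(len(docs))
for n, w in zip(range(1, 5), weights):
    vec = TfidfVectorizer(ngram_range=(n, n))
    doc_matrix = vec.fit_transform(docs)        # tf-idf vectors for this n
    q_vec = vec.transform([query])
    scores += w * cosine_similarity(q_vec, doc_matrix).ravel()

print(scores)                       # weighted score per document, in [0, 1]
best = int(np.argmax(scores))       # index of the closest document
```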

Find semantically similar word for natural language processing

跟風遠走 submitted on 2019-12-13 01:07:55
Question: I am working on a natural language processing project in Java. I have a requirement to identify words that belong to similar semantic groups. For example, if words such as study, university, graduate, and attend are found, I want them to be categorized as being related to education. If words such as golfer, batsman, and athlete are found, they should all be categorized under a parent like sportsperson. Is there a way I can achieve this task without a training approach? Is there some …
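
The question is about Java, but the usual no-training idea, walking WordNet hypernyms until the words meet at a shared parent such as person or athlete, is easiest to sketch with NLTK's WordNet interface; Java libraries such as extJWNL or JWI expose the same data. The sketch below uses only each word's first noun sense, which is a simplification.

```python
from nltk.corpus import wordnet as wn
# requires a one-time nltk.download("wordnet")

def shared_hypernyms(words):
    """Intersect the hypernym paths of each word's first noun sense."""
    shared = None
    for w in words:
        synsets = wn.synsets(w, pos=wn.NOUN)
        if not synsets:
            continue                      # word not in WordNet: skip it
        hypers = {h for path in synsets[0].hypernym_paths() for h in path}
        shared = hypers if shared is None else shared & hypers
    return shared or set()

print(shared_hypernyms(["golfer", "batsman", "athlete"]))
# typically contains Synset('athlete.n.01') and Synset('person.n.01')
```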

How to use an n-gram tokenizer in Lucene 5.0?

泪湿孤枕 submitted on 2019-12-13 00:16:55
Question: I want to generate character n-grams for a string. Below is the Lucene 4.1 code I used for it:

    Reader reader = new StringReader(text);
    NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 3, 5); // contiguous sequences of 3, 4 and 5 characters
    CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
    gramTokenizer.reset(); // a TokenStream must be reset before incrementToken()
    while (gramTokenizer.incrementToken()) {
        String token = charTermAttribute.toString();
        System.out.println(token);
    }

However, I want to use Lucene 5.0.0 to do …
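
For reference only (this is plain Python, not the Lucene API), the output such a tokenizer is expected to produce is every contiguous 3-, 4- and 5-character substring of the input; the emission order may differ from Lucene's.

```python
def char_ngrams(text, min_n=3, max_n=5):
    """Yield every contiguous substring of length min_n..max_n."""
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

print(list(char_ngrams("lucene")))
# ['luc', 'uce', 'cen', 'ene', 'luce', 'ucen', 'cene', 'lucen', 'ucene']
```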

Unable to install textract

时光毁灭记忆、已成空白 submitted on 2019-12-12 21:22:16
Question: Using the command pip install textract, I'm unable to install textract on Ubuntu 16.04 with Python 2. I get the following error:

    Collecting textract
    Requirement already satisfied: python-pptx==0.6.5 in ./anaconda2/lib/python2.7/site-packages (from textract) (0.6.5)
    Requirement already satisfied: docx2txt==0.6 in ./anaconda2/lib/python2.7/site-packages (from textract) (0.6)
    Requirement already satisfied: six==1.10.0 in ./anaconda2/lib/python2.7/site-packages (from textract) (1.10.0)
    Requirement …

Extracting only meaningful text from webpages

守給你的承諾、 submitted on 2019-12-12 21:03:15
Question: I am getting a list of URLs and scraping them using NLTK. My end result is a list containing all the words on the webpage. The trouble is that I am only looking for keywords and phrases, not the usual English "sugar" words such as "as, and, like, to, am, for", etc. I know I can construct a file with all common English words and simply remove them from my list of scraped tokens, but is there a built-in feature of some library that does this automatically? I am …
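
NLTK itself ships a stop word list (nltk.corpus.stopwords); scikit-learn's `stop_words="english"` and spaCy offer similar lists, and the exact contents vary between libraries. A minimal sketch:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")            # one-time download of the word lists

stop_words = set(stopwords.words("english"))
# toy stand-in for the scraped token list
tokens = ["as", "and", "physics", "to", "am", "for", "quantum"]
keywords = [t for t in tokens if t.lower() not in stop_words]
print(keywords)                       # ['physics', 'quantum']
# note: lists differ per library, e.g. NLTK's does not include "like"
```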

Extracting Function Tags from Parsed Sentence (using Stanford Parser)

时光怂恿深爱的人放手 submitted on 2019-12-12 19:38:51
Question: Looking at the Penn Treebank tagset (http://web.mit.edu/6.863/www/PennTreebankTags.html#RB), there is a section called "Function Tags" that would be extremely helpful for a project I am working on. I know the Stanford Parser uses the Penn Treebank tagset for its EnglishPCFG grammar, so I am hoping there is support for function tags. Using the Stanford Parser and NLTK I have parsed sentences with clause-, phrase- and word-level tags as well as Universal Dependencies, but I have not found a way to …
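
Not an answer about the Stanford Parser itself, but as an illustration of what function tags look like: the gold-annotated Penn Treebank sample bundled with NLTK keeps tags such as -SBJ and -TMP in its node labels, which can be inspected as in the sketch below (assumes the `treebank` sample corpus has been downloaded).

```python
import nltk
from nltk.corpus import treebank

nltk.download("treebank")                 # small gold-annotated WSJ sample

tree = treebank.parsed_sents()[0]
for subtree in tree.subtrees():
    label = subtree.label()
    # function tags appear after a hyphen, e.g. "NP-SBJ", "NP-TMP"
    if "-" in label and label != "-NONE-":
        print(label, "->", " ".join(subtree.leaves()))
```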

What kind of max pooling is used in this NLP question hierarchy description?

牧云@^-^@ submitted on 2019-12-12 19:11:53
Question: I'm trying to implement this description, and here is what I did: I generated uni_gram, bi_gram, and tri_gram features of shape (?, 15, 512) (using padding), then for each word I concatenate the three feature vectors into (?, 3, 512) and apply GlobalMaxPooling1D to them. I don't know whether I implemented it well or not, so can anyone help me?

    Q = Input(shape=(15,))
    V = Input(shape=(512,196))
    word_level = Embedding(vocab_size, 512, input_length=max_length)(Q)
    uni_gram = Conv1D(512, kernel_size=1, …
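
A sketch of one reading of that description (as in hierarchical co-attention style models): convolve with window sizes 1, 2 and 3 over the word embeddings, then take the element-wise maximum across the three n-gram scales at each word position, rather than pooling over the time axis. `vocab_size` and the tanh activation are assumptions.

```python
from tensorflow.keras import Input, Model, layers

max_length, vocab_size, dim = 15, 12000, 512

Q = Input(shape=(max_length,))
word_level = layers.Embedding(vocab_size, dim)(Q)          # (batch, 15, 512)

# "same" padding keeps the sequence length at 15 for all three window sizes
uni = layers.Conv1D(dim, kernel_size=1, padding="same", activation="tanh")(word_level)
bi  = layers.Conv1D(dim, kernel_size=2, padding="same", activation="tanh")(word_level)
tri = layers.Conv1D(dim, kernel_size=3, padding="same", activation="tanh")(word_level)

# element-wise max across the three n-gram scales, shape stays (batch, 15, 512)
phrase_level = layers.Maximum()([uni, bi, tri])

model = Model(inputs=Q, outputs=phrase_level)
model.summary()
```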