nlp | 易学教程

How to use lemmatisation (LemmaGen) in C++

Question: I'm using LemmaGen (http://lemmatise.ijs.si) for text lemmatisation. I've successfully used it by running the following statement on the command line: $lemmatize -l $./data/lemmatizer/lem-m-en.bin input.txt output.txt However, I actually want to use it as a library in my C++ project programmatically. Does anyone know how to use the LemmaGen C++ API? Thanks! Or can anyone suggest another lemmatisation library that can be used from C++ programmatically? Please correct me if I'm asking the question

Large scale naïve Bayes classifier with top-k output

Question: I need a library for large-scale naïve Bayes, with millions of training examples and 100k+ binary features. It must be an online version (updatable after training). I also need top-k output, that is, multiple classifications for a single instance. Accuracy is not very important. The purpose is an automatic text categorization application. Any suggestions for a good library are much appreciated. EDIT: The library should preferably be in Java. Answer 1: If a learning algorithm other than naïve Bayes

Doc2vec: model.docvecs is only of length 10

Question: I am trying doc2vec on 600000 rows of sentences, and my code is below: model = gensim.models.doc2vec.Doc2Vec(size= 100, min_count = 5,window=4, iter = 50, workers=cores) model.build_vocab(res) model.train(res, total_examples=model.corpus_count, epochs=model.iter) #len(res) = 663406 #length of unique words 15581 print(len(model.wv.vocab)) #length of doc vectors is 10 len(model.docvecs) # each of length 100 len(model.docvecs[1]) How do I interpret this result? Why is the length of vector only
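A common cause of this symptom, which the truncated excerpt does not confirm, is passing each document's tag as a plain string instead of a list of tags. gensim iterates over a document's `tags`, so a string such as "663405" is read as six single-character tags and the whole corpus collapses to the ten digits. The sketch below imitates that iteration in plain Python without gensim; `collect_tags` is a hypothetical helper, not a gensim function, and the 1000-document corpus is made up for illustration.

```python
def collect_tags(tagged_docs):
    """Collect distinct tags the way a loop over each document's tags would."""
    seen = set()
    for _words, tags in tagged_docs:
        for tag in tags:  # iterating a str yields its individual characters
            seen.add(tag)
    return seen


# Hypothetical stand-in corpus: 1000 identical tokenised sentences.
sentences = [["some", "words"]] * 1000

# Tag given as a plain string: "123" is iterated as "1", "2", "3".
wrong = [(words, str(i)) for i, words in enumerate(sentences)]
# Tag wrapped in a list: each document keeps its own tag.
right = [(words, [str(i)]) for i, words in enumerate(sentences)]

print(len(collect_tags(wrong)))  # 10
print(len(collect_tags(right)))  # 1000
```

If this is what happened, wrapping the tag in a list when building each tagged document would be the likely fix.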

Jython: ImportError: No module named multiarray

Question: When I try to call a file and its method using Jython, it shows the following error, even though NumPy, Python and NLTK are correctly installed and everything works properly if I run it directly from the Python shell: File "C:\Python26\Lib\site-packages\numpy\core\__init__.py", line 5, in <module> import multiarray ImportError: No module named multiarray The code that I am using is a simple one: PyInstance hello = ie.createClass("PreProcessing", "None"); PyString str = new PyString("my name is abcd");

How to get PMI scores for trigrams with NLTK Collocations? python

Question: I know how to get bigram and trigram collocations using NLTK, and I apply them to my own corpora. The code is below. My only problem is how to print out the bigram with the PMI value. I have searched the NLTK documentation multiple times. Either I'm missing something or it's not there. import nltk from nltk.collocations import * myFile = open("large.txt", 'r').read() myList = myFile.split() myCorpus = nltk.Text(myList) trigram_measures = nltk.collocations.TrigramAssocMeasures() finder =
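To make the quantity being asked about concrete, here is a hand-rolled sketch of trigram PMI, log2 of P(w1 w2 w3) / (P(w1) P(w2) P(w3)), using only the standard library. It is not NLTK's implementation, and NLTK's normalisation may differ slightly; `trigram_pmi` and the sample sentence are illustrative assumptions. Within NLTK itself, calling the finder's `score_ngrams` method with `trigram_measures.pmi` should return (trigram, score) pairs directly.

```python
import math
from collections import Counter


def trigram_pmi(words):
    """Return {trigram: PMI} from maximum-likelihood probability estimates."""
    unigrams = Counter(words)
    trigrams = Counter(zip(words, words[1:], words[2:]))
    n_words = len(words)
    n_trigrams = sum(trigrams.values())
    scores = {}
    for (w1, w2, w3), count in trigrams.items():
        p_joint = count / n_trigrams
        # Probability of the three words co-occurring if they were independent.
        p_indep = (
            (unigrams[w1] / n_words)
            * (unigrams[w2] / n_words)
            * (unigrams[w3] / n_words)
        )
        scores[(w1, w2, w3)] = math.log2(p_joint / p_indep)
    return scores


# Illustrative input; the original question reads tokens from a file.
words = "the quick brown fox jumps over the lazy dog".split()
for trigram, score in sorted(trigram_pmi(words).items(), key=lambda kv: -kv[1]):
    print(trigram, round(score, 3))
```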

Converting output of dependency parsing to tree

Question: I am using the Stanford dependency parser and I get the following output for the sentence "I shot an elephant in my sleep": python dep_parsing.py [((u'shot', u'VBD'), u'nsubj', (u'I', u'PRP')), ((u'shot', u'VBD'), u'dobj', (u'elephant', u'NN')), ((u'elephant', u'NN'), u'det', (u'an', u'DT')), ((u'shot', u'VBD'), u'nmod', (u'sleep', u'NN')), ((u'sleep', u'NN'), u'case', (u'in', u'IN')), ((u'sleep', u'NN'), u'nmod:poss', (u'my', u'PRP$'))] I want to convert this into a graph with nodes being each
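One way to turn such triples into a tree, sketched with the standard library only: map each head word to its (relation, dependent) pairs and take as root the head that never appears as a dependent. `build_tree` and `render` are hypothetical helper names. The sketch assumes every word occurs only once in the sentence; with repeated words the nodes would need token indices, which this output format does not carry.

```python
from collections import defaultdict

# The triples from the question, as (head, relation, dependent).
triples = [
    (('shot', 'VBD'), 'nsubj', ('I', 'PRP')),
    (('shot', 'VBD'), 'dobj', ('elephant', 'NN')),
    (('elephant', 'NN'), 'det', ('an', 'DT')),
    (('shot', 'VBD'), 'nmod', ('sleep', 'NN')),
    (('sleep', 'NN'), 'case', ('in', 'IN')),
    (('sleep', 'NN'), 'nmod:poss', ('my', 'PRP$')),
]


def build_tree(triples):
    """Return (root, children); children maps head -> [(relation, dependent)]."""
    children = defaultdict(list)
    dependents = set()
    for (head, _head_tag), rel, (dep, _dep_tag) in triples:
        children[head].append((rel, dep))
        dependents.add(dep)
    # The root is the only head that is nobody's dependent.
    roots = [head for head in children if head not in dependents]
    return roots[0], dict(children)


def render(node, children, rel="root", depth=0):
    """Return the subtree under node as a list of indented lines."""
    lines = ["  " * depth + f"{node} ({rel})"]
    for child_rel, child in children.get(node, []):
        lines.extend(render(child, children, child_rel, depth + 1))
    return lines


root, children = build_tree(triples)
print("\n".join(render(root, children)))
```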

How to sum up the word count for each person in a dialogue?

Question: I'm starting to learn Python and I'm trying to write a program that would import a text file, count the total number of words, count the number of words in a specific paragraph (said by each participant, described by 'P1', 'P2' etc.), exclude these labels (i.e. 'P1' etc.) from my word count, and print the paragraphs separately. Thanks to @James Hurford I got this code: words = None with open('data.txt') as f: words = f.read().split() total_words = len(words) print 'Total words:', total_words in
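A minimal Python 3 sketch of the per-speaker count follows. The layout of data.txt is not shown in the excerpt, so this assumes each turn starts with a bare whitespace-separated label such as P1 or P2; labels written differently (for example "P1:") would need a different pattern. `words_per_speaker` and the sample text are illustrative.

```python
import re
from collections import Counter


def words_per_speaker(text):
    """Count words per speaker, excluding the speaker labels themselves."""
    counts = Counter()
    speaker = None
    for token in text.split():
        if re.fullmatch(r"P\d+", token):
            speaker = token  # a label starts a new turn and is not counted
        elif speaker is not None:
            counts[speaker] += 1
    return counts


# Made-up dialogue standing in for the contents of data.txt.
sample = "P1 hello there P2 hi how are you P1 fine thanks"
counts = words_per_speaker(sample)
print(dict(counts))
print("Total words:", sum(counts.values()))
```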

What is the meaning of “isolated symbol probabilities of English”

Question: In a note I found this phrase: Using isolated symbol probabilities of English language, you can find out the entropy of the language. What is actually meant by "isolated symbol probabilities"? This is related to the entropy of an information source. Answer 1: It would be helpful to know where the note came from and what the context is, but even without that I am quite sure this simply means that they use the frequency of individual symbols (e.g. characters) as the basis for entropy, rather than
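Under that reading, the entropy is H = -Σ p_i log2(p_i), where p_i is the relative frequency of symbol i taken on its own, with no regard to the symbols around it. A short sketch, with `symbol_entropy` as an illustrative name:

```python
import math
from collections import Counter


def symbol_entropy(text):
    """Entropy in bits per symbol from isolated character frequencies."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(symbol_entropy("aabb"))  # 1.0: two equally likely symbols
```

Applied to a large English sample, the same function gives a zeroth-order estimate; it ignores dependencies between neighbouring letters, so it can only overestimate the per-symbol entropy of the language.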

Retrieving verb stems from a list of verbs

Question: I have a list of strings which are all verbs. I need to get the word frequencies for each verb, but I want to count verbs such as "want", "wants", "wanting" and "wanted" as one verb. Formally, a "verb" is defined as a set of 4 words that are of the form {X, Xs, Xed, Xing} or of the form {Xe, Xes, Xed, Xing}. How would I go about extracting verbs from the list such that I get "X" and a count of how many times the stem appears? I figured I could somehow use regex, but I'm a regex n00b and am
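A rough regex sketch of the idea: strip one trailing suffix from each word and count what remains. This is a suffix-stripping heuristic rather than a full solution to the formal definition. It does not check that the other forms of a verb are actually present, so words such as "need" or "sing" would be cut incorrectly. For the {Xe, Xes, Xed, Xing} family the key is X itself (for example "lik" for "like"). `stem_counts` is an illustrative name.

```python
import re
from collections import Counter

# One trailing suffix; the leftmost match that reaches the end of the word wins.
SUFFIX = re.compile(r"(ing|ed|es|e|s)$")


def stem_counts(verbs):
    """Count occurrences per stem X after stripping one trailing suffix."""
    return Counter(SUFFIX.sub("", verb.lower()) for verb in verbs)


verbs = ["want", "wants", "wanting", "wanted", "like", "likes", "liked"]
print(stem_counts(verbs))
```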

Identifying multiple categories and associated sentiment within text

Question: If you have a corpus of text, how can you identify all the categories (from a list of pre-defined categories) and the sentiment (positive or negative writing) associated with each? I will be doing this in Python, but at this stage I am not necessarily looking for a language-specific solution. Let's look at this question with an example to try to clarify what I am asking. If I have a whole corpus of reviews for products, e.g.: Microsoft's Xbox One offers impressive graphics and a solid list of