nlp

New to NLP, Question about annotation

纵然是瞬间 submitted on 2019-12-24 02:48:11
Question: I am new to NLP and I am looking for a starting point in terms of tutorials, documentation, or example code. I have been told to research the possibilities of processing natural text to extract structured data from it. For example, I want to extract (annotate) height and weight from statements such as "He is 6 feet tall and weighs 200 pounds" or "His height is 6 feet and weight is 200", etc. I have looked into UIMA, but it seems like a self-created regex dictionary with no training
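As a starting point before reaching for UIMA or a trained model, a pattern-based first pass can be sketched in plain Python; the regular expressions below are illustrative assumptions and cover only the two example phrasings:

```python
import re

# Illustrative patterns; real text needs many more variants (cm/kg, "6'2\"", ranges, etc.)
HEIGHT_RE = re.compile(r"(?:is|height is)\s+(\d+(?:\.\d+)?)\s*feet", re.I)
WEIGHT_RE = re.compile(r"(?:weighs|weight is)\s+(\d+(?:\.\d+)?)\s*(?:pounds|lbs)?", re.I)

def extract_measurements(text):
    """Return a dict with any height/weight values found in the text."""
    result = {}
    h = HEIGHT_RE.search(text)
    w = WEIGHT_RE.search(text)
    if h:
        result["height_feet"] = float(h.group(1))
    if w:
        result["weight_pounds"] = float(w.group(1))
    return result

print(extract_measurements("He is 6 feet tall and weighs 200 pounds"))
# {'height_feet': 6.0, 'weight_pounds': 200.0}
print(extract_measurements("His height is 6 feet and weight is 200"))
# {'height_feet': 6.0, 'weight_pounds': 200.0}
```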

Porter Stemming of fried

送分小仙女 submitted on 2019-12-24 02:23:03
Question: Why does the Porter stemming algorithm online at http://text-processing.com/demo/stem/ stem "fried" to "fri" and not "fry"? I can't recall any English word whose past tense ends in -ied and whose base form ends in -i. Is this a bug?
Answer 1: A stem as returned by the Porter stemmer is not necessarily the base form of a verb, or even a valid word at all. If you're looking for that, you need a lemmatizer instead.
Answer 2: Firstly, a stemmer is not a lemmatizer; see also Stemmers vs
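A small comparison with NLTK makes both answers concrete (a sketch; it assumes NLTK and its WordNet data are installed):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download("wordnet") may be needed the first time

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The stemmer strips suffixes by rule, so the result need not be a real word.
print(stemmer.stem("fried"))                    # fri
# The lemmatizer maps to a dictionary form, given the right part of speech.
print(lemmatizer.lemmatize("fried", pos="v"))   # fry
```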

Why does the CoreNLP NER tagger join the separated numbers together?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-24 01:43:06
Question: Here is the code snippet:
In [390]: t
Out[390]: ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
In [391]: ner_tagger.tag(t)
Out[391]: [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111\xa01111\xa01111', 'NUMBER')]
What I expect is:
Out[391]: [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111', 'NUMBER'), ('1111', 'NUMBER'), ('1111', 'NUMBER')]
As you can see, the artificial phone number is joined by \xa0, which is said to be a non-breaking space. Can I
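If the tagger's joining behavior cannot be changed, one workaround is to split such glued tokens back apart in post-processing and repeat the tag; a small sketch, assuming the pieces are always joined with U+00A0:

```python
def split_joined_tokens(tagged, sep="\xa0"):
    """Split tokens that were glued together with a non-breaking space,
    repeating the original tag for each piece."""
    fixed = []
    for token, tag in tagged:
        for piece in token.split(sep):
            fixed.append((piece, tag))
    return fixed

tagged = [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
          ('1111\xa01111\xa01111', 'NUMBER')]
print(split_joined_tokens(tagged))
# [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
#  ('1111', 'NUMBER'), ('1111', 'NUMBER'), ('1111', 'NUMBER')]
```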

Is this handling of ambiguities in dypgen normal or is it not?

风流意气都作罢 submitted on 2019-12-24 01:26:29
Question: I would like to know whether this is a bug or behavior intended by the author. Here is a minimal example of a dypgen grammar:
{ open Parse_tree let dyp_merge = Dyp.keep_all }
%start main
%layout [' ' '\t']
%%
main:
| a "\n" { $1 }
a:
| ms b { Mt ($1,$2) }
| b <Mt(_,_)> kon1 b { Koo ($1, $2, $3) }
| b <Mt(_,_)> kon2 b { Koo ($1, $2, $3) }
| b { $1 }
b:
| k { $1 }
| ns b { Nt ($1,$2) } /* If you comment this line out, it will work with the permutation, but I need the 'n' ! */
/* | b

Setting maxLength for a sentence in StanfordCoreNLP

我与影子孤独终老i submitted on 2019-12-24 01:23:57
Question: I am trying to restrict the maximum sentence length in StanfordCoreNLP, but for some reason it does not seem to honor this property. The flag is part of the LexicalizedParser, but I am using a StanfordCoreNLP instance in my class, so I am wondering what the right way to set this flag is.
Properties properties = new Properties();
properties.put("annotators", "tokenize,ssplit,pos,lemma,ner");
properties.put("-maxLength", "100"); // does not work
StanfordCoreNLP nap = new StanfordCoreNLP(properties);
Answer 1:
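Leaving the Java property question aside, one workaround is to enforce the limit yourself before text reaches the pipeline. A rough sketch in Python, assuming a whitespace split is close enough to the tokenizer's count (the function name and threshold are illustrative):

```python
def drop_long_sentences(sentences, max_tokens=100):
    """Keep only sentences whose whitespace-split token count is within the limit."""
    return [s for s in sentences if len(s.split()) <= max_tokens]

sentences = ["A short sentence.", "word " * 150]
print(drop_long_sentences(sentences, max_tokens=100))
# ['A short sentence.']  (the 150-token string is filtered out before annotation)
```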

Stanford NER: AbstractSequenceClassifier vs NamedEntityTagAnnotation

我的未来我决定 submitted on 2019-12-24 00:49:55
Question:
1. How do I load a custom properties file using AbstractSequenceClassifier? e.g.,
Master's Degree\tDEGREE
MBA\tDEGREE
2. What are the benefits/drawbacks of each approach (AbstractSequenceClassifier vs NamedEntityTagAnnotation)?
3. Is there any accessible documentation or tutorial on the internet? I can play with demo code and read the javadocs, but a good tutorial would save me and many others a lot of time.
During my perusal of the Stanford NER documentation, I have encountered two Java examples.
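The effect of a phrase-to-label mapping like the Master's Degree\tDEGREE example can be illustrated with a tiny gazetteer matcher; this is purely an illustration in Python, not the Stanford API, and it glosses over tokenization and overlap handling:

```python
# Phrase -> label pairs, the same shape as a "Master's Degree\tDEGREE" mapping file.
GAZETTEER = {
    "master's degree": "DEGREE",
    "mba": "DEGREE",
}

def tag_phrases(text, gazetteer=GAZETTEER):
    """Return (phrase, label, start_offset) triples for every gazetteer phrase found."""
    lowered = text.lower()
    hits = []
    for phrase, label in gazetteer.items():
        start = lowered.find(phrase)
        while start != -1:
            hits.append((text[start:start + len(phrase)], label, start))
            start = lowered.find(phrase, start + 1)
    return sorted(hits, key=lambda h: h[2])

print(tag_phrases("She holds an MBA and a Master's Degree in linguistics."))
# [('MBA', 'DEGREE', 13), ("Master's Degree", 'DEGREE', 23)]
```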

Why do we use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix?

六月ゝ 毕业季﹏ submitted on 2019-12-24 00:48:54
Question: In word2vec, after training we get two weight matrices: 1. the input-hidden weight matrix; 2. the hidden-output weight matrix. People then use the input-hidden weight matrix as the word vectors (each row corresponds to a word, i.e., that word's vector). Here is my confusion: why do people use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix? And why don't we just add a softmax activation function to the hidden layer rather than the output layer, thus preventing
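A toy skip-gram forward pass in NumPy makes the roles of the two matrices explicit (the sizes and random values are made up for illustration):

```python
import numpy as np

V, d = 5, 3                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))    # input-hidden matrix: row i is word i's vector
W_out = rng.normal(size=(d, V))   # hidden-output matrix: column j scores word j as context

center = 2                        # index of the center word
h = W_in[center]                  # the "hidden layer" is just a lookup of one row (no activation)
scores = h @ W_out                # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the output layer

print(h)       # this row is what gets reported as the word vector for word 2
print(probs)   # predicted distribution over context words
```

Each word's identity enters the network through exactly one row of W_in, which is why that row is the natural per-word representation, while the softmax has to sit on the output layer because that is where the probability distribution over context words is defined.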

Creating a vector space

…衆ロ難τιáo~ submitted on 2019-12-24 00:46:35
Question: I've got a question: I have a lot of documents, and each line is built from some pattern. Of course, I have this array of patterns. I want to create a vector space and then vectorize the patterns by some rule (I have no idea what this rule is yet...), i.e. make the patterns the "centroids" of my vector space. Then I want to vectorize each line of the current document (again by the same rule) and find the closest centroid to that line (i.e. the minimum distance between the two vectors). I don't
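One concrete choice of "rule" is TF-IDF vectorization with cosine similarity; a sketch using scikit-learn, where the toy patterns and the library choice are assumptions rather than the only option:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

patterns = [
    "error connecting to database",
    "user logged in successfully",
    "payment transaction completed",
]

vectorizer = TfidfVectorizer()
centroids = vectorizer.fit_transform(patterns)   # one vector per pattern ("centroid")

line = "database connection error after timeout"
line_vec = vectorizer.transform([line])          # vectorize the line with the same rule

similarities = cosine_similarity(line_vec, centroids)[0]
best = similarities.argmax()
print(patterns[best], similarities[best])
# error connecting to database  (the centroid with the highest cosine similarity)
```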

Need an approach for building a custom NER to extract the keywords below from any format of payslip

痞子三分冷 submitted on 2019-12-24 00:38:21
Question: I am trying to build a generic extraction of the parameters below from any format of payslip:
Name
His postcode
Pay date
Net pay
The challenge I am facing is the variety of formats that may come in, so I want to apply NER (spaCy) to learn these under the entities:
Name - PERSON
His postcode
Pay date - DATE
Net pay - MONEY
But I have been unsuccessful so far; I even tried to build a custom EntityMatcher for the postcode and date, but with no success. I am looking for any guideline or approach that would put me on the right path in
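One commonly suggested route for deterministic fields such as postcodes is to put a rule-based EntityRuler in front of spaCy's statistical NER. The sketch below assumes spaCy 3.x, an installed en_core_web_sm model, and a deliberately rough UK-postcode regex:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# The rule-based layer runs before the statistical NER, so its spans win on conflicts.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {
        "label": "POSTCODE",
        # Two-token UK-style postcode, e.g. "SW1A 1AA"; a real system needs a stricter pattern.
        "pattern": [
            {"TEXT": {"REGEX": r"^[A-Z]{1,2}\d[A-Z\d]?$"}},
            {"TEXT": {"REGEX": r"^\d[A-Z]{2}$"}},
        ],
    },
])

doc = nlp("Net pay of 2,500 GBP was sent to John Smith at SW1A 1AA on 28 March 2019.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Expect PERSON/DATE/MONEY-type entities from the model, plus ('SW1A 1AA', 'POSTCODE') from the ruler.
```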

Is there a proper installation guide for Giza++ on Ubuntu?

五迷三道 submitted on 2019-12-23 23:48:00
Question: I see a proper installation guide available for Giza, but not for Giza++. The instructions for installing the former (as found at http://giza.sourceforge.net/documentation/installation.html) obviously do not work for the latter. I am using Ubuntu 12.04.
Answer 1: TL;DR
sudo apt-get install build-essential git-core pkg-config automake libtool wget zlib1g-dev python-dev libbz2-dev
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
make -f contrib/Makefiles/install-dependencies