nlp

New to NLP, Question about annotation

纵然是瞬间 submitted on 2019-12-24 02:48:11
Question: I am new to NLP and I am looking for a starting point in terms of tutorials, documentation, or example code. I have been told to research the possibilities of processing natural text to extract structured data from it. For example, I want to extract (annotate) height and weight from statements such as "He is 6 feet tall and weighs 200 pounds" or "His height is 6 feet and weight is 200", etc. I have looked into UIMA, but it seems like a self-created regex dictionary with no training
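As a starting point before reaching for UIMA or a trained model, a pattern-based first pass can be sketched in plain Python; the regular expressions below are illustrative assumptions and cover only the two example phrasings:

```python
import re

# Illustrative patterns; real text needs many more variants (cm/kg, "6'2\"", ranges, etc.)
HEIGHT_RE = re.compile(r"(?:is|height is)\s+(\d+(?:\.\d+)?)\s*feet", re.I)
WEIGHT_RE = re.compile(r"(?:weighs|weight is)\s+(\d+(?:\.\d+)?)\s*(?:pounds|lbs)?", re.I)

def extract_measurements(text):
    """Return a dict with any height/weight values found in the text."""
    result = {}
    h = HEIGHT_RE.search(text)
    w = WEIGHT_RE.search(text)
    if h:
        result["height_feet"] = float(h.group(1))
    if w:
        result["weight_pounds"] = float(w.group(1))
    return result

print(extract_measurements("He is 6 feet tall and weighs 200 pounds"))
# {'height_feet': 6.0, 'weight_pounds': 200.0}
print(extract_measurements("His height is 6 feet and weight is 200"))
# {'height_feet': 6.0, 'weight_pounds': 200.0}
```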

Porter Stemming of fried

送分小仙女 submitted on 2019-12-24 02:23:03
Question: Why does the Porter stemming algorithm online at http://text-processing.com/demo/stem/ stem "fried" to "fri" and not "fry"? I can't recall any English word whose past tense ends in -ied and whose base form ends in -i. Is this a bug?
Answer 1: A stem as returned by the Porter stemmer is not necessarily the base form of a verb, or even a valid word at all. If you're looking for that, you need a lemmatizer instead.
Answer 2: Firstly, a stemmer is not a lemmatizer; see also Stemmers vs
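A small comparison with NLTK makes both answers concrete (a sketch; it assumes NLTK and its WordNet data are installed):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download("wordnet") may be needed the first time

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The stemmer strips suffixes by rule, so the result need not be a real word.
print(stemmer.stem("fried"))                    # fri
# The lemmatizer maps to a dictionary form, given the right part of speech.
print(lemmatizer.lemmatize("fried", pos="v"))   # fry
```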

Why does the CoreNLP NER tagger join the separated numbers together?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-24 01:43:06
Question: Here is the code snippet:
In [390]: t
Out[390]: ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
In [391]: ner_tagger.tag(t)
Out[391]: [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111\xa01111\xa01111', 'NUMBER')]
What I expect is:
Out[391]: [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111', 'NUMBER'), ('1111', 'NUMBER'), ('1111', 'NUMBER')]
As you can see, the artificial phone number is joined by \xa0, which is said to be a non-breaking space. Can I
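If the tagger's joining behavior cannot be changed, one workaround is to split such glued tokens back apart in post-processing and repeat the tag; a small sketch, assuming the pieces are always joined with U+00A0:

```python
def split_joined_tokens(tagged, sep="\xa0"):
    """Split tokens that were glued together with a non-breaking space,
    repeating the original tag for each piece."""
    fixed = []
    for token, tag in tagged:
        for piece in token.split(sep):
            fixed.append((piece, tag))
    return fixed

tagged = [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
          ('1111\xa01111\xa01111', 'NUMBER')]
print(split_joined_tokens(tagged))
# [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
#  ('1111', 'NUMBER'), ('1111', 'NUMBER'), ('1111', 'NUMBER')]
```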

Is this handling of ambiguities in dypgen normal or is it not?

风流意气都作罢 submitted on 2019-12-24 01:26:29
Question: I would like to know whether this is a bug or behavior intended by the author. Here is a minimal example of a dypgen grammar:
{ open Parse_tree let dyp_merge = Dyp.keep_all }
%start main
%layout [' ' '\t']
%%
main:
| a "\n" { $1 }
a:
| ms b { Mt ($1,$2) }
| b <Mt(_,_)> kon1 b { Koo ($1, $2, $3) }
| b <Mt(_,_)> kon2 b { Koo ($1, $2, $3) }
| b { $1 }
b:
| k { $1 }
| ns b { Nt ($1,$2) } /* If you comment this line out, it will work with the permutation, but I need the 'n' ! */
/* | b

Setting maxLength for a sentence in StanfordCoreNLP

我与影子孤独终老i submitted on 2019-12-24 01:23:57
Question: I am trying to restrict the maximum sentence length in StanfordCoreNLP, but for some reason it does not seem to honor this property. The flag is part of the LexicalizedParser, but I am using a StanfordCoreNLP instance in my class, so I am wondering what the right way to set this flag is.
Properties properties = new Properties();
properties.put("annotators", "tokenize,ssplit,pos,lemma,ner");
properties.put("-maxLength", "100"); // does not work
StanfordCoreNLP nap = new StanfordCoreNLP(properties);
Answer 1:
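Leaving the Java property question aside, one workaround is to enforce the limit yourself before text reaches the pipeline. A rough sketch in Python, assuming a whitespace split is close enough to the tokenizer's count (the function name and threshold are illustrative):

```python
def drop_long_sentences(sentences, max_tokens=100):
    """Keep only sentences whose whitespace-split token count is within the limit."""
    return [s for s in sentences if len(s.split()) <= max_tokens]

sentences = ["A short sentence.", "word " * 150]
print(drop_long_sentences(sentences, max_tokens=100))
# ['A short sentence.']  (the 150-token string is filtered out before annotation)
```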

Stanford NER: AbstractSequenceClassifier vs NamedEntityTagAnnotation

我的未来我决定 submitted on 2019-12-24 00:49:55
Question:
1. How do I load a custom properties file using AbstractSequenceClassifier? e.g.,
Master's Degree\tDEGREE
MBA\tDEGREE
2. What are the benefits/drawbacks of each approach (AbstractSequenceClassifier vs NamedEntityTagAnnotation)?
3. Is there any accessible documentation or tutorial on the internet? I can play with demo code and read the javadocs, but a good tutorial would save me and many others a lot of time.
During my perusal of the Stanford NER documentation, I have encountered two Java examples.
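The effect of a phrase-to-label mapping like the Master's Degree\tDEGREE example can be illustrated with a tiny gazetteer matcher; this is purely an illustration in Python, not the Stanford API, and it glosses over tokenization and overlap handling:

```python
# Phrase -> label pairs, the same shape as a "Master's Degree\tDEGREE" mapping file.
GAZETTEER = {
    "master's degree": "DEGREE",
    "mba": "DEGREE",
}

def tag_phrases(text, gazetteer=GAZETTEER):
    """Return (phrase, label, start_offset) triples for every gazetteer phrase found."""
    lowered = text.lower()
    hits = []
    for phrase, label in gazetteer.items():
        start = lowered.find(phrase)
        while start != -1:
            hits.append((text[start:start + len(phrase)], label, start))
            start = lowered.find(phrase, start + 1)
    return sorted(hits, key=lambda h: h[2])

print(tag_phrases("She holds an MBA and a Master's Degree in linguistics."))
# [('MBA', 'DEGREE', 13), ("Master's Degree", 'DEGREE', 23)]
```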

Why do we use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix?

六月ゝ 毕业季﹏ submitted on 2019-12-24 00:48:54
Question: In word2vec, after training we get two weight matrices: 1. the input-hidden weight matrix; 2. the hidden-output weight matrix. People then use the input-hidden weight matrix as the word vectors (each row corresponds to a word, i.e., that word's vector). Here is my confusion: why do people use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix? And why don't we just add a softmax activation function to the hidden layer rather than the output layer, thus preventing
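A toy skip-gram forward pass in NumPy makes the roles of the two matrices explicit (the sizes and random values are made up for illustration):

```python
import numpy as np

V, d = 5, 3                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))    # input-hidden matrix: row i is word i's vector
W_out = rng.normal(size=(d, V))   # hidden-output matrix: column j scores word j as context

center = 2                        # index of the center word
h = W_in[center]                  # the "hidden layer" is just a lookup of one row (no activation)
scores = h @ W_out                # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the output layer

print(h)       # this row is what gets reported as the word vector for word 2
print(probs)   # predicted distribution over context words
```

Each word's identity enters the network through exactly one row of W_in, which is why that row is the natural per-word representation, while the softmax has to sit on the output layer because that is where the probability distribution over context words is defined.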

Creating a vector space

…衆ロ難τιáo~ submitted on 2019-12-24 00:46:35
Question: I've got a question: I have a lot of documents, and each line is built from some pattern. Of course, I have this array of patterns. I want to create a vector space and then vectorize the patterns by some rule (I have no idea what this rule is yet...), i.e. make the patterns the "centroids" of my vector space. Then I want to vectorize each line of the current document (again by the same rule) and find the closest centroid to that line (i.e. the minimum distance between the two vectors). I don't
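One concrete choice of "rule" is TF-IDF vectorization with cosine similarity; a sketch using scikit-learn, where the toy patterns and the library choice are assumptions rather than the only option:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

patterns = [
    "error connecting to database",
    "user logged in successfully",
    "payment transaction completed",
]

vectorizer = TfidfVectorizer()
centroids = vectorizer.fit_transform(patterns)   # one vector per pattern ("centroid")

line = "database connection error after timeout"
line_vec = vectorizer.transform([line])          # vectorize the line with the same rule

similarities = cosine_similarity(line_vec, centroids)[0]
best = similarities.argmax()
print(patterns[best], similarities[best])
# error connecting to database  (the centroid with the highest cosine similarity)
```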

Need an approach for building a custom NER to extract the keywords below from any format of payslip

痞子三分冷 submitted on 2019-12-24 00:38:21
Question: I am trying to build a generic extraction of the parameters below from any format of payslip:
Name
His postcode
Pay date
Net pay
The challenge I am facing is the variety of formats that may come in, so I want to apply NER (spaCy) to learn these under the entities:
Name - PERSON
His postcode
Pay date - DATE
Net pay - MONEY
But I have been unsuccessful so far; I even tried to build a custom EntityMatcher for the postcode and date, but with no success. I am looking for any guideline or approach that would put me on the right path in
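One commonly suggested route for deterministic fields such as postcodes is to put a rule-based EntityRuler in front of spaCy's statistical NER. The sketch below assumes spaCy 3.x, an installed en_core_web_sm model, and a deliberately rough UK-postcode regex:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# The rule-based layer runs before the statistical NER, so its spans win on conflicts.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {
        "label": "POSTCODE",
        # Two-token UK-style postcode, e.g. "SW1A 1AA"; a real system needs a stricter pattern.
        "pattern": [
            {"TEXT": {"REGEX": r"^[A-Z]{1,2}\d[A-Z\d]?$"}},
            {"TEXT": {"REGEX": r"^\d[A-Z]{2}$"}},
        ],
    },
])

doc = nlp("Net pay of 2,500 GBP was sent to John Smith at SW1A 1AA on 28 March 2019.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Expect PERSON/DATE/MONEY-type entities from the model, plus ('SW1A 1AA', 'POSTCODE') from the ruler.
```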

Is there a proper installation guide for Giza++ on Ubuntu?

五迷三道 submitted on 2019-12-23 23:48:00
Question: I see a proper installation guide available for Giza, but not for Giza++. The instructions for installing the former (as found at http://giza.sourceforge.net/documentation/installation.html) obviously do not work for the latter. I am using Ubuntu 12.04.
Answer 1: TL;DR
sudo apt-get install build-essential git-core pkg-config automake libtool wget zlib1g-dev python-dev libbz2-dev
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
make -f contrib/Makefiles/install-dependencies