nlp

Duplicate elimination of similar company names

Submitted by 自古美人都是妖i on 2020-01-15 03:28:14

Question: I have a table of company names. It contains many duplicates caused by human input errors: differing views on whether a subdivision should be included, typos, etc. I want all of these duplicates to be marked as one company, "1c":

+------------------+
| company          |
+------------------+
| 1c               |
| 1c company       |
| 1c game studios  |
| 1c wireless      |
| 1c-avalon        |
| 1c-softclub      |
| 1c: maddox games |
| 1c:inoco         |
| 1cc games        |
+------------------+

I identified Levenshtein distance as a good way
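The question cuts off at edit distance; as a minimal stdlib-only sketch (not the asker's actual code, and the 0.5 threshold is a guess to tune on real data), a length-normalized Levenshtein distance can flag names that likely refer to the same company:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similar(a: str, b: str, threshold: float = 0.5) -> bool:
    """Treat two names as duplicates if the edit distance is small
    relative to the longer name."""
    return levenshtein(a.lower(), b.lower()) / max(len(a), len(b)) <= threshold

print(similar("1c-softclub", "1c softclub"))  # → True
```

Note that plain Levenshtein penalizes appended words heavily ("1c" vs "1c game studios"), which is one reason dedup pipelines often combine it with prefix or token-level matching.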

Using a support vector classifier with polynomial kernel in scikit-learn

Submitted by 半腔热情 on 2020-01-14 20:39:16

Question: I'm experimenting with different classifiers implemented in the scikit-learn package to do some NLP tasks. The code I use to perform the classification is the following:

    def train_classifier(self, argcands):
        # Extract the necessary features from the argument candidates
        train_argcands_feats = []
        train_argcands_target = []

        for argcand in argcands:
            train_argcands_feats.append(self.extract_features(argcand))
            train_argcands_target.append(argcand["info"]["label"])

        # Transform the features to the
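The title mentions a polynomial kernel; a minimal, self-contained sketch of that setup might look like the following (the feature dictionaries and labels are made up, and `DictVectorizer` stands in for the asker's unseen `extract_features` step):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

# toy feature dicts standing in for extract_features(argcand)
feats = [
    {"pos": "NN", "len": 5}, {"pos": "VB", "len": 3},
    {"pos": "NN", "len": 7}, {"pos": "VB", "len": 2},
]
labels = ["ARG", "PRED", "ARG", "PRED"]

vec = DictVectorizer()                       # dicts -> numeric feature matrix
X = vec.fit_transform(feats)

clf = SVC(kernel="poly", degree=2, coef0=1)  # polynomial kernel of degree 2
clf.fit(X, labels)

print(clf.predict(vec.transform([{"pos": "NN", "len": 6}])))
```

Keeping the vectorizer around is important: at prediction time the same `vec.transform` must be used so features land in the same columns.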

Definition of the CESS_ESP tags

Submitted by 狂风中的少年 on 2020-01-14 14:31:53

Question: I'm using the NLTK CESS ESP data package, and I've been able to use an adaptation of the spaghetti tagger and a HiddenMarkovModelTagger to POS-tag sentences. However, the tags it produces are not at all like the ones used when tagging en_US sentences. Here's a link to the Categorizing and Tagging documentation for NLTK; you'll notice that the tags used there are uppercase and don't contain any numbers or punctuation. Some CESS tags: vsip3s0, da0fs0. Does someone know a reference that
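The CESS-ESP corpus uses positional EAGLES-style tags, where the first character encodes the coarse part of speech and later positions encode morphology (in vsip3s0, roughly: verb, indicative, present, 3rd person, singular). As an illustrative sketch, a coarse mapping of the first character could look like the following; the table is my reading of the EAGLES conventions and should be checked against the official tagset documentation:

```python
# coarse part of speech from the first character of an EAGLES-style tag
COARSE = {
    "a": "adjective", "c": "conjunction", "d": "determiner",
    "f": "punctuation", "i": "interjection", "n": "noun",
    "p": "pronoun", "r": "adverb", "s": "adposition",
    "v": "verb", "w": "date/time", "z": "number",
}

def coarse_pos(tag: str) -> str:
    """Map a CESS/EAGLES tag such as 'vsip3s0' to a coarse category."""
    return COARSE.get(tag[:1].lower(), "unknown")

print(coarse_pos("vsip3s0"))  # → verb
print(coarse_pos("da0fs0"))   # → determiner
```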

Sequence of vowels count

Submitted by 怎甘沉沦 on 2020-01-14 14:27:28

Question: This is not a homework question, it is an exam preparation question. I should define a function syllables(word) that counts the number of syllables in a word in the following way:

• a maximal sequence of vowels is a syllable;
• a final e in a word is not a syllable (or the vowel sequence it is a part of).

I do not have to deal with any special cases, such as a final e in a one-syllable word (e.g., 'be' or 'bee').

    >>> syllables('honour')
    2
    >>> syllables('decode')
    2
    >>> syllables('oiseau')
    2
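A direct sketch of the two stated rules (treating only a, e, i, o, u as vowels, which is an assumption the exam text doesn't spell out):

```python
import re

def syllables(word: str) -> int:
    """Count maximal vowel runs; a trailing run ending in 'e' doesn't count."""
    word = word.lower()
    if word.endswith("e"):
        # drop the final vowel sequence the 'e' is part of
        word = re.sub(r"[aeiou]+$", "", word)
    return len(re.findall(r"[aeiou]+", word))

print(syllables("honour"), syllables("decode"), syllables("oiseau"))  # → 2 2 2
```

For "decode", stripping the final run leaves "decod" with runs "e" and "o" (2); "oiseau" doesn't end in e, so its runs "oi" and "eau" both count.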

How can I create my own corpus in the Python Natural Language Toolkit? [duplicate]

Submitted by 巧了我就是萌 on 2020-01-14 13:40:32

Question: This question already has answers here: Creating a new corpus with NLTK (3 answers). Closed 6 years ago. I have recently expanded the names corpus in NLTK and would like to know how I can turn the two files I have (male.txt, female.txt) into a corpus, so I can access them using the existing nltk.corpus methods. Does anyone have any suggestions? Many thanks, James.

Answer 1: As the readme says, the names corpus is not in the public domain -- you should send an email with any changes you make to
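Since the names corpus is a pair of one-name-per-line word lists, NLTK's WordListCorpusReader can expose arbitrary files through the usual corpus interface. A minimal sketch (the temporary directory and file contents are made-up stand-ins for the asker's expanded files):

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# stand-in for a directory containing the expanded male.txt / female.txt
root = tempfile.mkdtemp()
with open(os.path.join(root, "male.txt"), "w") as f:
    f.write("James\nJohn\n")
with open(os.path.join(root, "female.txt"), "w") as f:
    f.write("Mary\nLinda\n")

names = WordListCorpusReader(root, ["male.txt", "female.txt"])
print(list(names.words("male.txt")))  # → ['James', 'John']
print(names.fileids())                # → ['male.txt', 'female.txt']
```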

How to access BERT intermediate layer outputs in TF Hub Module?

Submitted by 假装没事ソ on 2020-01-14 13:27:08

Question: Does anybody know a way to access the outputs of the intermediate layers of BERT's hosted models on TensorFlow Hub? The model is hosted here. I have explored the meta graph and found that the only signatures available are "tokens", "tokenization_info", and "mlm". The first two are illustrated in the examples on GitHub, and the masked language model signature doesn't help much. Some models, like Inception, allow you to access all of the intermediate layers, but not this one. Right now, all I can

Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus

Submitted by 烈酒焚心 on 2020-01-14 10:16:06

Question: I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that. It works for me, but what I don't like about the resulting word2vec model is that named entities are split, which makes the model unusable for my specific application. The model I need has to represent named entities as a single vector. That's why I planned to parse the Wikipedia articles with spaCy and merge entities like
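The merging step itself is independent of spaCy: once an NER pass has produced entity spans, contiguous spans can be collapsed into single tokens before training word2vec. A stdlib-only illustration of the idea (the helper name and the underscore joiner are my own choices, not gensim's or spaCy's API):

```python
def merge_entity_spans(tokens, spans, joiner="_"):
    """Collapse each (start, end) token span into one joined token.

    spans are half-open [start, end) indices into tokens, assumed
    non-overlapping and sorted.
    """
    out, i = [], 0
    for start, end in spans:
        out.extend(tokens[i:start])                 # tokens before the entity
        out.append(joiner.join(tokens[start:end]))  # the entity as one token
        i = end
    out.extend(tokens[i:])                          # the remaining tail
    return out

sentence = ["Barack", "Obama", "visited", "New", "York", "City"]
print(merge_entity_spans(sentence, [(0, 2), (3, 6)]))
# → ['Barack_Obama', 'visited', 'New_York_City']
```

Feeding such merged token lists to gensim's Word2Vec then yields one vector per entity, e.g. for "New_York_City".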

Python: Tokenizing with phrases

Submitted by 感情迁移 on 2020-01-14 07:55:10

Question: I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want tokenized as a single token instead of the regular tokenization. For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006," and adding the phrase to the tokenizer "the
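NLTK ships a multi-word-expression tokenizer that retokenizes an already-tokenized sentence, merging registered phrases into single tokens. A minimal sketch (MWETokenizer matches case-sensitively, so the example lowercases first; that normalization is my addition, not part of the asker's setup):

```python
from nltk.tokenize import MWETokenizer

# register "the west wing" as a single multi-word expression
tokenizer = MWETokenizer([("the", "west", "wing")], separator="_")

sentence = "The West Wing is an American television serial drama"
tokens = tokenizer.tokenize(sentence.lower().split())
print(tokens[:2])  # → ['the_west_wing', 'is']
```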

Are there any NLP tools for semantic parsing of languages other than English?

Submitted by 不问归期 on 2020-01-14 07:09:28

Question: I want to parse Malayalam (an Indian language) text corpora for developing a question answering system. Are there any NLP tools for semantic parsing of languages other than English?

Answer 1: This might sound big and scary. As far as I know, there is no free-software question-answering system you can study, even if it's documented. There are two parts to question answering:

• understanding the question
• looking up the response in some preprocessed dataset (say, wikidata.org)

Both steps require similar

CWB encoding Corpus

Submitted by 霸气de小男生 on 2020-01-14 03:48:18

Question: According to the Corpus Workbench, to encode a corpus I need to use the cwb-encode perl script: "encode the corpus, i.e. convert the verticalized text to CWB binary format with the cwb-encode tool. Note that the command below has to be entered on a single line." http://cogsci.uni-osnabrueck.de/~korpora/ws/CWBdoc/CWB_Encoding_Tutorial/node3.html

    $ cwb-encode -d /corpora/data/example -f example.vrt -R /usr/local/share/cwb/registry/example -P pos -S s

When I tried it, it says the file is missing