nlp

Duplicate elimination of similar company names

Submitted by 自古美人都是妖i on 2020-01-15 03:28:14

Question: I have a table of company names. It contains many duplicates caused by human input errors: differing views on whether a subdivision should be included, typos, etc. I want all of these duplicates to be marked as one company, "1c":

+------------------+
| company          |
+------------------+
| 1c               |
| 1c company       |
| 1c game studios  |
| 1c wireless      |
| 1c-avalon        |
| 1c-softclub      |
| 1c: maddox games |
| 1c:inoco         |
| 1cc games        |
+------------------+

I identified Levenshtein distance as a good way
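The question cuts off at edit distance; as a minimal stdlib-only sketch (not the asker's actual code, and the 0.5 threshold is a guess to tune on real data), a length-normalized Levenshtein distance can flag names that likely refer to the same company:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similar(a: str, b: str, threshold: float = 0.5) -> bool:
    """Treat two names as duplicates if the edit distance is small
    relative to the longer name."""
    return levenshtein(a.lower(), b.lower()) / max(len(a), len(b)) <= threshold

print(similar("1c-softclub", "1c softclub"))  # → True
```

Note that plain Levenshtein penalizes appended words heavily ("1c" vs "1c game studios"), which is one reason dedup pipelines often combine it with prefix or token-level matching.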

Using a support vector classifier with polynomial kernel in scikit-learn

Submitted by 半腔热情 on 2020-01-14 20:39:16

Question: I'm experimenting with different classifiers implemented in the scikit-learn package to do some NLP tasks. The code I use to perform the classification is the following:

    def train_classifier(self, argcands):
        # Extract the necessary features from the argument candidates
        train_argcands_feats = []
        train_argcands_target = []

        for argcand in argcands:
            train_argcands_feats.append(self.extract_features(argcand))
            train_argcands_target.append(argcand["info"]["label"])

        # Transform the features to the
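The title mentions a polynomial kernel; a minimal, self-contained sketch of that setup might look like the following (the feature dictionaries and labels are made up, and `DictVectorizer` stands in for the asker's unseen `extract_features` step):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

# toy feature dicts standing in for extract_features(argcand)
feats = [
    {"pos": "NN", "len": 5}, {"pos": "VB", "len": 3},
    {"pos": "NN", "len": 7}, {"pos": "VB", "len": 2},
]
labels = ["ARG", "PRED", "ARG", "PRED"]

vec = DictVectorizer()                       # dicts -> numeric feature matrix
X = vec.fit_transform(feats)

clf = SVC(kernel="poly", degree=2, coef0=1)  # polynomial kernel of degree 2
clf.fit(X, labels)

print(clf.predict(vec.transform([{"pos": "NN", "len": 6}])))
```

Keeping the vectorizer around is important: at prediction time the same `vec.transform` must be used so features land in the same columns.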

Definition of the CESS_ESP tags

Submitted by 狂风中的少年 on 2020-01-14 14:31:53

Question: I'm using the NLTK CESS ESP data package, and I've been able to use an adaptation of the spaghetti tagger and a HiddenMarkovModelTagger to POS-tag sentences. However, the tags it produces are not at all like the ones used when tagging en_US sentences. Here's a link to the Categorizing and Tagging documentation for NLTK; you'll notice that the tags used there are uppercase and don't contain any numbers or punctuation. Some CESS tags: vsip3s0, da0fs0. Does someone know a reference that
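The CESS-ESP corpus uses positional EAGLES-style tags, where the first character encodes the coarse part of speech and later positions encode morphology (in vsip3s0, roughly: verb, indicative, present, 3rd person, singular). As an illustrative sketch, a coarse mapping of the first character could look like the following; the table is my reading of the EAGLES conventions and should be checked against the official tagset documentation:

```python
# coarse part of speech from the first character of an EAGLES-style tag
COARSE = {
    "a": "adjective", "c": "conjunction", "d": "determiner",
    "f": "punctuation", "i": "interjection", "n": "noun",
    "p": "pronoun", "r": "adverb", "s": "adposition",
    "v": "verb", "w": "date/time", "z": "number",
}

def coarse_pos(tag: str) -> str:
    """Map a CESS/EAGLES tag such as 'vsip3s0' to a coarse category."""
    return COARSE.get(tag[:1].lower(), "unknown")

print(coarse_pos("vsip3s0"))  # → verb
print(coarse_pos("da0fs0"))   # → determiner
```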

Sequence of vowels count

Submitted by 怎甘沉沦 on 2020-01-14 14:27:28

Question: This is not a homework question, it is an exam preparation question. I should define a function syllables(word) that counts the number of syllables in a word in the following way:

• a maximal sequence of vowels is a syllable;
• a final e in a word is not a syllable (or the vowel sequence it is a part of).

I do not have to deal with any special cases, such as a final e in a one-syllable word (e.g., 'be' or 'bee').

    >>> syllables('honour')
    2
    >>> syllables('decode')
    2
    >>> syllables('oiseau')
    2
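A direct sketch of the two stated rules (treating only a, e, i, o, u as vowels, which is an assumption the exam text doesn't spell out):

```python
import re

def syllables(word: str) -> int:
    """Count maximal vowel runs; a trailing run ending in 'e' doesn't count."""
    word = word.lower()
    if word.endswith("e"):
        # drop the final vowel sequence the 'e' is part of
        word = re.sub(r"[aeiou]+$", "", word)
    return len(re.findall(r"[aeiou]+", word))

print(syllables("honour"), syllables("decode"), syllables("oiseau"))  # → 2 2 2
```

For "decode", stripping the final run leaves "decod" with runs "e" and "o" (2); "oiseau" doesn't end in e, so its runs "oi" and "eau" both count.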

How can I create my own corpus in the Python Natural Language Toolkit? [duplicate]

Submitted by 巧了我就是萌 on 2020-01-14 13:40:32

Question: This question already has answers here: Creating a new corpus with NLTK (3 answers). Closed 6 years ago. I have recently expanded the names corpus in NLTK and would like to know how I can turn the two files I have (male.txt, female.txt) into a corpus, so I can access them using the existing nltk.corpus methods. Does anyone have any suggestions? Many thanks, James.

Answer 1: As the readme says, the names corpus is not in the public domain -- you should send an email with any changes you make to
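Since the names corpus is a pair of one-name-per-line word lists, NLTK's WordListCorpusReader can expose arbitrary files through the usual corpus interface. A minimal sketch (the temporary directory and file contents are made-up stand-ins for the asker's expanded files):

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# stand-in for a directory containing the expanded male.txt / female.txt
root = tempfile.mkdtemp()
with open(os.path.join(root, "male.txt"), "w") as f:
    f.write("James\nJohn\n")
with open(os.path.join(root, "female.txt"), "w") as f:
    f.write("Mary\nLinda\n")

names = WordListCorpusReader(root, ["male.txt", "female.txt"])
print(list(names.words("male.txt")))  # → ['James', 'John']
print(names.fileids())                # → ['male.txt', 'female.txt']
```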

How to access BERT intermediate layer outputs in TF Hub Module?

Submitted by 假装没事ソ on 2020-01-14 13:27:08

Question: Does anybody know a way to access the outputs of the intermediate layers of BERT's hosted models on TensorFlow Hub? The model is hosted here. I have explored the meta graph and found that the only signatures available are "tokens", "tokenization_info", and "mlm". The first two are illustrated in the examples on GitHub, and the masked language model signature doesn't help much. Some models, like Inception, allow you to access all of the intermediate layers, but not this one. Right now, all I can

Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus

Submitted by 烈酒焚心 on 2020-01-14 10:16:06

Question: I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that. It works for me, but what I don't like about the resulting word2vec model is that named entities are split, which makes the model unusable for my specific application. The model I need has to represent named entities as a single vector. That's why I planned to parse the Wikipedia articles with spaCy and merge entities like
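The merging step itself is independent of spaCy: once an NER pass has produced entity spans, contiguous spans can be collapsed into single tokens before training word2vec. A stdlib-only illustration of the idea (the helper name and the underscore joiner are my own choices, not gensim's or spaCy's API):

```python
def merge_entity_spans(tokens, spans, joiner="_"):
    """Collapse each (start, end) token span into one joined token.

    spans are half-open [start, end) indices into tokens, assumed
    non-overlapping and sorted.
    """
    out, i = [], 0
    for start, end in spans:
        out.extend(tokens[i:start])                 # tokens before the entity
        out.append(joiner.join(tokens[start:end]))  # the entity as one token
        i = end
    out.extend(tokens[i:])                          # the remaining tail
    return out

sentence = ["Barack", "Obama", "visited", "New", "York", "City"]
print(merge_entity_spans(sentence, [(0, 2), (3, 6)]))
# → ['Barack_Obama', 'visited', 'New_York_City']
```

Feeding such merged token lists to gensim's Word2Vec then yields one vector per entity, e.g. for "New_York_City".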

Python: Tokenizing with phrases

Submitted by 感情迁移 on 2020-01-14 07:55:10

Question: I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want tokenized as a single token instead of the regular tokenization. For example, given the sentence "The West Wing is an American television serial drama created by Aaron Sorkin that was originally broadcast on NBC from September 22, 1999 to May 14, 2006," and adding the phrase to the tokenizer "the
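NLTK ships a multi-word-expression tokenizer that retokenizes an already-tokenized sentence, merging registered phrases into single tokens. A minimal sketch (MWETokenizer matches case-sensitively, so the example lowercases first; that normalization is my addition, not part of the asker's setup):

```python
from nltk.tokenize import MWETokenizer

# register "the west wing" as a single multi-word expression
tokenizer = MWETokenizer([("the", "west", "wing")], separator="_")

sentence = "The West Wing is an American television serial drama"
tokens = tokenizer.tokenize(sentence.lower().split())
print(tokens[:2])  # → ['the_west_wing', 'is']
```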

Are there any NLP tools for semantic parsing of languages other than English?

Submitted by 不问归期 on 2020-01-14 07:09:28

Question: I want to parse Malayalam (an Indian language) text corpora for developing a question answering system. Are there any NLP tools for semantic parsing of languages other than English?

Answer 1: This might sound big and scary. As far as I know, there is no free-software question-answering system you can study, even if it's documented. There are two parts to question answering:

• understanding the question
• looking up the response in some preprocessed dataset (say, wikidata.org)

Both steps require similar

CWB encoding Corpus

Submitted by 霸气de小男生 on 2020-01-14 03:48:18

Question: According to the Corpus Workbench, to encode a corpus I need to use the cwb-encode perl script: "encode the corpus, i.e. convert the verticalized text to CWB binary format with the cwb-encode tool. Note that the command below has to be entered on a single line." http://cogsci.uni-osnabrueck.de/~korpora/ws/CWBdoc/CWB_Encoding_Tutorial/node3.html

    $ cwb-encode -d /corpora/data/example -f example.vrt -R /usr/local/share/cwb/registry/example -P pos -S s

When I tried it, it says the file is missing