nlp

How to ignore punctuation in-between words using word_tokenize in NLTK?

和自甴很熟 submitted on 2021-01-04 06:41:40
Question: I'm looking to ignore characters in-between words using NLTK word_tokenize. If I have a sentence:

    test = 'Should I trade on the S&P? This works with a phone number 333-445-6635 and email test@testing.com'

the word_tokenize method splits "S&P" into 'S', '&', 'P', '?'. Is there a way to have this library ignore punctuation between words or letters? Expected output: 'S&P', '?'

Answer 1: Let me know how this works with your sentences. I added an additional test with a bunch of punctuation. The …
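One way to get that behavior (a minimal sketch of my own, not the answerer's code, assuming NLTK's RegexpTokenizer): define tokens as runs of word characters optionally joined by the in-word punctuation you want to preserve, and let every other non-space character stand alone.

    from nltk.tokenize import RegexpTokenizer

    # Word characters optionally joined by &, @, ., or - cover S&P,
    # 333-445-6635, and test@testing.com; any other non-space
    # character becomes its own token.
    tokenizer = RegexpTokenizer(r"\w+(?:[&@.\-]\w+)*|\S")

    test = ('Should I trade on the S&P? This works with a phone number '
            '333-445-6635 and email test@testing.com')
    print(tokenizer.tokenize(test))
    # ['Should', 'I', 'trade', 'on', 'the', 'S&P', '?', 'This', ...]

The character class is the knob: add or remove punctuation there to control what counts as in-word.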

TfidfVectorizer - How can I check out processed tokens?

♀尐吖头ヾ submitted on 2021-01-04 05:40:43
Question: How can I check the strings tokenized inside TfidfVectorizer()? If I don't pass anything in the arguments, TfidfVectorizer() will tokenize the string with some pre-defined methods. I want to observe how it tokenizes strings so that I can more easily tune my model.

    from sklearn.feature_extraction.text import TfidfVectorizer
    corpus = ['This is the first document.',
              'This document is the second document.',
              'And this is the third one.',
              'Is this the first document?']
    vectorizer = TfidfVectorizer…
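A sketch of one way to inspect this with scikit-learn's public API (my own illustration, not from the question): build_analyzer() returns the exact preprocess-and-tokenize callable the vectorizer applies to each document, and get_feature_names_out() (get_feature_names() on scikit-learn older than 1.0) lists the vocabulary it ends up with.

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ['This is the first document.',
              'This document is the second document.']
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus)

    # The analyzer is what TfidfVectorizer runs on each raw document:
    # lowercasing, token_pattern matching, stop-word filtering, n-grams.
    analyzer = vectorizer.build_analyzer()
    print(analyzer('This is the first document.'))
    # ['this', 'is', 'the', 'first', 'document']

    # The vocabulary actually learned during fit().
    print(vectorizer.get_feature_names_out())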

WordPiece tokenization versus conventional lemmatization?

蹲街弑〆低调 submitted on 2021-01-02 06:28:10
Question: I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing"). Right now, I have my text preprocessed using a standard tokenizer that splits on spaces / some punctuation, and then I have a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece …
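To see the difference concretely, here is a small sketch (assuming the Hugging Face transformers package, which the question does not mention; the exact splits depend on the model's learned vocabulary):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    # Frequent words are often kept whole, so "playing" may not split at
    # all, while rarer words fall back to subword pieces marked with "##".
    print(tokenizer.tokenize('playing'))
    print(tokenizer.tokenize('unaffable'))

Unlike a lemmatizer, the split points are chosen to maximize vocabulary coverage, not to recover a linguistic stem, so the pieces need not be morphemes.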

Repeated entity labels when replacing entities with their entity label using spaCy

南楼画角 submitted on 2021-01-01 09:26:08
Question: Code:

    import spacy
    nlp = spacy.load("en_core_web_md")

    # read txt file, each string on its own line
    with open("./try.txt", "r") as f:
        texts = f.read().splitlines()

    # substitute entities with their TAGS
    docs = nlp.pipe(texts)
    out = []
    for doc in docs:
        out_ = ""
        for tok in doc:
            text = tok.text
            if tok.ent_type_:
                text = tok.ent_type_
            out_ += text + tok.whitespace_
        out.append(out_)

    # write to file
    with open("./out_try.txt", "w") as f:
        f.write("\n".join(out))

Contents of input file: Georgia recently …
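With the per-token loop above, a multi-token entity such as "New York" is replaced by its label once per token ("GPE GPE"). A sketch of one fix (my own, not from the question; the example sentence is hypothetical): splice each entity's label over its character span via doc.ents, so every entity is written exactly once.

    import spacy

    nlp = spacy.load("en_core_web_md")
    doc = nlp("Georgia recently moved to New York")

    # Replace each entity span with its label exactly once, keeping the
    # surrounding text verbatim.
    parts, last_end = [], 0
    for ent in doc.ents:
        parts.append(doc.text[last_end:ent.start_char])  # text before entity
        parts.append(ent.label_)                         # the label, once
        last_end = ent.end_char
    parts.append(doc.text[last_end:])                    # trailing text
    print("".join(parts))
    # e.g. "GPE recently moved to GPE"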

Using a Gensim FastText model with an LSTM NN in Keras

只谈情不闲聊 submitted on 2020-12-31 14:52:51
Question: I have trained a fastText model with Gensim over a corpus of very short sentences (up to 10 words). I know that my test set includes words that are not in my train corpus, i.e. some of the words in my corpus are like "Oxytocin", "Lexitocin", "Ematrophin", "Betaxitocin". Given a new word in the test set, fastText knows pretty well how to generate a vector with high cosine similarity to the other similar words in the train set, by using character-level n-grams. How do I incorporate the fastText model …
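A sketch of one common pattern (my own illustration; the path and vocabulary below are hypothetical): ask gensim's FastText for a vector for every word you expect at train or test time (it synthesizes vectors for unseen words from character n-grams), stack them into a matrix, and load that matrix into a frozen Keras Embedding layer in front of the LSTM.

    import numpy as np
    from gensim.models import FastText
    from tensorflow import keras

    ft = FastText.load("fasttext.model")  # hypothetical path
    vocab = ["oxytocin", "lexitocin", "ematrophin", "betaxitocin"]  # hypothetical
    word_index = {w: i + 1 for i, w in enumerate(vocab)}  # index 0 = padding

    dim = ft.wv.vector_size
    matrix = np.zeros((len(vocab) + 1, dim))
    for word, i in word_index.items():
        matrix[i] = ft.wv[word]  # works even for OOV words via char n-grams

    model = keras.Sequential([
        keras.layers.Embedding(
            len(vocab) + 1, dim,
            embeddings_initializer=keras.initializers.Constant(matrix),
            trainable=False, mask_zero=True),
        keras.layers.LSTM(64),
        keras.layers.Dense(1, activation="sigmoid"),
    ])

The limitation is that the matrix is fixed at build time; fully open-ended OOV handling means computing ft.wv[word] vectors outside the model and feeding the vector sequences to the LSTM directly.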

CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

爷，独闯天下 submitted on 2020-12-30 06:12:46
Question: I got the following error when I ran my PyTorch deep learning model in Colab:

    /usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
       1370         ret = torch.addmm(bias, input, weight.t())
       1371     else:
    -> 1372         output = input.matmul(weight.t())
       1373     if bias is not None:
       1374         output += bias

    RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

I even reduced the batch size from 128 to 64, i.e. to half, but I still got this error …
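Worth knowing before chasing memory: despite the ALLOC_FAILED wording, this CUBLAS error is very often an out-of-range index, typically a label equal to or larger than the final Linear layer's output size, rather than GPU memory exhaustion. A diagnostic sketch (my own, with hypothetical names):

    import torch

    # Hypothetical: `labels` are your class indices and the model ends in
    # nn.Linear(hidden_dim, num_classes).
    num_classes = 10
    labels = torch.tensor([0, 3, 9])

    # Class indices must lie in [0, num_classes); a stray -1 or an
    # off-by-one class count triggers cryptic CUDA-side failures.
    assert labels.min().item() >= 0
    assert labels.max().item() < num_classes

Re-running the same batch on the CPU usually surfaces a much clearer error message than the asynchronous CUDA one.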

WordNet - What does n and the number represent?

半世苍凉 submitted on 2020-12-29 13:14:21
Question: My question is related to the WordNet interface.

    >>> wn.synsets('cat')
    [Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'),
     Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), Synset('big_cat.n.01'),
     Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]

I could not find an answer to what the purpose of the n and the following number is in cat.n.01 or caterpillar.n.02.

Answer 1: Per the NLTK docs, a <lemma>.<pos>.<number> Synset string …
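A quick way to confirm this in NLTK (a sketch using the standard wordnet interface): the middle field is the part of speech ('n' noun, 'v' verb, 'a' adjective, 'r' adverb) and the trailing number is the sense index, with 01 the most common sense.

    from nltk.corpus import wordnet as wn

    s = wn.synset('cat.n.01')
    print(s.pos())         # 'n' -> this synset is a noun sense
    print(s.definition())  # the first (most common) noun sense of "cat"

    # The same lemma under a different part of speech and sense number:
    print(wn.synset('cat.v.01').definition())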
