nlp

How to ignore punctuation in-between words using word_tokenize in NLTK?

和自甴很熟 submitted on 2021-01-04 06:41:40
Question: I'm looking to ignore characters in-between words using NLTK word_tokenize. If I have a sentence:

    test = 'Should I trade on the S&P? This works with a phone number 333-445-6635 and email test@testing.com'

the word_tokenize method splits "S&P" into 'S', '&', 'P', '?'. Is there a way to have this library ignore punctuation between words or letters? Expected output: 'S&P', '?'

Answer 1: Let me know how this works with your sentences. I added an additional test with a bunch of punctuation. The …
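One way to get that behavior (a minimal sketch of my own, not the answerer's code, assuming NLTK's RegexpTokenizer): define tokens as runs of word characters optionally joined by the in-word punctuation you want to preserve, and let every other non-space character stand alone.

    from nltk.tokenize import RegexpTokenizer

    # Word characters optionally joined by &, @, ., or - cover S&P,
    # 333-445-6635, and test@testing.com; any other non-space
    # character becomes its own token.
    tokenizer = RegexpTokenizer(r"\w+(?:[&@.\-]\w+)*|\S")

    test = ('Should I trade on the S&P? This works with a phone number '
            '333-445-6635 and email test@testing.com')
    print(tokenizer.tokenize(test))
    # ['Should', 'I', 'trade', 'on', 'the', 'S&P', '?', 'This', ...]

The character class is the knob: add or remove punctuation there to control what counts as in-word.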

TfidfVectorizer - How can I check out processed tokens?

♀尐吖头ヾ submitted on 2021-01-04 05:40:43
Question: How can I check the strings tokenized inside TfidfVectorizer()? If I don't pass anything in the arguments, TfidfVectorizer() will tokenize the string with some pre-defined methods. I want to observe how it tokenizes strings so that I can more easily tune my model.

    from sklearn.feature_extraction.text import TfidfVectorizer
    corpus = ['This is the first document.',
              'This document is the second document.',
              'And this is the third one.',
              'Is this the first document?']
    vectorizer = TfidfVectorizer…
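A sketch of one way to inspect this with scikit-learn's public API (my own illustration, not from the question): build_analyzer() returns the exact preprocess-and-tokenize callable the vectorizer applies to each document, and get_feature_names_out() (get_feature_names() on scikit-learn older than 1.0) lists the vocabulary it ends up with.

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ['This is the first document.',
              'This document is the second document.']
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus)

    # The analyzer is what TfidfVectorizer runs on each raw document:
    # lowercasing, token_pattern matching, stop-word filtering, n-grams.
    analyzer = vectorizer.build_analyzer()
    print(analyzer('This is the first document.'))
    # ['this', 'is', 'the', 'first', 'document']

    # The vocabulary actually learned during fit().
    print(vectorizer.get_feature_names_out())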

WordPiece tokenization versus conventional lemmatization?

蹲街弑〆低调 submitted on 2021-01-02 06:28:10
Question: I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing"). Right now, I have my text preprocessed using a standard tokenizer that splits on spaces / some punctuation, and then I have a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece …
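To see the difference concretely, here is a small sketch (assuming the Hugging Face transformers package, which the question does not mention; the exact splits depend on the model's learned vocabulary):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    # Frequent words are often kept whole, so "playing" may not split at
    # all, while rarer words fall back to subword pieces marked with "##".
    print(tokenizer.tokenize('playing'))
    print(tokenizer.tokenize('unaffable'))

Unlike a lemmatizer, the split points are chosen to maximize vocabulary coverage, not to recover a linguistic stem, so the pieces need not be morphemes.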

Repeated entity labels when replacing entities with their entity label using spaCy

南楼画角 submitted on 2021-01-01 09:26:08
Question: Code:

    import spacy
    nlp = spacy.load("en_core_web_md")

    # read txt file, each string on its own line
    with open("./try.txt", "r") as f:
        texts = f.read().splitlines()

    # substitute entities with their TAGS
    docs = nlp.pipe(texts)
    out = []
    for doc in docs:
        out_ = ""
        for tok in doc:
            text = tok.text
            if tok.ent_type_:
                text = tok.ent_type_
            out_ += text + tok.whitespace_
        out.append(out_)

    # write to file
    with open("./out_try.txt", "w") as f:
        f.write("\n".join(out))

Contents of input file: Georgia recently …
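With the per-token loop above, a multi-token entity such as "New York" is replaced by its label once per token ("GPE GPE"). A sketch of one fix (my own, not from the question; the example sentence is hypothetical): splice each entity's label over its character span via doc.ents, so every entity is written exactly once.

    import spacy

    nlp = spacy.load("en_core_web_md")
    doc = nlp("Georgia recently moved to New York")

    # Replace each entity span with its label exactly once, keeping the
    # surrounding text verbatim.
    parts, last_end = [], 0
    for ent in doc.ents:
        parts.append(doc.text[last_end:ent.start_char])  # text before entity
        parts.append(ent.label_)                         # the label, once
        last_end = ent.end_char
    parts.append(doc.text[last_end:])                    # trailing text
    print("".join(parts))
    # e.g. "GPE recently moved to GPE"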

Using a Gensim FastText model with an LSTM NN in Keras

只谈情不闲聊 submitted on 2020-12-31 14:52:51
Question: I have trained a fastText model with Gensim over a corpus of very short sentences (up to 10 words). I know that my test set includes words that are not in my train corpus, i.e. some of the words in my corpus are like "Oxytocin", "Lexitocin", "Ematrophin", "Betaxitocin". Given a new word in the test set, fastText knows pretty well how to generate a vector with high cosine similarity to the other similar words in the train set, by using character-level n-grams. How do I incorporate the fastText model …
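A sketch of one common pattern (my own illustration; the path and vocabulary below are hypothetical): ask gensim's FastText for a vector for every word you expect at train or test time (it synthesizes vectors for unseen words from character n-grams), stack them into a matrix, and load that matrix into a frozen Keras Embedding layer in front of the LSTM.

    import numpy as np
    from gensim.models import FastText
    from tensorflow import keras

    ft = FastText.load("fasttext.model")  # hypothetical path
    vocab = ["oxytocin", "lexitocin", "ematrophin", "betaxitocin"]  # hypothetical
    word_index = {w: i + 1 for i, w in enumerate(vocab)}  # index 0 = padding

    dim = ft.wv.vector_size
    matrix = np.zeros((len(vocab) + 1, dim))
    for word, i in word_index.items():
        matrix[i] = ft.wv[word]  # works even for OOV words via char n-grams

    model = keras.Sequential([
        keras.layers.Embedding(
            len(vocab) + 1, dim,
            embeddings_initializer=keras.initializers.Constant(matrix),
            trainable=False, mask_zero=True),
        keras.layers.LSTM(64),
        keras.layers.Dense(1, activation="sigmoid"),
    ])

The limitation is that the matrix is fixed at build time; fully open-ended OOV handling means computing ft.wv[word] vectors outside the model and feeding the vector sequences to the LSTM directly.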

CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

爷，独闯天下 submitted on 2020-12-30 06:12:46
Question: I got the following error when I ran my PyTorch deep learning model in Colab:

    /usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
       1370         ret = torch.addmm(bias, input, weight.t())
       1371     else:
    -> 1372         output = input.matmul(weight.t())
       1373     if bias is not None:
       1374         output += bias

    RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

I even reduced the batch size from 128 to 64, i.e. to half, but I still got this error …
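Worth knowing before chasing memory: despite the ALLOC_FAILED wording, this CUBLAS error is very often an out-of-range index, typically a label equal to or larger than the final Linear layer's output size, rather than GPU memory exhaustion. A diagnostic sketch (my own, with hypothetical names):

    import torch

    # Hypothetical: `labels` are your class indices and the model ends in
    # nn.Linear(hidden_dim, num_classes).
    num_classes = 10
    labels = torch.tensor([0, 3, 9])

    # Class indices must lie in [0, num_classes); a stray -1 or an
    # off-by-one class count triggers cryptic CUDA-side failures.
    assert labels.min().item() >= 0
    assert labels.max().item() < num_classes

Re-running the same batch on the CPU usually surfaces a much clearer error message than the asynchronous CUDA one.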

WordNet - What does n and the number represent?

半世苍凉 submitted on 2020-12-29 13:14:21
Question: My question is related to the WordNet interface.

    >>> wn.synsets('cat')
    [Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'),
     Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), Synset('big_cat.n.01'),
     Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]

I could not find an answer to what the purpose of the n and the following number is in cat.n.01 or caterpillar.n.02.

Answer 1: Per the NLTK docs, a <lemma>.<pos>.<number> Synset string …
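A quick way to confirm this in NLTK (a sketch using the standard wordnet interface): the middle field is the part of speech ('n' noun, 'v' verb, 'a' adjective, 'r' adverb) and the trailing number is the sense index, with 01 the most common sense.

    from nltk.corpus import wordnet as wn

    s = wn.synset('cat.n.01')
    print(s.pos())         # 'n' -> this synset is a noun sense
    print(s.definition())  # the first (most common) noun sense of "cat"

    # The same lemma under a different part of speech and sense number:
    print(wn.synset('cat.v.01').definition())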
