nlp

After training word embeddings with gensim's FastText wrapper, how do I embed new sentences?

Submitted by 隐身守侯 on 2021-01-07 03:56:06
Question: After reading the tutorial in gensim's docs, I do not understand the correct way of generating new embeddings from a trained model. So far I have trained gensim's FastText embeddings like this:

    from gensim.models.fasttext import FastText as FT_gensim

    model_gensim = FT_gensim(size=100)

    # build the vocabulary
    model_gensim.build_vocab(corpus_file=corpus_file)

    # train the model
    model_gensim.train(
        corpus_file=corpus_file,
        epochs=model_gensim.epochs,
        total_examples=model_gensim.corpus
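One common way to embed new text with a trained FastText model is to look up each token (FastText composes vectors for out-of-vocabulary words from character n-grams) and average the results. A minimal sketch; the sentence_vector helper is hypothetical, not part of gensim's API:

    import numpy as np

    def sentence_vector(model, sentence):
        # Hypothetical helper: average the per-token FastText vectors.
        # model.wv[token] also works for unseen words thanks to subword n-grams.
        tokens = sentence.lower().split()
        return np.mean([model.wv[t] for t in tokens], axis=0)

    new_vec = sentence_vector(model_gensim, "a brand new sentence to embed")
    print(new_vec.shape)   # (100,) given size=100 above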

Is there a method to select all MeSH categories with SPARQL?

Submitted by 前提是你 on 2021-01-07 03:22:32
Question: I want to get data with SPARQL from the Medical Subject Headings (MeSH) RDF. I tried this query:

    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
    PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
    PREFIX mesh2015: <http://id.nlm.nih.gov/mesh/2015/>
    PREFIX mesh2016: <http://id.nlm.nih.gov/mesh/2016/>
    PREFIX mesh2017: <http://id.nlm.nih.gov/mesh/2017/>

    SELECT DISTINCT ?descriptor ?label
    FROM <http://id.nlm.nih.gov
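To run such a query from Python, one option is SPARQLWrapper against NLM's public MeSH SPARQL service. A minimal sketch; the endpoint URL, graph URI, and the meshv:TopicalDescriptor class are assumptions to adjust to your setup:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Assumed endpoint and graph; change them if you query a local copy of MeSH.
    endpoint = SPARQLWrapper("https://id.nlm.nih.gov/mesh/sparql")
    endpoint.setQuery("""
        PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
        PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?descriptor ?label
        FROM <http://id.nlm.nih.gov/mesh>
        WHERE {
            ?descriptor a meshv:TopicalDescriptor ;
                        rdfs:label ?label .
        }
        LIMIT 25
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["descriptor"]["value"], row["label"]["value"])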

How can I modify the language model before applying patterns?

Submitted by 亡梦爱人 on 2021-01-07 02:49:59
Question: I have this code:

    import spacy
    from spacy.matcher import Matcher, PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab, validate=True)

    patterns = [
        [{'POS': 'QUALIF'}, {'POS': 'CCONJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
    ]
    matcher.add("process_1", None, *patterns)

    texts = ["it is a beautiful and big apple"]
    for text in texts:
        doc = nlp(text)
        matches = matcher(doc)
        for _, start, end in matches:
            print(doc[start:end].text)

So, I want to
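A note on the pattern: 'QUALIF' is not one of the Universal POS tags that en_core_web_sm produces, so that token spec can never match. A minimal sketch, assuming the intent was to match an adjective + conjunction + adjective + noun span using the standard 'ADJ' tag:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab, validate=True)

    patterns = [
        [{'POS': 'ADJ'}, {'POS': 'CCONJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
    ]
    # spaCy 2.x signature, as in the question; spaCy 3.x uses matcher.add("process_1", patterns)
    matcher.add("process_1", None, *patterns)

    doc = nlp("it is a beautiful and big apple")
    for _, start, end in matcher(doc):
        print(doc[start:end].text)   # expected: "beautiful and big apple"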

What is a reliable way to convert text data (documents) to numerical vectors and save them for later use?

Submitted by 落爺英雄遲暮 on 2021-01-07 02:44:58
Question: Machines can't understand text directly, only numbers, so in NLP we convert text to some numeric representation; one of these is the bag-of-words (BOW) representation. My objective is to convert every document to a numeric representation and save it for future use. Currently I do this by converting the text to BOW and saving it in a pickle file. My question is whether we can do this in a better and more reliable way, so that every document can be saved as some vector
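One common and reproducible approach is to persist the fitted vectorizer separately from the sparse document-term matrix, so new documents can later be mapped into the same vocabulary. A minimal sketch with scikit-learn; the file names and toy corpus are illustrative:

    import joblib
    from scipy import sparse
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog ate my homework"]

    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(docs)               # sparse (n_docs, n_terms) matrix

    joblib.dump(vectorizer, "bow_vectorizer.joblib")   # keep the vocabulary for reuse
    sparse.save_npz("bow_matrix.npz", bow)

    # Later: load the vectorizer and encode new documents with the same vocabulary
    vectorizer = joblib.load("bow_vectorizer.joblib")
    new_vec = vectorizer.transform(["the cat ate my mat"])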

Is it possible to find uncertainties of spaCy POS tags?

Submitted by 眉间皱痕 on 2021-01-05 09:01:16
Question: I am trying to build a non-English spell checker that relies on spaCy's analysis of sentences, which lets my algorithm use the POS tags and grammatical dependencies of individual tokens to detect incorrect spelling (in my case, specifically, incorrect splits in Dutch compound words). However, spaCy appears to tag sentences incorrectly when they contain grammatical errors, for example classifying a noun as a verb, even though the classified word doesn't even
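For reference, the token-level attributes this approach relies on are read as shown below; spaCy's pos_ and dep_ expose only the single best label per token, not a confidence score. A minimal sketch, assuming the small Dutch pipeline nl_core_news_sm (any Dutch model works) and an illustrative sentence with an incorrectly split compound:

    import spacy

    nlp = spacy.load("nl_core_news_sm")
    doc = nlp("De lucht ballon steeg langzaam op.")   # "luchtballon" incorrectly split

    for token in doc:
        # POS tag, dependency relation, and syntactic head for each token
        print(token.text, token.pos_, token.dep_, token.head.text)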

Problems with gensim WikiCorpus - aliasing chunkize to chunkize_serial; (__mp_main__ instead of __main__?)

Submitted by 一曲冷凌霜 on 2021-01-05 06:48:32
Question: I'm quite new to Python and coding in general, and I seem to have run into an issue. I'm trying to run this code (credit to Matthew Mayo; the whole thing can be found here):

    # import warnings
    # warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
    import sys
    from gensim.corpora import WikiCorpus

    def make_corpus(in_f, out_f):
        print(0)
        output = open(out_f, 'w', encoding='utf-8')
        print(1)
        wiki = WikiCorpus(in_f)
        print(2)
        i = 0
        for text in wiki.get_texts():
            output
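The __mp_main__ in the warning usually means the module is being re-imported by a multiprocessing worker (gensim's WikiCorpus spawns worker processes), which on Windows re-executes top-level code. The usual remedy is to guard the entry point. A minimal sketch, assuming the script is run as python make_wiki_corpus.py <dump.xml.bz2> <out.txt> (file names are illustrative):

    import sys
    from gensim.corpora import WikiCorpus

    def make_corpus(in_f, out_f):
        wiki = WikiCorpus(in_f)
        with open(out_f, 'w', encoding='utf-8') as output:
            for i, text in enumerate(wiki.get_texts(), start=1):
                output.write(' '.join(text) + '\n')
                if i % 10000 == 0:
                    print(f'Processed {i} articles')

    if __name__ == '__main__':
        # The guard keeps worker processes (imported as __mp_main__) from
        # re-running the corpus build when they import this module.
        make_corpus(sys.argv[1], sys.argv[2])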

Sliding window for long text in BERT for Question Answering

Submitted by 岁酱吖の on 2021-01-05 00:51:51
Question: I've read a post which explains how the sliding window works, but I cannot find any information on how it is actually implemented. From what I understand, if the input is too long, a sliding window can be used to process the text. Please correct me if I am wrong. Say I have the text "In June 2017 Kaggle announced that it passed 1 million registered users". Given some stride and max_len, the input can be split into chunks with overlapping words (not considering padding). In June 2017 Kaggle
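In practice, Hugging Face's fast tokenizers provide this overlapping-chunk behaviour via return_overflowing_tokens and stride. A minimal sketch, assuming the transformers library; the max_length and stride values are deliberately tiny so the overlap is visible:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    text = "In June 2017 Kaggle announced that it passed 1 million registered users"
    enc = tokenizer(
        text,
        max_length=8,                    # window size in tokens (including special tokens)
        stride=3,                        # tokens shared between consecutive chunks
        truncation=True,
        return_overflowing_tokens=True,
    )

    for ids in enc["input_ids"]:
        print(tokenizer.decode(ids))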
