nlp

Percentage count of verbs and nouns using spaCy?

Submitted by 泪湿孤枕 on 2019-12-11 17:08:47
Question: I want to count the percentage split of POS tags in a sentence using spaCy, similar to "Count verbs, nouns, and other parts of speech with python's NLTK". I am currently able to detect and count POS tags; how do I find the percentage split?

from __future__ import unicode_literals
import spacy, en_core_web_sm
from collections import Counter

nlp = en_core_web_sm.load()
print Counter([token.pos_ for token in nlp('The cat sat on the mat.')])

Current output: Counter({u'NOUN': 2, u'DET': 2, u'VERB': 1, u'ADP': 1, u'PUNCT': 1})
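One way to turn those counts into percentages, as a minimal sketch (assuming spaCy 2.x with the small English model, as in the question):

import en_core_web_sm
from collections import Counter

nlp = en_core_web_sm.load()
doc = nlp('The cat sat on the mat.')

# Count POS tags, then normalise by the total number of tokens
counts = Counter(token.pos_ for token in doc)
total = sum(counts.values())
percentages = {pos: 100.0 * n / total for pos, n in counts.items()}
print(percentages)  # e.g. {'NOUN': 28.57..., 'DET': 28.57..., 'VERB': 14.28..., ...}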

Issues in Gensim WordRank Embeddings

Submitted by 混江龙づ霸主 on 2019-12-11 16:59:37
Question: I am using the Gensim wrapper to obtain WordRank embeddings (I am following their tutorial to do this) as follows:

from gensim.models.wrappers import Wordrank

model = Wordrank.train(wr_path="models", corpus_file="proc_brown_corp.txt", out_name="wr_model")
model.save("wordrank")
model.save_word2vec_format("wordrank_in_word2vec.vec")

However, I am getting the following error:

FileNotFoundError: [WinError 2] The system cannot find the file specified

I am just wondering what I did wrong.
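A likely cause (an assumption here, since the traceback is cut off): Gensim's Wordrank wrapper shells out to the compiled WordRank executables, so [WinError 2] usually means wr_path does not point at a directory that actually contains those binaries. A minimal sanity check, with a hypothetical install path:

import os
from gensim.models.wrappers import Wordrank

# wr_path must be the directory holding the compiled WordRank binaries,
# not an output folder; the binaries have to be built first from the
# WordRank sources.
wr_path = r"C:\tools\wordrank"  # hypothetical install location
assert os.path.isdir(wr_path), "WordRank binaries not found at wr_path"

model = Wordrank.train(wr_path=wr_path,
                       corpus_file="proc_brown_corp.txt",
                       out_name="wr_model")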

spaCy LIKE_NUM cast to its Python number equivalent

Submitted by 梦想的初衷 on 2019-12-11 16:28:14
Question: Does spaCy provide a quick conversion from a LIKE_NUM token to a Python float or decimal? spaCy can match a LIKE_NUM token like "31,2", "10.9", "10", "ten", etc. Does it provide a quick way to get a Python number as well? I was expecting a method like .get_value() to return the number (not the string), but I couldn't find any.

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
text = "this is just a text and a number 10,2 or 10.2 meaning ten point two"
doc = nlp(text)
pattern = [{ …
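As far as I know spaCy has no built-in LIKE_NUM-to-number conversion, so a small helper is needed. A sketch (using the spaCy v3 Matcher API; the word lookup table is a stand-in you would extend):

import spacy
from spacy.matcher import Matcher

WORDS = {"ten": 10.0, "two": 2.0}  # hypothetical lookup table for number words

def to_number(text):
    try:
        return float(text.replace(",", "."))  # treat "10,2" as a decimal comma
    except ValueError:
        return WORDS.get(text.lower())  # fall back to the word table

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("NUM", [[{"LIKE_NUM": True}]])

doc = nlp("this is just a text and a number 10,2 or 10.2 meaning ten point two")
for _, start, end in matcher(doc):
    span = doc[start:end]
    print(span.text, to_number(span.text))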

How to get a parse in a bracketed format (without POS tags)?

Submitted by 非 Y 不嫁゛ on 2019-12-11 16:24:05
Question: I want to parse a sentence into a binary parse of this form (the format used in the SNLI corpus):

sentence: "A person on a horse jumps over a broken down airplane."
parse: ( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )

I'm unable to find a parser which does this. Note: this question has been asked earlier ("How to get a binary parse in Python"), but the answers are not helpful, and I was unable to comment because I do not have the required reputation.
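One workaround (a sketch, not the SNLI pipeline itself): take a constituency parse as an nltk.Tree, binarize it with chomsky_normal_form(), and print the bracketing with all labels stripped:

from nltk import Tree

def strip_labels(tree):
    # Keep only the bracketing: drop POS and phrase labels
    if isinstance(tree, str):
        return tree
    children = [strip_labels(child) for child in tree]
    if len(children) == 1:
        return children[0]
    return "( " + " ".join(children) + " )"

# A hand-written parse for illustration; in practice this would come
# from a constituency parser (e.g. Stanford CoreNLP via nltk).
t = Tree.fromstring("(S (NP (DT A) (NN person)) (VP (VBZ jumps)) (. .))")
t.chomsky_normal_form()  # make the tree binary
print(strip_labels(t))   # ( ( A person ) ( jumps . ) )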

What exactly does target_vocab_size mean in the method tfds.features.text.SubwordTextEncoder.build_from_corpus?

Submitted by 帅比萌擦擦* on 2019-12-11 16:06:19
Question: According to this link: target_vocab_size: int, approximate size of the vocabulary to create. The statement is pretty ambiguous to me. As far as I understand, the encoder will map each vocabulary item to a unique ID. What will happen if the corpus has a vocab_size larger than the target_vocab_size?

Answer 1: The documentation says: "Encoding is fully invertible because all out-of-vocab wordpieces are byte-encoded", which means unknown word pieces will be encoded one character at a time. It's best …
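A small demonstration of that fallback behaviour (a sketch, assuming a tensorflow_datasets version that still ships tfds.features.text):

import tensorflow_datasets as tfds

corpus = ["the quick brown fox", "jumps over the lazy dog"]
encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (s for s in corpus), target_vocab_size=100)

# An out-of-vocabulary word is split into subwords and, at worst, bytes,
# so decoding always reproduces the input exactly.
ids = encoder.encode("quixotic")
print(ids)
print(encoder.decode(ids))  # -> "quixotic"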

How to compare the meaningfulness of a set of phrases that describe the same concept in NLP?

Submitted by 血红的双手。 on 2019-12-11 16:06:08
Question: I have two terms, "vehicle" and "motor vehicle". Is there any way to compare the meaningfulness level or ambiguity level of these two in NLP? The outcome should be that "motor vehicle" is more meaningful than "vehicle", or that "vehicle" is more ambiguous than "motor vehicle". Thanks.

Answer 1: The question you ask is very broad (see here), but here are some hints for starting points: WordNet; word embeddings (word2vec, GloVe). The meaning difference you are looking into is quite peculiar, so I suggest …
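One concrete way to operationalise "ambiguity" along the WordNet line suggested above (a sketch, using the number of senses as a rough proxy):

from nltk.corpus import wordnet as wn

# More synsets = more senses = (roughly) more ambiguous.
print(len(wn.synsets("vehicle")))        # several senses
print(len(wn.synsets("motor_vehicle")))  # fewer senses, more specific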

Regex [A-Z] Does Not Recognize Locale Characters

Submitted by 穿精又带淫゛_ on 2019-12-11 16:05:36
Question: I've checked other questions and read their solutions; they do not work. I've tested the regular expression, and it works on non-locale characters. The code simply finds any capital letters in a string and runs some procedure on them: for example, "minikŞeker bir kedi" would return "kŞe". However, my code does not recognize Ş as a letter within [A-Z]. When I try re.LOCALE, as some people suggest, I get the error ValueError: cannot use LOCALE flag with a str pattern when I use re.UNICODE.

import re
corp = …
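The underlying issue is that [A-Z] is a literal ASCII range, so Ş can never match it regardless of flags. Two alternatives (a sketch; the third-party regex module is an extra dependency):

corp = "minikŞeker bir kedi"

# Option 1: skip regex character classes and use str.isupper(),
# which is Unicode-aware in Python 3.
caps = [c for c in corp if c.isupper()]
print(caps)  # ['Ş']

# Option 2: the third-party `regex` module supports Unicode properties:
#   import regex
#   regex.findall(r"\p{Lu}", corp)  # also matches Ş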

What is currently the best way to add a custom dictionary to a neural machine translator that uses the transformer architecture?

Submitted by 依然范特西╮ on 2019-12-11 15:56:32
Question: It's common to add a custom dictionary to a machine translator to ensure that terminology from a specific domain is correctly translated. For example, the term "server" should be translated differently when the document is about data centers versus when the document is about restaurants. With a transformer model, this is not very obvious to do, since words are not aligned 1:1. I've seen a couple of papers on this topic, but I'm not sure which would be the best one to use. What are the best …
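One commonly used technique is placeholder (copy-through) pre/post-processing: replace glossary terms with tags the model passes through, then restore the target-side terms afterwards. A minimal sketch with a hypothetical glossary and a stubbed-out translate step:

term_dict = {"server": "serveur"}  # hypothetical domain glossary (EN -> FR)

def apply_placeholders(src, glossary):
    # Swap each source term for a tag the model should copy verbatim
    mapping = {}
    for i, (source_term, target_term) in enumerate(glossary.items()):
        tag = "TERM%d" % i
        if source_term in src:
            src = src.replace(source_term, tag)
            mapping[tag] = target_term
    return src, mapping

def restore_terms(hyp, mapping):
    # Replace surviving tags in the hypothesis with the glossary translations
    for tag, target_term in mapping.items():
        hyp = hyp.replace(tag, target_term)
    return hyp

src, mapping = apply_placeholders("restart the server", term_dict)
hyp = "redémarrer le TERM0"   # stand-in for model output: hyp = translate(src)
print(restore_terms(hyp, mapping))  # -> "redémarrer le serveur"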

Co-occurrence matrix for TF-IDF vectorizer for top 2000 words

Submitted by 余生长醉 on 2019-12-11 15:49:47
Question: I computed a TF-IDF vectorizer for text data and got vectors of shape (100000, 2000) with max_features = 2000. I am computing the co-occurrence matrix with the code below:

length = 2000
m = np.zeros([length, length])
# n is the count of all words

def cal_occ(sentence, m):
    for i, word in enumerate(sentence):
        print(i)
        print(word)
        for j in range(max(i - window, 0), min(i + window, length)):
            print(j)
            print(sentence[j])
            m[word, sentence[j]] += 1

for sentence in tf_vec:
    cal_occ(sentence, m)

I am getting the following error: 0 …
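The likely problem (an assumption from the snippet alone): cal_occ is being fed rows of the TF-IDF matrix, whose entries are float weights, while m[word, sentence[j]] needs integer word indices. A co-occurrence matrix is normally built from index-encoded token sequences instead; a sketch:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vec = TfidfVectorizer(max_features=2000)
vec.fit(docs)
vocab = vec.vocabulary_  # word -> column index, shared with the TF-IDF matrix

length, window = len(vocab), 2
m = np.zeros((length, length))

for doc in docs:
    ids = [vocab[w] for w in doc.split() if w in vocab]  # integer indices
    for i, wi in enumerate(ids):
        lo, hi = max(i - window, 0), min(i + window + 1, len(ids))
        for j in range(lo, hi):
            if j != i:
                m[wi, ids[j]] += 1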

SolrCloud OpenNLP error Can't find resource 'opennlp/en-sent.bin' in classpath or '/configs/_default'

Submitted by 梦想的初衷 on 2019-12-11 15:41:08
Question: I get an error when using Apache OpenNLP with Solr (ver. 7.3.0) in Cloud mode. When I add a field type to managed-schema using OpenNLP like this:

<fieldType name="text_opennlp" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="opennlp/en-sent.bin"
               tokenizerModel="opennlp/en-token.bin"/>
  </analyzer>
</fieldType>
<field name="content" type="text_opennlp" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" …