spacy

SpaCy: how to load Google news word2vec vectors?

微笑、不失礼 submitted on 2019-12-02 19:12:25
I've tried several methods of loading the Google News word2vec vectors ( https://code.google.com/archive/p/word2vec/ ):

    en_nlp = spacy.load('en', vector=False)
    en_nlp.vocab.load_vectors_from_bin_loc('GoogleNews-vectors-negative300.bin')

The above gives:

    MemoryError: Error assigning 18446744072820359357 bytes

I've also tried with the .gz-packed vectors, and by loading and saving them with gensim to a new format:

    from gensim.models.word2vec import Word2Vec
    model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    model.save_word2vec_format('googlenews2.txt')

This …
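A minimal sketch of one way to get those vectors into spaCy, assuming spaCy 2.x and gensim are installed (the gensim attribute names differ between 3.x and 4.x, and the full GoogleNews file needs several GB of RAM):

    import spacy
    from gensim.models import KeyedVectors

    # Load the binary word2vec file with gensim rather than spaCy.
    wv = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)

    # Copy every vector into a blank English vocab.
    nlp = spacy.blank('en')
    for word in wv.index2word:  # gensim 3.x; use wv.index_to_key on gensim 4.x
        nlp.vocab.set_vector(word, wv[word])

    print(nlp('king').vector[:5])

spaCy 2.x also ships an init-model command (python -m spacy init-model) that can build a model directory from vectors in word2vec text format, which avoids doing the copy in Python.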

Is it possible to change the token split rules for a Spacy tokenizer?

ε祈祈猫儿з submitted on 2019-12-02 18:46:28
Question: The (German) spaCy tokenizer does not split on slashes, underscores, or asterisks by default, which is just what I need (so "der/die" results in a single token). However, it does split on parentheses, so "dies(und)das" gets split into 5 tokens. Is there a (simple) way to tell the default tokenizer to also not split on parentheses that are enclosed by letters on both sides without a space? How exactly are those splits on parentheses defined for a tokenizer? Answer 1: The split on parentheses is …
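A minimal sketch of one way to do this, assuming spaCy 2.x and an installed German model: rebuild the infix rules without the pattern that splits on parentheses between letters. The filter on the raw pattern strings below is a heuristic and may need adjusting for your spaCy version:

    import spacy
    from spacy.util import compile_infix_regex

    nlp = spacy.load('de_core_news_sm')

    # Keep every default infix pattern except those that mention literal
    # parentheses, then recompile and swap in the new infix_finditer.
    infixes = [pattern for pattern in nlp.Defaults.infixes if r'\(' not in pattern]
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

    print([t.text for t in nlp('dies(und)das')])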

What do spaCy's part-of-speech and dependency tags mean?

≯℡__Kan透↙ submitted on 2019-12-02 16:38:14
spaCy tags up each of the Tokens in a Document with a part of speech (in two different formats, one stored in the pos and pos_ properties of the Token and the other stored in the tag and tag_ properties) and a syntactic dependency to its .head token (stored in the dep and dep_ properties). Some of these tags are self-explanatory, even to somebody like me without a linguistics background:

    >>> import spacy
    >>> en_nlp = spacy.load('en')
    >>> document = en_nlp("I shot a man in Reno just to watch him die.")
    >>> document[1]
    shot
    >>> document[1].pos_
    'VERB'

Others... are not:

    >>> document[1].tag_ …
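For the less obvious codes, spacy.explain() returns a short human-readable description for most tag, POS, and dependency labels. A minimal sketch, assuming an installed English model:

    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("I shot a man in Reno just to watch him die.")
    for token in doc:
        # explain() returns None for labels it does not know about.
        print(token.text, token.pos_, token.tag_, token.dep_,
              spacy.explain(token.tag_), spacy.explain(token.dep_))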

spaCy - Tokenization of Hyphenated words

青春壹個敷衍的年華 submitted on 2019-12-02 11:13:08
Good day SO, I am trying to post-process hyphenated words that are tokenized into separate tokens when they were supposedly a single token. For example:

    Sentence: "up-scaled"
    Tokens: ['up', '-', 'scaled']
    Expected: ['up-scaled']

For now, my solution is to use the matcher:

    matcher = Matcher(nlp.vocab)
    pattern = [{'IS_ALPHA': True, 'IS_SPACE': False},
               {'ORTH': '-'},
               {'IS_ALPHA': True, 'IS_SPACE': False}]
    matcher.add('HYPHENATED', None, pattern)

    def quote_merger(doc):
        # this will be called on the Doc object in the pipeline
        matched_spans = []
        matches = matcher(doc)
        for match_id, start, …
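A minimal sketch of one way such a merging component could look, assuming spaCy 2.x (the component name and the filter_spans call are my additions, not the asker's code):

    import spacy
    from spacy.matcher import Matcher
    from spacy.util import filter_spans

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    pattern = [{'IS_ALPHA': True}, {'ORTH': '-'}, {'IS_ALPHA': True}]
    matcher.add('HYPHENATED', None, pattern)

    def hyphen_merger(doc):
        # Merge each matched "word - word" span back into a single token.
        spans = [doc[start:end] for _, start, end in matcher(doc)]
        with doc.retokenize() as retokenizer:
            for span in filter_spans(spans):  # drop overlapping matches
                retokenizer.merge(span)
        return doc

    nlp.add_pipe(hyphen_merger, first=True)
    print([t.text for t in nlp('The image was up-scaled.')])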

Trying to install anything with pip on macOS and cannot

五迷三道 submitted on 2019-12-02 08:30:17
I think I have a problem with my macOS system. Everything I try to install with pip gives me the same errors over and over again; I have pasted just the lines that display an error so as not to overcrowd this thread:

    Collecting murmurhash3
      Using cached https://files.pythonhosted.org/packages/b5/f4/1f9c4851667a2541bd151b8d9efef707495816274fada365fa6a31085a32/murmurhash3-2.3.5.tar.gz
    Building wheels for collected packages: murmurhash3
      Running setup.py bdist_wheel for murmurhash3 ... error
      Complete output from command /usr/local/opt/python/bin/python3.7 -u -c "import setuptools, tokenize;__file__= …

Spacy, matcher with entities spanning more than a single token

旧巷老猫 submitted on 2019-12-02 03:27:18
I am trying to create a matcher that finds negated custom entities in the text. It is working fine for entities that span a single token, but I am having trouble trying to capture entities that span more than one token. As an example, let's say that my custom entities are animals (and are labeled as token.ent_type_ = "animal" ): ["cat", "dog", "artic fox"] (note that the last entity has two words). Now I want to find those entities in the text but negated, so I can create a simple matcher with the following pattern:

    [{'lower': 'no'}, {'ENT_TYPE': {'REGEX': 'animal', 'OP': '+'}}]

And for example …
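A minimal sketch of one workaround, assuming spaCy 2.x: mark the animals with an EntityRuler, merge each entity into a single token with the built-in merge_entities component, and then the negation pattern only needs two tokens. The ANIMAL label and example sentence are my stand-ins, and "artic fox" keeps the asker's spelling:

    import spacy
    from spacy.matcher import Matcher
    from spacy.pipeline import EntityRuler

    nlp = spacy.load('en_core_web_sm')
    ruler = EntityRuler(nlp, patterns=[
        {'label': 'ANIMAL', 'pattern': 'cat'},
        {'label': 'ANIMAL', 'pattern': 'dog'},
        {'label': 'ANIMAL', 'pattern': 'artic fox'},
    ])
    nlp.add_pipe(ruler)
    # After merging, "artic fox" is a single token, so ENT_TYPE matching is enough.
    nlp.add_pipe(nlp.create_pipe('merge_entities'))

    matcher = Matcher(nlp.vocab)
    matcher.add('NEGATED_ANIMAL', None, [{'LOWER': 'no'}, {'ENT_TYPE': 'ANIMAL'}])

    doc = nlp('There is no artic fox in the zoo.')
    print([doc[start:end].text for _, start, end in matcher(doc)])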

How to write a spaCy matcher for a POS regex

笑着哭i submitted on 2019-12-01 07:04:13
spaCy has two features I'd like to combine: part-of-speech (POS) tags and rule-based matching. How can I combine them in a neat way? For example, let's say the input is a single sentence and I'd like to verify that it meets some POS-ordering condition, for example that the verb comes after the noun (something like a noun**verb regex). The result should be true or false. Is that doable, or is the matcher specific like in the example? Can rule-based matching have POS rules? If not, here is my current plan: gather everything in one string and apply a regex.

    import spacy
    nlp = spacy.load('en')
    #doc = nlp(u'is there any way …
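It is doable without the string round-trip: Matcher patterns can test the POS attribute directly, and a token dict that only has an OP acts as a wildcard. A minimal sketch, assuming spaCy 2.x and an installed English model:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    # "a noun, then any tokens, then a verb": roughly the noun**verb idea above.
    matcher.add('NOUN_THEN_VERB', None,
                [{'POS': 'NOUN'}, {'OP': '*'}, {'POS': 'VERB'}])

    def noun_before_verb(sentence):
        return len(matcher(nlp(sentence))) > 0

    print(noun_before_verb('The dog barked loudly.'))  # expected True
    print(noun_before_verb('Sleep well.'))             # expected False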

How do I create gold data for TextCategorizer training?

十年热恋 submitted on 2019-12-01 05:41:27
I want to train a TextCategorizer model with the following (text, label) pairs.

Label COLOR: The door is brown. The barn is red. The flower is yellow.
Label ANIMAL: The horse is running. The fish is jumping. The chicken is asleep.

I am copying the example code in the documentation for TextCategorizer:

    textcat = TextCategorizer(nlp.vocab)
    losses = {}
    optimizer = nlp.begin_training()
    textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)

The doc variables will presumably be just nlp("The door is brown.") and so on. What should be in gold1 and gold2 ? I'm guessing they …
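A minimal sketch of one way to build the gold values, assuming spaCy 2.x: each gold can be a GoldParse whose cats dict maps every label to 1.0 (the text has that label) or 0.0 (it does not). The textcat is added to the pipeline here so nlp.begin_training() initializes its model:

    import spacy
    from spacy.gold import GoldParse

    nlp = spacy.blank('en')
    textcat = nlp.create_pipe('textcat')
    textcat.add_label('COLOR')
    textcat.add_label('ANIMAL')
    nlp.add_pipe(textcat)
    optimizer = nlp.begin_training()

    doc1 = nlp.make_doc('The door is brown.')
    doc2 = nlp.make_doc('The horse is running.')
    gold1 = GoldParse(doc1, cats={'COLOR': 1.0, 'ANIMAL': 0.0})
    gold2 = GoldParse(doc2, cats={'COLOR': 0.0, 'ANIMAL': 1.0})

    losses = {}
    textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
    print(losses)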