spacy

Similarity between two lists of documents

Submitted by 老子叫甜甜 on 2020-01-25 08:57:06
Question: I need to find the similarity between two lists of short texts in Python. Texts can be 1-4 words long, and each list can contain around 10K entries. I couldn't find an efficient way to do this in spaCy. Maybe other packages can do this? I assume the words are represented by a vector (300d), but any other options are also OK. This task can be done in a loop, but there should surely be a more efficient way. This task fits TensorFlow, PyTorch, and similar packages, but I'm not familiar with
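A minimal sketch of the vectorized approach (assuming a model with word vectors such as en_core_web_md, which the question does not name): embed every short text once, then compute all pairwise cosine similarities as a single matrix product instead of a 10K x 10K Python loop.

import numpy as np
import spacy

# Assumption: a model with 300-d word vectors is installed.
nlp = spacy.load("en_core_web_md")

def embed(texts):
    # Doc.vector averages spaCy's token vectors into one vector per text.
    vecs = np.array([nlp(t).vector for t in texts])
    # Normalize rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-8, None)

list_a = ["machine learning", "deep neural network"]
list_b = ["statistical learning", "convolutional network"]

# All pairwise cosine similarities in one matrix multiplication.
sim = embed(list_a) @ embed(list_b).T
print(sim.shape)  # (len(list_a), len(list_b))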

How to detokenize spacy text without doc context?

Submitted by 夙愿已清 on 2020-01-24 18:03:08
Question: I have a sequence-to-sequence model trained on tokens produced by spaCy's tokenization, for both the encoder and the decoder. The output is a stream of tokens from the seq2seq model, and I want to detokenize it to form natural text. Example: Input to Seq2Seq: Some text. Output from Seq2Seq: This does n't work . Is there any API in spaCy to reverse the tokenization done by the rules in its tokenizer? Answer 1: TL;DR: I've written code that attempts to do it; the snippet is below. Another approach, with a
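spaCy has no built-in detokenizer, so a heuristic sketch in the spirit of the answer is to rejoin tokens with spaces and then repair punctuation and contractions with regular expressions (the rules below are illustrative, not exhaustive):

import re

def detokenize(tokens):
    # Rough inverse of spaCy-style tokenization; a heuristic sketch,
    # not an official spaCy API.
    text = " ".join(tokens)
    # Re-attach punctuation that the tokenizer splits off.
    text = re.sub(r" ([.,:;!?%)\]])", r"\1", text)
    text = re.sub(r"([(\[]) ", r"\1", text)
    # Re-attach contractions such as "does n't" -> "doesn't".
    text = re.sub(r" (n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text

print(detokenize(["This", "does", "n't", "work", "."]))
# This doesn't work.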

spacy sentence tokenization error on Hebrew

Submitted by 时光总嘲笑我的痴心妄想 on 2020-01-24 13:33:51
Question: Trying to use spaCy sentence tokenization for Hebrew. import spacy nlp = spacy.load('he') doc = nlp(text) sents = list(doc.sents) I get: Warning: no model found for 'he' Only loading the 'he' tokenizer. Traceback (most recent call last): ... sents = list(doc.sents) File "spacy/tokens/doc.pyx", line 438, in __get__ (spacy/tokens/doc.cpp:9707) raise ValueError( ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the
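A common workaround, sketched below against the spaCy v2 API the traceback suggests: since 'he' ships no statistical model, skip the dependency-based splitter and add the rule-based sentencizer, which sets sentence boundaries on punctuation alone.

import spacy

# No statistical Hebrew model, so use the rule-based sentencizer
# instead of the parser-based sentence boundaries (spaCy v2 API).
nlp = spacy.blank("he")
nlp.add_pipe(nlp.create_pipe("sentencizer"))

doc = nlp("שלום עולם. מה שלומך?")
sents = list(doc.sents)
print(len(sents))  # 2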

ImportError [E048] Can't import language en from spacy.lang

Submitted by ぃ、小莉子 on 2020-01-24 12:54:45
Question: I am trying to get 'en' running for the spaCy library, which took a lot of debugging to install; I finally got it to import in Python. The next step was to load 'en', and I spent a lot of time debugging why I couldn't load the files in any scenario. # In Python: these libraries load fine. import spacy import ujson import en_core_web_sm On the command line (Linux), I used the command below to download 'en' for spaCy: python -m spacy download en I got the success message "You
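If the download succeeded but spacy.load('en') still raises E048, the usual workaround is to load the installed package directly rather than through the 'en' shortcut link (a sketch, assuming en_core_web_sm is what the download installed):

# Load the downloaded package directly, bypassing the 'en' shortcut:
import en_core_web_sm
nlp = en_core_web_sm.load()

# Equivalent, using the full package name with spacy.load:
import spacy
nlp = spacy.load("en_core_web_sm")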

How could spacy tokenize hashtag as a whole?

Submitted by 喜夏-厌秋 on 2020-01-24 10:06:14
Question: In a sentence containing hashtags, such as a tweet, spaCy's tokenizer splits hashtags into two tokens: import spacy nlp = spacy.load('en') doc = nlp(u'This is a #sentence.') [t for t in doc] output: [This, is, a, #, sentence, .] I'd like to have hashtags tokenized as such: [This, is, a, #sentence, .] Is that possible? Thanks Answer 1: You can do some pre- and post-processing string manipulation, which lets you bypass the '#'-based tokenization and is easy to implement, e.g. >>> import re >>> import
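Besides string manipulation, a post-processing sketch that merges each '#' token with the word that follows it via the retokenizer (spaCy has no built-in hashtag rule; this assumes a v2-era API with doc.retokenize):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a #sentence.")

# Merge every '#' token with the next token; merges are deferred
# until the context manager exits, so iterating here is safe.
with doc.retokenize() as retokenizer:
    for token in doc[:-1]:
        if token.text == "#":
            retokenizer.merge(doc[token.i : token.i + 2])

print([t.text for t in doc])  # ['This', 'is', 'a', '#sentence', '.']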

Is 100 training examples sufficient for training custom NER using spacy? [closed]

Submitted by 主宰稳场 on 2020-01-16 15:39:26
Question: [Closed as needing more focus.] I have trained an NER model on names data. I generated some random sentences that contain person names, about 70 of them, and annotated the data in spaCy's format. I trained a custom NER using both a blank 'en' model and 'en_core_web_sm', but when I tested on any
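For reference, a minimal sketch of the spaCy v2 training format and update loop the question describes; the two annotated examples below are made up for illustration:

import random
import spacy

# Hypothetical examples in spaCy v2's training format:
# (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Alice moved to Berlin.", {"entities": [(0, 5, "PERSON")]}),
    ("I met Bob yesterday.", {"entities": [(6, 9, "PERSON")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("PERSON")

optimizer = nlp.begin_training()
for epoch in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(epoch, losses)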

How to extract tag attributes using Spacy

Submitted by 浪子不回头ぞ on 2020-01-15 07:20:38
Question: I tried to get the morphological attributes of a verb using spaCy, like below: import spacy from spacy.lang.it.examples import sentences nlp = spacy.load('it_core_news_sm') doc = nlp('Ti è piaciuto il film?') token = doc[2] nlp.vocab.morphology.tag_map[token.tag_] The output was: {'pos': 'VERB'} But I want to extract V__Mood=Cnd|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB} Is it possible to extract the mood, tense, number, and person information as specified in the tag map https://github
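In the Italian v2 model, the fine-grained token.tag_ string itself often carries the feature string after a double underscore; a sketch that parses it into a dict (assuming tags of the form 'V__Mood=...|Tense=...', which depends on the model version):

import spacy

nlp = spacy.load("it_core_news_sm")
doc = nlp("Ti è piaciuto il film?")
token = doc[2]  # "piaciuto"

# Split "V__Mood=Cnd|Number=Plur|..." into a feature dict.
tag = token.tag_
feats = {}
if "__" in tag:
    pos_part, _, feat_str = tag.partition("__")
    feats = dict(f.split("=", 1) for f in feat_str.split("|") if "=" in f)

print(tag)
print(feats.get("Mood"), feats.get("Tense"), feats.get("Number"), feats.get("Person"))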

Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus

Submitted by 烈酒焚心 on 2020-01-14 10:16:06
Question: I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that. It works for me, but what I don't like about the resulting word2vec model is that named entities are split, which makes the model unusable for my specific application. The model I need has to represent named entities as a single vector. That's why I planned to parse the Wikipedia articles with spaCy and merge entities like
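A sketch of the plan described above, using spaCy v2's built-in merge_entities pipeline component so multi-word entities reach gensim as single tokens (joining with underscores is an illustrative convention, not something gensim requires):

import spacy

nlp = spacy.load("en_core_web_sm")
# Built-in v2 component that merges each recognized entity
# into a single token, so "New York" becomes one unit.
nlp.add_pipe(nlp.create_pipe("merge_entities"))

doc = nlp("Apple is looking at buying a U.K. startup in New York.")
# Underscore-join multi-word entities so word2vec sees one vocab item.
tokens = [t.text.replace(" ", "_") for t in doc]
print(tokens)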

Why do my lists become strings after saving to csv and re-opening? Python

Submitted by ▼魔方 西西 on 2020-01-14 05:34:07
Question: I have a DataFrame in which each row contains a sentence followed by a list of part-of-speech tags, created with spaCy: df.head() question POS_tags 0 A title for my ... [DT, NN, IN,...] 1 If one of the ... [IN, CD, IN,...] When I write the DataFrame to a CSV file (encoding='utf-8') and re-open it, it looks like the data format has changed, with the POS tags now appearing between quotes ' ' like this: df.head() question POS_tags 0 A title for my ... ['DT', 'NN', 'IN',...] 1 If one of the ... [
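CSV stores every cell as text, so the lists come back as their string representations. A common fix is to parse them back with ast.literal_eval when reading (the filename below is hypothetical):

import ast
import pandas as pd

df = pd.read_csv("questions.csv", encoding="utf-8")
# Parse the stringified lists back into real Python lists.
df["POS_tags"] = df["POS_tags"].apply(ast.literal_eval)

# Or do it at read time with a column converter:
df = pd.read_csv("questions.csv", encoding="utf-8",
                 converters={"POS_tags": ast.literal_eval})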