spacy

Sentence Segmentation using Spacy

拥有回忆 submitted on 2020-01-13 05:17:06
Question: I am new to spaCy and NLP, and I am facing the issue below while doing sentence segmentation with spaCy. The text I am trying to tokenise into sentences contains numbered lists (with a space between the numbering and the actual text), like below. import spacy nlp = spacy.load('en_core_web_sm') text = "This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!" text_sentences = nlp(text) for sentence in text_sentences.sents: print(sentence.text) Output (1.,2.,3. are
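One spaCy-independent workaround for text like this is to split on newlines before calling nlp(), so each numbered item stays its own unit. A minimal sketch (the variable names and the list-numbering regex are illustrative assumptions, not from the question):

```python
import re

text = ("This is first sentence.\nNext is numbered list.\n"
        "1. Hello World!\n2. Hello World2!\n3. Hello World!")

# Split on newlines first so each numbered item stays a separate unit;
# each line could then be passed to nlp() individually.
lines = [line for line in text.split("\n") if line.strip()]

# Strip the leading "1. ", "2. ", ... numbering if only the sentence
# text is wanted (the regex assumes this exact list format).
sentences = [re.sub(r"^\d+\.\s+", "", line) for line in lines]
```

Splitting first avoids relying on the parser's sentence-boundary decisions around list numbering.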

How to install models/download packages on Google Colab?

丶灬走出姿态 submitted on 2020-01-13 05:13:59
Question: I am using the text-analytics library spaCy. I've installed spaCy on a Google Colab notebook without any issue, but to use it I need to download the "en" model. Generally, that command should look like this: python -m spacy download en I tried a few ways but I am not able to get it to install on the notebook. Looking for help. Cheers Answer 1: If you have a Python interpreter but not a terminal, you could try: import spacy.cli spacy.cli.download("en_core_web_sm") More manual alternatives can be found
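For completeness, the shell form also works inside a Colab cell; the leading "!" runs the line as a shell command (if the model is still not found afterwards, restarting the runtime is usually needed):

```shell
!python -m spacy download en_core_web_sm
```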

Spacy - Tokenize quoted string

僤鯓⒐⒋嵵緔 submitted on 2020-01-12 20:53:40
Question: I am using spaCy 2.0 with a quoted string as input. Example string: "The quoted text 'AA XX' should be tokenized", and I expect to extract [The, quoted, text, 'AA XX', should, be, tokenized]. However, I get some strange results while experimenting: noun chunks and ents lose one of the quotes. import spacy nlp = spacy.load('en') s = "The quoted text 'AA XX' should be tokenized" doc = nlp(s) print([t for t in doc]) print([t for t in doc.noun_chunks]) print([t for t in doc.ents]) Result [The,
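In spaCy the usual fix is to locate the quoted span by character offsets and merge it into a single token (via doc.char_span() and doc.retokenize()). The offset-finding step can be sketched without spaCy; the regex assumes simple, non-nested single quotes:

```python
import re

s = "The quoted text 'AA XX' should be tokenized"

# Locate quoted spans by character offsets; in spaCy these offsets
# would be mapped to a Span with doc.char_span(start, end) and merged
# inside a "with doc.retokenize() as retok:" block.
spans = [(m.start(), m.end()) for m in re.finditer(r"'[^']*'", s)]
quoted = [s[a:b] for a, b in spans]
```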

How does spacy use word embeddings for Named Entity Recognition (NER)?

我只是一个虾纸丫 submitted on 2020-01-11 18:54:27
Question: I'm trying to train an NER model using spaCy to identify locations, (person) names, and organisations. I'm trying to understand how spaCy recognises entities in text, and I've not been able to find an answer. From this issue on GitHub and this example, it appears that spaCy uses a number of features present in the text, such as POS tags, prefixes, suffixes, and other character- and word-based features, to train an Averaged Perceptron. However, nowhere in the code does it appear that
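The character- and word-based features mentioned above can be sketched as a plain feature-extraction function. This is an illustration of the general idea, not spaCy's internal code; the feature names are made up:

```python
def token_features(word):
    # Prefix/suffix and shape features of the kind described above;
    # a real NER featurizer also uses context words and POS tags.
    return {
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
    }
```

A perceptron-style model would learn a weight per (feature, entity-label) pair from dictionaries like this.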

How to improve accuracy of Rasa NLU while using Spacy as pipeline?

风格不统一 submitted on 2020-01-06 05:00:07
Question: The spaCy documentation mentions that it uses vector similarity in featurization, and hence in classification. For example, if we test a sentence that is not in the training data but has the same meaning, it should be classified into the same intent as the training sentences. But that's not happening. Let's say the training data is like this: ## intent: delete_event - delete event - delete all events - delete all events of friday - delete ... Now if I test remove event then it is
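When "remove event" is not matched to delete_event, a first diagnostic is to check how similar the two phrases' vectors actually are; cosine similarity over word vectors is the usual measure. A minimal sketch with toy vectors (standing in for e.g. nlp("delete event").vector, not Rasa's internals):

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for the spaCy document vectors of
# "delete event" and "remove event".
delete_vec = [1.0, 0.2, 0.0]
remove_vec = [0.9, 0.3, 0.1]
similarity = cosine(delete_vec, remove_vec)
```

If the real similarity is low, the spaCy model's vectors (e.g. a small model without true word vectors) are a likely culprit before tuning the classifier.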

How to resolve Misaligned Entity Annotation error in RASA NLU

被刻印的时光 ゝ submitted on 2020-01-05 02:31:08
Question: I am trying to import a LUIS schema model into RASA and train it using the spacy + scikit pipeline. I am using RASA NLU v0.10.4. But when I try to load the LUIS model schema, the ner_crf component throws a Misaligned Entity Annotation warning, although I have tagged the entities correctly in the LUIS model schema. Here is my config file: { "project": "SynonymsExample", "path": "C:\\Users\\xyz\\Desktop\\RASA\\models", "response_log": "C:\\Users\\xyz\\Desktop\\RASA\\logs",
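The warning usually means an entity's character offsets do not line up with token boundaries after tokenization. A simple whitespace-boundary check approximates what ner_crf verifies (a sketch for diagnosing annotations, not Rasa's actual code):

```python
def is_aligned(text, start, end):
    # An entity span is aligned if both ends fall on token edges;
    # whitespace boundaries approximate the tokenizer's behaviour.
    starts_ok = start == 0 or text[start - 1].isspace()
    ends_ok = end == len(text) or text[end].isspace()
    return starts_ok and ends_ok

# "flight" occupies characters 7..13, so this span is aligned;
# shifting the start by one character would misalign it.
aligned = is_aligned("book a flight", 7, 13)
```

Off-by-one offsets introduced during the LUIS-to-RASA conversion are a common source of this warning.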

subject object identification in python

泄露秘密 submitted on 2020-01-02 23:20:31
Question: I want to identify the subjects and objects of a set of sentences. My actual task is to identify cause and effect from a set of review data. I am using the spaCy package to chunk and parse the data, but I am not actually reaching my goal. Is there any way to do so? E.g.: I thought it was the complete set out: subject object I complete set Answer 1: In the simplest way, the dependencies are accessed by token.dep_. Having imported spacy: import spacy nlp = spacy.load('en') parsed_text = nlp(u"I thought it was the
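Building on token.dep_, subjects and objects can be collected by filtering on dependency labels. Here the parse is simulated with (text, dep) pairs so the sketch runs without a model; the labels are spaCy-style (nsubj, dobj, attr, ...) and the exact parse for this sentence is an assumption:

```python
# Simulated output of [(t.text, t.dep_) for t in nlp("I thought it was the complete set")]
parsed = [("I", "nsubj"), ("thought", "ROOT"), ("it", "nsubj"),
          ("was", "ccomp"), ("the", "det"), ("complete", "amod"),
          ("set", "attr")]

# Subjects carry a subject label; object-like roles vary by verb,
# so several labels are checked.
subjects = [text for text, dep in parsed if dep == "nsubj"]
objects = [text for text, dep in parsed if dep in ("dobj", "attr", "pobj")]
```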

Older versions of spaCy throw a "KeyError: 'package'" error when trying to install a model

南楼画角 submitted on 2020-01-02 08:05:29
Question: I use spaCy 1.6.0 on Ubuntu 14.04.4 LTS x64 with Python 3.5. To install the English model of spaCy, I tried to run python3.5 -m spacy.en.download, which gives me the error message: ubun@ner-3:~/NeuroNER-master/src$ python3.5 -m spacy.en.download Downloading parsing model Traceback (most recent call last): File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "__main__", mod_spec) File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.5/dist-packages/spacy
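If staying on spaCy 1.6.0 is not a hard requirement, one hedged workaround is to move to a newer spaCy, where models are installed as ordinary pip packages rather than through the old spacy.en.download mechanism:

```shell
pip install -U spacy
python -m spacy download en_core_web_sm
```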

spaCy - Tokenization of Hyphenated words

随声附和 submitted on 2019-12-31 03:43:28
Question: Good day SO, I am trying to post-process hyphenated words that are tokenized into separate tokens when they should be a single token. For example: Sentence: "up-scaled" Tokens: ['up', '-', 'scaled'] Expected: ['up-scaled'] For now, my solution is to use the matcher: matcher = Matcher(nlp.vocab) pattern = [{'IS_ALPHA': True, 'IS_SPACE': False}, {'ORTH': '-'}, {'IS_ALPHA': True, 'IS_SPACE': False}] matcher.add('HYPHENATED', None, pattern) def quote_merger(doc): # this will be
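The same merge can be sketched without the Matcher as a post-processing pass over the token texts (a spaCy-free illustration of the idea; in spaCy itself the matched span would be merged with doc.retokenize() so token attributes stay consistent):

```python
def merge_hyphenated(tokens):
    # Rejoin alpha, "-", alpha triples that the tokenizer split apart.
    out = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens) and tokens[i + 1] == "-"
                and tokens[i].isalpha() and tokens[i + 2].isalpha()):
            out.append(tokens[i] + "-" + tokens[i + 2])
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out

merged = merge_hyphenated(["up", "-", "scaled"])
```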