tokenize

NLTK tokenizer and Stanford CoreNLP tokenizer cannot distinguish two sentences when there is no space after the period (.)

Submitted by 不羁岁月 on 2021-02-09 08:16:00
Question: I have two sentences in my dataset:

    w1 = I am Pusheen the cat.I am so cute.   # no space after the period
    w2 = I am Pusheen the cat. I am so cute.   # with a space after the period

When I use the NLTK tokenizer (both word and sentence), NLTK cannot split between "cat." and "I". Here is the word tokenization:

    >>> nltk.word_tokenize(w1, 'english')
    ['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
    >>> nltk.word_tokenize(w2, 'english')
    ['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']

and the sentence tokenization >>
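A common workaround, independent of which tokenizer is used, is to repair the missing space before tokenizing. The sketch below is plain standard-library Python (not part of NLTK) and inserts a space after any period that is glued to a following uppercase letter; it is a heuristic only, since it would also fire inside abbreviations such as "U.S.".

```python
import re

w1 = 'I am Pusheen the cat.I am so cute.'

# insert a space after a period immediately followed by a capital letter;
# heuristic only -- it also splits abbreviations like "U.S."
fixed = re.sub(r'\.(?=[A-Z])', '. ', w1)
print(fixed)  # I am Pusheen the cat. I am so cute.
```

After this repair, `nltk.word_tokenize(fixed)` and `sent_tokenize(fixed)` behave the same as they do for w2.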

Modify NLTK word_tokenize to prevent tokenization of parentheses

Submitted by 巧了我就是萌 on 2021-02-08 07:32:48
Question: I have the following main.py:

    #!/usr/bin/env python
    # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
    import nltk
    import string
    import sys
    for token in nltk.word_tokenize(''.join(sys.stdin.readlines())):
        #print token
        if len(token) == 1 and not token in string.punctuation or len(token) > 1:
            print token

The output is the following:

    ./main.py <<< 'EGR1(-/-) mouse embryonic fibroblasts'
    EGR1
    -/-
    mouse
    embryonic
    fibroblasts

I want to slightly change the tokenizer so
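One way to keep parenthesized fragments such as (-/-) together is to tokenize with a regular expression instead of nltk.word_tokenize. The sketch below uses only the standard re module (it does not modify NLTK's tokenizer itself); NLTK's RegexpTokenizer accepts the same kind of pattern.

```python
import re

text = 'EGR1(-/-) mouse embryonic fibroblasts'

# a parenthesized group is one token; otherwise take runs of
# characters that are neither whitespace nor parentheses
tokens = re.findall(r'\([^)]*\)|[^\s()]+', text)
print(tokens)  # ['EGR1', '(-/-)', 'mouse', 'embryonic', 'fibroblasts']
```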

NLTK Sentence Tokenizer, custom sentence starters

Submitted by 拈花ヽ惹草 on 2021-02-08 05:29:23
Question: I'm trying to split a text into sentences with the PunktSentenceTokenizer from nltk. The text contains listings starting with bullet points, but they are not recognized as new sentences. I tried to add some parameters, but that didn't work. Is there another way? Here is some example code:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
    params = PunktParameters()
    params.sent_starters = set(['•'])
    tokenizer = PunktSentenceTokenizer(params)
    tokenizer.tokenize('• I am a
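If setting sent_starters alone does not have the desired effect, a pragmatic alternative is to pre-split the text at the bullet characters and only then run the sentence tokenizer on each chunk. The sketch below is plain standard-library Python (it requires Python 3.7+, where re.split accepts zero-width patterns) and stands in for, rather than configures, PunktSentenceTokenizer; the example text is hypothetical.

```python
import re

text = '• first bullet item • second bullet item. And a closing sentence.'

# split immediately before each bullet so the bullet stays with its chunk
chunks = [c.strip() for c in re.split(r'(?=•)', text) if c.strip()]
print(chunks)
# ['• first bullet item', '• second bullet item. And a closing sentence.']
```

Each chunk can then be passed to the sentence tokenizer individually.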

How do I split a word's letters into an Array in C#?

Submitted by 天大地大妈咪最大 on 2021-02-07 13:36:44
Question: How do I split a string into an array of characters in C#? For example, with the string "robot", the program should print out:

    r
    o
    b
    o
    t

The original code snippet:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Diagnostics;
    using System.IO;
    using System.Text.RegularExpressions;

    namespace Testing
    {
        class Program
        {
            static void Main(string[] args)
            {
                String word = "robot";
                String[] token = word.Split(); // Something should be placed into the () to

Detect exact word positions in text in JavaScript

Submitted by 一世执手 on 2021-02-05 11:40:47
Question: I have a text in which some words may repeat. I have to detect the occurrences of each word, like:

    {
      "index": 10,
      "word": "soul",
      "characterOffsetBegin": 1606,
      "characterOffsetEnd": 1609
    }

I have implemented this approach, which partially works:

    var seen = new Map();
    tokens.forEach(token => { // for each token
        let item = { "word": token }
        var pattern = "\\b($1)\\b";
        var wordRegex = new RegExp(pattern.replace('$1', token), "g");
        // calculate token begin end
        var match = null;
        while ((match =
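The offset-tracking idea is language-agnostic; here it is sketched in Python with re.finditer (in JavaScript, RegExp.prototype.exec with the g flag plus match.index plays the same role). One detail worth noting from the question's code: building the pattern by string replacement breaks for tokens containing regex metacharacters, which re.escape (or an equivalent escape helper in JS) avoids. The text and token are hypothetical.

```python
import re

text = 'the soul of the soulful soul'
token = 'soul'

# re.escape guards tokens that contain regex metacharacters
word_regex = re.compile(r'\b' + re.escape(token) + r'\b')

occurrences = [
    {'word': m.group(0),
     'characterOffsetBegin': m.start(),
     'characterOffsetEnd': m.end()}
    for m in word_regex.finditer(text)
]
# two matches, at offsets 4-8 and 24-28; 'soulful' is skipped by the \b boundaries
```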

How to split a string (using regex?) depending on digit / not digit

Submitted by 99封情书 on 2021-02-05 09:26:34
Question: I want to split a string into a list in Python, depending on digit / not digit. For example, 5 55+6+ 5/ should return:

    ['5', '55', '+', '6', '+', '5', '/']

I have some code at the moment which loops through the characters in the string and tests them using re.match("\d") or ("\D"). I was wondering if there was a better way of doing this. P.S.: it must be compatible with Python 2.4.

Answer 1: Assuming the + between 6 and 5 needs to be matched (which you're missing),

    >>> import re
    >>> s = '5 55+6+ 5/'
    >>> re
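The character-by-character loop can indeed be replaced by a single regular expression: match either a run of digits or any single non-space character. The sketch below uses only the standard re module, whose findall also exists in Python 2.4 (only the print call here is Python 3 syntax).

```python
import re

s = '5 55+6+ 5/'

# a run of digits, or any single non-space character
tokens = re.findall(r'\d+|\S', s)
print(tokens)  # ['5', '55', '+', '6', '+', '5', '/']
```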

How to find the lemmas and frequency count of each word in a list of sentences?

Submitted by 醉酒当歌 on 2021-01-28 12:43:52
Question: I want to find the lemmas using the WordNet lemmatizer, and I also need to compute the frequency of each word. I am getting the following error. The trace is as follows:

    TypeError: unhashable type: 'list'

Note: the corpus is available in the nltk package itself. What I have tried so far is as follows:

    import nltk, re
    import string
    from collections import Counter
    from string import punctuation
    from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize
    from nltk.corpus import gutenberg,
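The unhashable type: 'list' error typically means a list of token lists (one list per sentence) was handed to something that needs hashable items, such as collections.Counter. The sketch below reproduces the fix without the lemmatizer, using only the standard library; the sentences value is a hypothetical stand-in for the real tokenized corpus.

```python
from collections import Counter

# stand-in for a tokenized corpus: a list of lists of words
sentences = [['the', 'cats', 'are', 'cute'], ['the', 'cat', 'sleeps']]

# Counter(sentences) would raise TypeError: unhashable type: 'list';
# flatten to a single list of strings before counting
words = [w for sent in sentences for w in sent]
freq = Counter(words)
print(freq['the'])  # 2
```

Each string in `words` could then be run through the WordNet lemmatizer before counting.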

Spacy tokenizer, add tokenizer exception

Submitted by 早过忘川 on 2021-01-28 09:55:04
Question: Hey! I am trying to add an exception when tokenizing some tokens using spaCy 2.02. I know that .tokenizer.add_special_case() exists, and I am using it for some cases, but for a token like US$100, spaCy splits it into two tokens:

    ('US$', 'SYM'), ('100', 'NUM')

Instead of adding a special case for each number after US$, I want to make an exception for every token that has the form US$NUMBER, splitting it into three:

    ('US', 'PROPN'), ('$', 'SYM'), ('800', 'NUM')

I was reading
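add_special_case works on exact token strings, so it cannot generalize over an open-ended set like US$<number>; within spaCy, the usual route for that is customizing the tokenizer's prefix/infix rules rather than special cases. Independent of the spaCy API, the target three-way split itself can be sketched with a plain regex (standard-library Python, not spaCy code):

```python
import re

token = 'US$100'

# letters, the dollar sign, and a digit run each become their own piece
parts = re.findall(r'[A-Za-z]+|\$|\d+', token)
print(parts)  # ['US', '$', '100']
```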