tokenize

Java NLP: Extracting Indices When Tokenizing Text

六眼飞鱼酱① Submitted on 2021-02-20 04:54:46
Question: When tokenizing a string of text, I need to extract the indexes of the tokenized words. For example, given:

    "Mary didn't kiss John"

I would need something like:

    [(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)]

where 0, 5, 8, 12 and 17 correspond to the index (in the original string) where each token begins. I cannot rely on whitespace alone, since some words become 2 tokens. Further, I cannot just search for the token in the string, since the word will likely appear multiple times. One …
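One way to get these offsets, sketched here in Python (the question asks for Java, but the technique carries over directly): tokenize first, then walk the original string with a moving search offset, so a word that appears multiple times still resolves to the right position. The use of nltk.word_tokenize is an assumption; any tokenizer that splits "didn't" into "did" / "n't" behaves the same way.

```python
import nltk  # assumes the NLTK 'punkt' data is installed

def tokens_with_offsets(text):
    """Pair each token with the index where it starts in the original text."""
    result = []
    pos = 0  # moving search offset: repeated words map to the next occurrence
    for tok in nltk.word_tokenize(text):
        start = text.find(tok, pos)
        if start == -1:
            continue  # some tokenizers rewrite tokens (e.g. quote characters)
        result.append((tok, start))
        pos = start + len(tok)
    return result

print(tokens_with_offsets("Mary didn't kiss John"))
# [('Mary', 0), ('did', 5), ("n't", 8), ('kiss', 12), ('John', 17)]
```

In Java, Stanford CoreNLP exposes this directly: each CoreLabel token carries beginPosition() and endPosition() character offsets into the original string.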

Passing a pandas dataframe column to an NLTK tokenizer

纵然是瞬间 Submitted on 2021-02-18 12:59:15
Question: I have a pandas DataFrame raw_df with 2 columns, ID and sentences. I need to convert each sentence to a string. The code below produces no errors and reports the dtype as "object":

    raw_df['sentences'] = raw_df.sentences.astype(str)
    raw_df.sentences.dtypes
    Out: dtype('O')

Then I try to tokenize the sentences and get a TypeError saying the method expects a string or bytes-like object. What am I doing wrong?

    raw_sentences = tokenizer.tokenize(raw_df)

I get the same TypeError for raw_sentences = nltk …
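The TypeError comes from passing the whole DataFrame to tokenize(), which expects a single string. Applying the tokenizer row by row fixes it. A minimal sketch, assuming the goal is NLTK word tokens (the asker's tokenizer object may differ):

```python
import pandas as pd
import nltk

raw_df = pd.DataFrame({"ID": [1, 2],
                       "sentences": ["I am Pusheen the cat.", "I am so cute."]})

# tokenize() takes one string at a time, so apply it to each row:
raw_df["tokens"] = raw_df["sentences"].astype(str).apply(nltk.word_tokenize)

print(raw_df["tokens"].iloc[0])
# ['I', 'am', 'Pusheen', 'the', 'cat', '.']
```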

How to tokenize markdown using Node.js?

ⅰ亾dé卋堺 Submitted on 2021-02-18 10:43:12
Question: I'm building an iOS app that has a view whose content comes from markdown. My idea is to parse markdown stored in MongoDB into a JSON object that looks something like:

    {
      "h1": "This is the heading",
      "p": "Here's the first paragraph",
      "link": {
        "text": "Text for link",
        "url": "http://exampledomain.com"
      }
    }

On the server I am running Node.js, and I was looking at the module marked, which seems to be the most popular one out there. It gives me access to the Lexer, which is …
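marked's lexer is the right starting point: marked.lexer(markdownString) returns an array of typed tokens (heading, paragraph, and so on) that can be reshaped into JSON. As a parallel sketch in Python (the language of the other examples on this page), the markdown-it-py library exposes the same lexer-style token stream; treat this as an illustration of the approach, not the marked API itself:

```python
from markdown_it import MarkdownIt  # pip install markdown-it-py

md = MarkdownIt()
tokens = md.parse("# This is the heading\n\nHere's the first paragraph")

# Each token carries a type (heading_open, inline, paragraph_open, ...) and
# its text content -- the raw material for building a JSON structure.
for tok in tokens:
    print(tok.type, repr(tok.content))
```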

C: strtok delivers segmentation fault

跟風遠走 Submitted on 2021-02-17 06:38:06
Question: I am trying to read a file line by line and tokenize each line, which contains strings separated by spaces and tabs. However, when I run my program, I get a segmentation fault when I try to print the token. I don't understand why this is happening, since I am using a buffer as the string to tokenize and checking whether the token is null. Below is my code:

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_LINE_LENGTH 70

    int main(void) {
        FILE *testFile;
        char buf[MAX_LINE_LENGTH];
        testFile = …

How do I split a string on different delimiters, but keep some of those delimiters in the output? (Tokenize a string)

萝らか妹 Submitted on 2021-02-16 20:53:51
Question: More specifically, I want to split a string on any non-alphanumeric character, but when the delimiter is not whitespace I want to keep it. That is, for the input:

    my_string = 'Hey, I\'m 9/11 7-11'

I want to get:

    ['Hey', ',', 'I', "'", 'm', '9', '/', '11', '7', '-', '11']

with no whitespace as a list element. I have tried the following:

    re.split('([/\'\-_,.;])|\s', my_string)

but it outputs:

    ['Hey', ',', '', None, 'I', "'", 'm', None, '9', '/', '11', None, '7', '-', '11']
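re.split() emits None for whichever alternative did not match and empty strings between adjacent matches, which is where the stray entries come from. One fix is to filter those out; a cleaner one is to match the tokens themselves with re.findall instead of splitting on the delimiters. A minimal sketch:

```python
import re

my_string = "Hey, I'm 9/11 7-11"

# Match either a run of alphanumerics or a single non-space, non-alphanumeric
# character; whitespace is simply never matched, so it never appears.
tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", my_string)

print(tokens)
# ['Hey', ',', 'I', "'", 'm', '9', '/', '11', '7', '-', '11']
```

Filtering the original split result also works: [t for t in re.split(r"([^\sA-Za-z0-9])|\s", my_string) if t] drops both the None and the '' entries.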

How to add punctuation marks to sentences?

試著忘記壹切 Submitted on 2021-02-16 15:39:06
Question: How should I approach the problem of building a punctuation predictor? A working demo for the question can be found at this link. The input text is as below:

    "its been a little while Kirk tells me its actually been three weeks now that Ive been using this device right here that is of course the Galaxy S ten I mean Ive just been living with this phone this has been my phone has the SIM card in it I took photos I lived live I sent tweets whatsapp slack email whatever other app this was my smart phone"
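Punctuation restoration is usually framed as sequence labeling: strip the punctuation from ordinary punctuated text, label each word with the punctuation mark that follows it ("O" for none), and train a tagger on those pairs. A minimal sketch of that data-preparation step (the function name and label scheme are illustrative, not from any particular library):

```python
import re

def to_training_pairs(punctuated_text):
    """Turn punctuated text into (word, following-punctuation) pairs."""
    pairs = []
    for match in re.finditer(r"(\w+)([.,?!]?)", punctuated_text):
        word, punct = match.group(1), match.group(2)
        pairs.append((word.lower(), punct if punct else "O"))
    return pairs

print(to_training_pairs("Hello, world. How are you?"))
# [('hello', ','), ('world', '.'), ('how', 'O'), ('are', 'O'), ('you', '?')]
```

A sequence model trained on such labels (an LSTM or transformer tagger, for example) can then predict punctuation for unpunctuated input like the transcript above.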

NLTK tokenizer and Stanford CoreNLP tokenizer cannot distinguish 2 sentences without a space after the period (.)

老子叫甜甜 Submitted on 2021-02-09 08:21:00
Question: I have 2 sentences in my dataset:

    w1 = "I am Pusheen the cat.I am so cute."   # no space after period
    w2 = "I am Pusheen the cat. I am so cute."   # with space after period

When I use the NLTK tokenizer (both word and sent), NLTK cannot split the two sentences at "cat.I". Here is word tokenization:

    >>> nltk.word_tokenize(w1, 'english')
    ['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
    >>> nltk.word_tokenize(w2, 'english')
    ['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']

and here is sent tokenize: …
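Punkt-style tokenizers assume a space after sentence-final punctuation, so the usual workaround is a light preprocessing pass that restores the missing space before tokenizing. A minimal sketch; the lowercase-period-uppercase heuristic is an assumption and will also split tokens like "example.Com", so adjust it to fit the data:

```python
import re
import nltk

w1 = "I am Pusheen the cat.I am so cute."

# Insert a space after a period squeezed between a lowercase and an
# uppercase letter, then tokenize as usual.
fixed = re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", w1)

print(nltk.sent_tokenize(fixed))
# ['I am Pusheen the cat.', 'I am so cute.']
print(nltk.word_tokenize(fixed))
# ['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute', '.']
```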
