tokenize

Java NLP: Extracting Indices When Tokenizing Text

六眼飞鱼酱① Submitted on 2021-02-20 04:54:46
Question: When tokenizing a string of text, I need to extract the indexes of the tokenized words. For example, given:

    "Mary didn't kiss John"

I would need something like:

    [(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)]

where 0, 5, 8, 12 and 17 correspond to the index (in the original string) where each token begins. I cannot rely on whitespace alone, since some words become 2 tokens. Further, I cannot just search for the token in the string, since the word will likely appear multiple times. One …
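One way to get these offsets, sketched here in Python (the question asks for Java, but the technique carries over directly): tokenize first, then walk the original string with a moving search offset, so a word that appears multiple times still resolves to the right position. The use of nltk.word_tokenize is an assumption; any tokenizer that splits "didn't" into "did" / "n't" behaves the same way.

```python
import nltk  # assumes the NLTK 'punkt' data is installed

def tokens_with_offsets(text):
    """Pair each token with the index where it starts in the original text."""
    result = []
    pos = 0  # moving search offset: repeated words map to the next occurrence
    for tok in nltk.word_tokenize(text):
        start = text.find(tok, pos)
        if start == -1:
            continue  # some tokenizers rewrite tokens (e.g. quote characters)
        result.append((tok, start))
        pos = start + len(tok)
    return result

print(tokens_with_offsets("Mary didn't kiss John"))
# [('Mary', 0), ('did', 5), ("n't", 8), ('kiss', 12), ('John', 17)]
```

In Java, Stanford CoreNLP exposes this directly: each CoreLabel token carries beginPosition() and endPosition() character offsets into the original string.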

Passing a pandas dataframe column to an NLTK tokenizer

纵然是瞬间 Submitted on 2021-02-18 12:59:15
Question: I have a pandas DataFrame raw_df with 2 columns, ID and sentences. I need to convert each sentence to a string. The code below produces no errors and reports the dtype as "object":

    raw_df['sentences'] = raw_df.sentences.astype(str)
    raw_df.sentences.dtypes
    Out: dtype('O')

Then I try to tokenize the sentences and get a TypeError saying the method expects a string or bytes-like object. What am I doing wrong?

    raw_sentences = tokenizer.tokenize(raw_df)

I get the same TypeError for raw_sentences = nltk …
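The TypeError comes from passing the whole DataFrame to tokenize(), which expects a single string. Applying the tokenizer row by row fixes it. A minimal sketch, assuming the goal is NLTK word tokens (the asker's tokenizer object may differ):

```python
import pandas as pd
import nltk

raw_df = pd.DataFrame({"ID": [1, 2],
                       "sentences": ["I am Pusheen the cat.", "I am so cute."]})

# tokenize() takes one string at a time, so apply it to each row:
raw_df["tokens"] = raw_df["sentences"].astype(str).apply(nltk.word_tokenize)

print(raw_df["tokens"].iloc[0])
# ['I', 'am', 'Pusheen', 'the', 'cat', '.']
```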

How to tokenize markdown using Node.js?

ⅰ亾dé卋堺 Submitted on 2021-02-18 10:43:12
Question: I'm building an iOS app that has a view whose content comes from markdown. My idea is to parse markdown stored in MongoDB into a JSON object that looks something like:

    {
      "h1": "This is the heading",
      "p": "Here's the first paragraph",
      "link": {
        "text": "Text for link",
        "url": "http://exampledomain.com"
      }
    }

On the server I am running Node.js, and I was looking at the module marked, which seems to be the most popular one out there. It gives me access to the Lexer, which is …
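marked's lexer is the right starting point: marked.lexer(markdownString) returns an array of typed tokens (heading, paragraph, and so on) that can be reshaped into JSON. As a parallel sketch in Python (the language of the other examples on this page), the markdown-it-py library exposes the same lexer-style token stream; treat this as an illustration of the approach, not the marked API itself:

```python
from markdown_it import MarkdownIt  # pip install markdown-it-py

md = MarkdownIt()
tokens = md.parse("# This is the heading\n\nHere's the first paragraph")

# Each token carries a type (heading_open, inline, paragraph_open, ...) and
# its text content -- the raw material for building a JSON structure.
for tok in tokens:
    print(tok.type, repr(tok.content))
```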

C: strtok delivers segmentation fault

跟風遠走 Submitted on 2021-02-17 06:38:06
Question: I am trying to read a file line by line and tokenize each line, which contains strings separated by spaces and tabs. However, when I run my program, I get a segmentation fault when I try to print the token. I don't understand why this is happening, since I am using a buffer as the string to tokenize and checking whether the token is null. Below is my code:

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_LINE_LENGTH 70

    int main(void) {
        FILE *testFile;
        char buf[MAX_LINE_LENGTH];
        testFile = …

How do I split a string on different delimiters, but keep some of those delimiters in the output? (Tokenize a string)

萝らか妹 Submitted on 2021-02-16 20:53:51
Question: More specifically, I want to split a string on any non-alphanumeric character, but when the delimiter is not whitespace I want to keep it. That is, for the input:

    my_string = 'Hey, I\'m 9/11 7-11'

I want to get:

    ['Hey', ',', 'I', "'", 'm', '9', '/', '11', '7', '-', '11']

with no whitespace as a list element. I have tried the following:

    re.split('([/\'\-_,.;])|\s', my_string)

but it outputs:

    ['Hey', ',', '', None, 'I', "'", 'm', None, '9', '/', '11', None, '7', '-', '11']
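re.split() emits None for whichever alternative did not match and empty strings between adjacent matches, which is where the stray entries come from. One fix is to filter those out; a cleaner one is to match the tokens themselves with re.findall instead of splitting on the delimiters. A minimal sketch:

```python
import re

my_string = "Hey, I'm 9/11 7-11"

# Match either a run of alphanumerics or a single non-space, non-alphanumeric
# character; whitespace is simply never matched, so it never appears.
tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", my_string)

print(tokens)
# ['Hey', ',', 'I', "'", 'm', '9', '/', '11', '7', '-', '11']
```

Filtering the original split result also works: [t for t in re.split(r"([^\sA-Za-z0-9])|\s", my_string) if t] drops both the None and the '' entries.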

How to add punctuation marks to sentences?

試著忘記壹切 Submitted on 2021-02-16 15:39:06
Question: How should I approach the problem of building a punctuation predictor? A working demo for the question can be found at this link. The input text is as below:

    "its been a little while Kirk tells me its actually been three weeks now that Ive been using this device right here that is of course the Galaxy S ten I mean Ive just been living with this phone this has been my phone has the SIM card in it I took photos I lived live I sent tweets whatsapp slack email whatever other app this was my smart phone"
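Punctuation restoration is usually framed as sequence labeling: strip the punctuation from ordinary punctuated text, label each word with the punctuation mark that follows it ("O" for none), and train a tagger on those pairs. A minimal sketch of that data-preparation step (the function name and label scheme are illustrative, not from any particular library):

```python
import re

def to_training_pairs(punctuated_text):
    """Turn punctuated text into (word, following-punctuation) pairs."""
    pairs = []
    for match in re.finditer(r"(\w+)([.,?!]?)", punctuated_text):
        word, punct = match.group(1), match.group(2)
        pairs.append((word.lower(), punct if punct else "O"))
    return pairs

print(to_training_pairs("Hello, world. How are you?"))
# [('hello', ','), ('world', '.'), ('how', 'O'), ('are', 'O'), ('you', '?')]
```

A sequence model trained on such labels (an LSTM or transformer tagger, for example) can then predict punctuation for unpunctuated input like the transcript above.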

NLTK tokenizer and Stanford CoreNLP tokenizer cannot distinguish 2 sentences without a space after the period (.)

老子叫甜甜 Submitted on 2021-02-09 08:21:00
Question: I have 2 sentences in my dataset:

    w1 = "I am Pusheen the cat.I am so cute."   # no space after period
    w2 = "I am Pusheen the cat. I am so cute."   # with space after period

When I use the NLTK tokenizer (both word and sent), NLTK cannot split the two sentences at "cat.I". Here is word tokenization:

    >>> nltk.word_tokenize(w1, 'english')
    ['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
    >>> nltk.word_tokenize(w2, 'english')
    ['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']

and here is sent tokenize: …
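Punkt-style tokenizers assume a space after sentence-final punctuation, so the usual workaround is a light preprocessing pass that restores the missing space before tokenizing. A minimal sketch; the lowercase-period-uppercase heuristic is an assumption and will also split tokens like "example.Com", so adjust it to fit the data:

```python
import re
import nltk

w1 = "I am Pusheen the cat.I am so cute."

# Insert a space after a period squeezed between a lowercase and an
# uppercase letter, then tokenize as usual.
fixed = re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", w1)

print(nltk.sent_tokenize(fixed))
# ['I am Pusheen the cat.', 'I am so cute.']
print(nltk.word_tokenize(fixed))
# ['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute', '.']
```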
