How to apply NLTK's word_tokenize to a Pandas DataFrame of Twitter data?

Asked 2020-12-01 13:35 by 半阙折子戏

This is the code I am using for semantic analysis of Twitter data:

import pandas as pd
import datetime
import numpy as np
import re
from nltk.tokenize import word_tokenize

1 Answer
  • Answered 2020-12-01 14:00

    In short:

    df['Text'].apply(word_tokenize)
    

    Or if you want to add another column to store the tokenized list of strings:

    df['tokenized_text'] = df['Text'].apply(word_tokenize) 
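
    For example, here is a minimal end-to-end sketch (the sample tweets and the 'Text' column name are made up for illustration; note that word_tokenize also needs NLTK's punkt models downloaded once):

    import nltk
    import pandas as pd
    from nltk.tokenize import word_tokenize

    nltk.download('punkt', quiet=True)  # tokenizer models, needed only once

    # Hypothetical sample data; any string column works the same way.
    df = pd.DataFrame({'Text': ["I love NLTK!", "Tokenize this tweet, please."]})
    df['tokenized_text'] = df['Text'].apply(word_tokenize)
    print(df['tokenized_text'][0])
    # ['I', 'love', 'NLTK', '!']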
    

    There are tokenizers written specifically for Twitter text; see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual

    To use nltk.tokenize.TweetTokenizer:

    from nltk.tokenize import TweetTokenizer
    tt = TweetTokenizer()
    df['Text'].apply(tt.tokenize)
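
    As a rough illustration of the difference, using the sample string from the NLTK casual-tokenizer docs, TweetTokenizer keeps hashtags and emoticons intact, where word_tokenize would split them into separate punctuation tokens:

    from nltk.tokenize import TweetTokenizer

    s = "This is a cooool #dummysmiley: :-) :-P <3"
    tt = TweetTokenizer()
    print(tt.tokenize(s))
    # ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3']

    TweetTokenizer also accepts strip_handles=True (drop @mentions) and reduce_len=True (shorten character runs like "waaaaayyyy" to "waaayyy"), which are often useful for noisy tweets.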
    

    Similar to:

    • How to apply pos_tag_sents() to pandas dataframe efficiently

    • how to use word_tokenize in data frame

    • Tokenizing words into a new column in a pandas dataframe

    • Run nltk sent_tokenize through Pandas dataframe

    • Python text processing: NLTK and pandas
