Question
I am trying to perform sentiment analysis on a large set of data from a social network. The code works well with small amounts of data.
Input smaller than 20 MB is processed without any problem, but if the size is more than 20 MB I get a memory error.
Environment: Windows 10, Anaconda 3.x with up-to-date packages.
Code:
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def captionsenti(F_name):
    # "path" is assumed to be defined elsewhere in the script
    print("reading from csv file")
    F1_name = "caption_senti.csv"
    df = pd.read_csv(path + F_name + ".csv")
    filename = path + F_name + "_" + F1_name
    df1 = df['tweetText']   # reading captions from the data file
    df1 = df1.fillna("h")   # filling NaN values
    df2 = pd.DataFrame()
    sid = SentimentIntensityAnalyzer()
    print("calculating sentiment")
    for sentence in df1:
        # print(sentence)
        ss = sid.polarity_scores(sentence)   # calculating sentiments
        # print(ss)
        df2 = df2.append(pd.DataFrame({'tweetText': sentence, 'positive': ss['pos'],
                                       'negative': ss['neg'], 'neutral': ss['neu'],
                                       'compound': ss['compound']}, index=[0]))
    df2 = df2.join(df.set_index('tweetText'), on='tweetText')   # joining the two data frames
    df2 = df2.drop_duplicates(subset=None, keep='first', inplace=False)
    df2 = df2.dropna(how='any')
    df2 = df2[['userID', 'tweetSource', 'tweetText', 'positive', 'neutral', 'negative',
               'compound', 'latitude', 'longitude']]
    # print(df2)
    print("Storing in csv file")
    df2.to_csv(filename, encoding='utf-8', header=True, index=True, chunksize=100)
What extra do I need to include to avoid the memory error? Thanks for the help in advance.
Answer 1:
You don't need anything extra, you need less. Why do you load all the tweets into memory at once? If you just deal with one tweet at a time, you can process terabytes of data with less memory than you'll find in a bottom-end smartphone.
import csv
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()   # same analyzer as in the question

reader = csv.DictReader(open(F1_name))
fieldnames = ["TweetText", "positive", "negative", ...]   # plus the rest of your columns
writer = csv.DictWriter(open(output_filename, "w"), fieldnames=fieldnames)
writer.writeheader()
for row in reader:
    sentence = row["TweetText"]
    ss = sid.polarity_scores(sentence)
    row['positive'] = ss['pos']
    row['negative'] = ss['neg']
    # etc. for the remaining scores
    writer.writerow(row)
Or something like that. I didn't bother to close your filehandles, but you should. There are all sorts of tweaks and adjustments you can make, but the point is: There's no reason to blow up your memory when you're analyzing one tweet at a time.
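One such tweak, if you would rather stay in pandas: pd.read_csv can return an iterator of chunks via its chunksize parameter, so you can score and write one chunk at a time instead of the whole file. A rough sketch under the question's column names (the file names and the chunk size here are placeholders, not the asker's actual paths):
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

first_chunk = True
# Read the CSV 10,000 rows at a time instead of loading it all at once.
for chunk in pd.read_csv("input.csv", chunksize=10000):
    chunk['tweetText'] = chunk['tweetText'].fillna("h")
    scores = chunk['tweetText'].apply(sid.polarity_scores)   # one score dict per row
    chunk['positive'] = scores.apply(lambda s: s['pos'])
    chunk['negative'] = scores.apply(lambda s: s['neg'])
    chunk['neutral'] = scores.apply(lambda s: s['neu'])
    chunk['compound'] = scores.apply(lambda s: s['compound'])
    # Append each scored chunk to the output file; write the header only once.
    chunk.to_csv("output.csv", mode='a', header=first_chunk, index=False, encoding='utf-8')
    first_chunk = False
Only one chunk is ever held in memory, so the peak usage is bounded by the chunk size rather than the file size.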
Answer 2:
Some general tips that might help you:
1. Load only the columns that you need into memory:
pd.read_csv provides a usecols parameter to specify which columns you want to read:
df = pd.read_csv(path + F_name + ".csv", usecols=['col1', 'col2'])
2. Delete unused variables:
If you no longer need a variable, delete it with del variable_name, as in the sketch below.
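For instance, once only the tweetText column is needed, the full DataFrame can be dropped (big_df and the file name are just illustrative; gc.collect() is optional and merely prompts an earlier cleanup):
import gc
import pandas as pd

big_df = pd.read_csv("input.csv")          # hypothetical large input file
texts = big_df['tweetText'].fillna("h")    # keep only the column we actually need
del big_df                                 # drop the reference to the full DataFrame
gc.collect()                               # nudge the garbage collector to free the memory now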
3. Use a memory profiler:
Profile the memory usage with memory_profiler. Citing the example from its documentation, you get a memory profile like the following:
Line #    Mem usage    Increment   Line Contents
================================================
     3                             @profile
     4      5.97 MB      0.00 MB   def my_func():
     5     13.61 MB      7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB    152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB   -152.59 MB       del b
     8     13.61 MB      0.00 MB       return a
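The report above comes from decorating a function with @profile and running the script through the memory_profiler module; the documentation example that produces it looks roughly like this:
# save as example.py, then run:  python -m memory_profiler example.py
from memory_profiler import profile

@profile
def my_func():
    a = [1] * (10 ** 6)        # ~8 MB list
    b = [2] * (2 * 10 ** 7)    # ~150 MB list
    del b                      # released again, as the negative increment shows
    return a

if __name__ == '__main__':
    my_func()
Decorate your own captionsenti the same way to see which lines account for the memory growth.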
Source: https://stackoverflow.com/questions/46490474/memory-error-performing-sentiment-analysis-large-size-data