Question
I am trying to perform sentiment analysis on a large set of data from a social network. The code works well with small amounts of data.
Input smaller than 20 MB is processed without any problem, but if the size is more than 20 MB I get a memory error.
Environment: Windows 10, Anaconda 3.x with up-to-date packages.
Code:
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def captionsenti(F_name):
    # "path" is assumed to be defined elsewhere in the script
    print("reading from csv file")
    F1_name = "caption_senti.csv"
    df = pd.read_csv(path + F_name + ".csv")
    filename = path + F_name + "_" + F1_name
    df1 = df['tweetText']   # reading captions from the data file
    df1 = df1.fillna("h")   # filling NaN values
    df2 = pd.DataFrame()
    sid = SentimentIntensityAnalyzer()
    print("calculating sentiment")
    for sentence in df1:
        # print(sentence)
        ss = sid.polarity_scores(sentence)   # calculating sentiments
        # print(ss)
        df2 = df2.append(pd.DataFrame({'tweetText': sentence, 'positive': ss['pos'],
                                       'negative': ss['neg'], 'neutral': ss['neu'],
                                       'compound': ss['compound']}, index=[0]))
    df2 = df2.join(df.set_index('tweetText'), on='tweetText')   # joining the two data frames
    df2 = df2.drop_duplicates(subset=None, keep='first', inplace=False)
    df2 = df2.dropna(how='any')
    df2 = df2[['userID', 'tweetSource', 'tweetText', 'positive', 'neutral', 'negative',
               'compound', 'latitude', 'longitude']]
    # print(df2)
    print("Storing in csv file")
    df2.to_csv(filename, encoding='utf-8', header=True, index=True, chunksize=100)
What extra do I need to include to avoid the memory error? Thanks for the help in advance.
Answer 1:
You don't need anything extra, you need less. Why do you load all the tweets into memory at once? If you just deal with one tweet at a time, you can process terabytes of data with less memory than you'll find in a bottom-end smartphone.
import csv
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()   # same analyzer as in the question

reader = csv.DictReader(open(F1_name))
fieldnames = ["TweetText", "positive", "negative", ...]   # plus the rest of your columns
writer = csv.DictWriter(open(output_filename, "w"), fieldnames=fieldnames)
writer.writeheader()
for row in reader:
    sentence = row["TweetText"]
    ss = sid.polarity_scores(sentence)
    row['positive'] = ss['pos']
    row['negative'] = ss['neg']
    # etc. for the remaining scores
    writer.writerow(row)
Or something like that. I didn't bother to close your filehandles, but you should. There are all sorts of tweaks and adjustments you can make, but the point is: There's no reason to blow up your memory when you're analyzing one tweet at a time.
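One such tweak, if you would rather stay in pandas: pd.read_csv can return an iterator of chunks via its chunksize parameter, so you can score and write one chunk at a time instead of the whole file. A rough sketch under the question's column names (the file names and the chunk size here are placeholders, not the asker's actual paths):
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

first_chunk = True
# Read the CSV 10,000 rows at a time instead of loading it all at once.
for chunk in pd.read_csv("input.csv", chunksize=10000):
    chunk['tweetText'] = chunk['tweetText'].fillna("h")
    scores = chunk['tweetText'].apply(sid.polarity_scores)   # one score dict per row
    chunk['positive'] = scores.apply(lambda s: s['pos'])
    chunk['negative'] = scores.apply(lambda s: s['neg'])
    chunk['neutral'] = scores.apply(lambda s: s['neu'])
    chunk['compound'] = scores.apply(lambda s: s['compound'])
    # Append each scored chunk to the output file; write the header only once.
    chunk.to_csv("output.csv", mode='a', header=first_chunk, index=False, encoding='utf-8')
    first_chunk = False
Only one chunk is ever held in memory, so the peak usage is bounded by the chunk size rather than the file size.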
Answer 2:
Some general tips that might help you:
1. Load only the columns that you need into memory:
pd.read_csv provides a usecols parameter to specify which columns you want to read:
df = pd.read_csv(path + F_name + ".csv", usecols=['col1', 'col2'])
2. Delete unused variables:
If you no longer need a variable, delete it with del variable_name, as in the sketch below.
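For instance, once only the tweetText column is needed, the full DataFrame can be dropped (big_df and the file name are just illustrative; gc.collect() is optional and merely prompts an earlier cleanup):
import gc
import pandas as pd

big_df = pd.read_csv("input.csv")          # hypothetical large input file
texts = big_df['tweetText'].fillna("h")    # keep only the column we actually need
del big_df                                 # drop the reference to the full DataFrame
gc.collect()                               # nudge the garbage collector to free the memory now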
3. Use a memory profiler:
Profile the memory usage with memory_profiler. Citing the example from its documentation, you get a memory profile like the following:
Line #    Mem usage    Increment   Line Contents
================================================
     3                             @profile
     4      5.97 MB      0.00 MB   def my_func():
     5     13.61 MB      7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB    152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB   -152.59 MB       del b
     8     13.61 MB      0.00 MB       return a
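The report above comes from decorating a function with @profile and running the script through the memory_profiler module; the documentation example that produces it looks roughly like this:
# save as example.py, then run:  python -m memory_profiler example.py
from memory_profiler import profile

@profile
def my_func():
    a = [1] * (10 ** 6)        # ~8 MB list
    b = [2] * (2 * 10 ** 7)    # ~150 MB list
    del b                      # released again, as the negative increment shows
    return a

if __name__ == '__main__':
    my_func()
Decorate your own captionsenti the same way to see which lines account for the memory growth.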
Source: https://stackoverflow.com/questions/46490474/memory-error-performing-sentiment-analysis-large-size-data