Operations on a very large csv with pandas

Submitted by 人走茶凉 on 2020-11-30 01:40:25

Question


I have been using pandas on csv files to get some values out of them. My data looks like this:

"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"

I have a simple script that reads the csv and computes the frequency of each WORD by group, so the output looks like this:

group freqW1 freqW2
A     1      0
B     1      0
C     0      1

Then I do some other operations on the values. The problem is that I now have to deal with very large csv files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but it returns a TextFileReader ('TextFileReader' object is not subscriptable), so I can't do the necessary operations on the chunks.

I suspect there is some easy way to iterate through the csv and do what I want.

My code is like this:

from collections import Counter
import pandas as pd

df = pd.read_csv("csvfile.txt", sep=",", header=None,
                 names=["group", "val1", "val2", "text"])

# Total rows per group, plus per-group counts of rows containing each word
freq = Counter(df["group"])
word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)

df1.to_csv("csv_out.txt", sep=",", encoding="utf-8")

Answer 1:


You can specify the chunksize option in the read_csv call; read_csv then returns an iterator of DataFrame chunks instead of a single DataFrame. See the pandas documentation on iterating through files chunk by chunk for details.
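
A minimal sketch of that pattern (the chunk size of 10**6 rows is an arbitrary choice here, not something from the question; tune it to your memory budget):

import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
reader = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"],
                     chunksize=10**6)
for chunk in reader:
    print(chunk.shape)  # placeholder: do the per-chunk work here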

Alternatively, you could use Python's built-in csv library to create your own csv reader or DictReader, and then use it to read the data in whatever chunk size you choose, as sketched below.
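
A rough standard-library sketch of that idea (read_in_chunks and the 100,000-row default are illustrative names and values, not part of the original answer):

import csv
from itertools import islice

def read_in_chunks(path, chunk_size=100_000):
    """Yield the file as lists of rows, never holding it all in memory."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk

for rows in read_in_chunks("csvfile.txt"):
    print(len(rows))  # placeholder: aggregate per-group counts here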




Answer 2:


Okay, I misunderstood the chunksize parameter. I solved it by doing this:

from collections import Counter
import pandas as pd

frame = pd.DataFrame()
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"],
                     chunksize=1000000)
for df in chunks:
    # Aggregate each chunk exactly as before
    freq = Counter(df["group"])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1,
                    keys=["freq", "freqW1", "freqW2"])  # name the columns
    # Accumulate into the running totals; fill_value=0 treats groups
    # missing from either side as zero instead of producing NaN
    frame = frame.add(df1, fill_value=0)

# Groups that never matched a word in any chunk are still NaN; zero them
frame = frame.fillna(0)
frame.to_csv("csv_out.txt", sep=",", encoding="utf-8")
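
Because each chunk is reduced to a small per-group summary before being added to frame, peak memory use is bounded by the chunk size rather than by the size of the file.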


Source: https://stackoverflow.com/questions/43706333/operations-on-a-very-large-csv-with-pandas
