Operations on a very large csv with pandas

Submitted by 人走茶凉 on 2020-11-30 01:40:25

Question


I have been using pandas on csv files to get some values out of them. My data looks like this:

"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"

I have a simple script that reads the csv and computes the frequency of each WORD by group, so the output looks like this:

group freqW1 freqW2
A     1      0
B     1      0
C     0      1

Then I do some other operations on the values. The problem is that I now have to deal with very large csv files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but it returns a TextFileReader ('TextFileReader' object is not subscriptable), so I can't do the necessary operations on the chunks.

I suspect there is some easy way to iterate through the csv and do what I want.

My code is like this:

from collections import Counter
import pandas as pd

df = pd.read_csv("csvfile.txt", sep=",", header=None,
                 names=["group", "val1", "val2", "text"])

# Total rows per group, plus per-group counts of rows containing each word
freq = Counter(df["group"])
word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)

df1.to_csv("csv_out.txt", sep=",", encoding="utf-8")

Answer 1:


You can specify the chunksize option in the read_csv call; read_csv then returns an iterator of DataFrame chunks instead of a single DataFrame. See the pandas documentation on iterating through files chunk by chunk for details.
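
A minimal sketch of that pattern (the chunk size of 10**6 rows is an arbitrary choice here, not something from the question; tune it to your memory budget):

import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
reader = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"],
                     chunksize=10**6)
for chunk in reader:
    print(chunk.shape)  # placeholder: do the per-chunk work here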

Alternatively, you could use Python's built-in csv library to create your own csv reader or DictReader, and then use it to read the data in whatever chunk size you choose, as sketched below.
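
A rough standard-library sketch of that idea (read_in_chunks and the 100,000-row default are illustrative names and values, not part of the original answer):

import csv
from itertools import islice

def read_in_chunks(path, chunk_size=100_000):
    """Yield the file as lists of rows, never holding it all in memory."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk

for rows in read_in_chunks("csvfile.txt"):
    print(len(rows))  # placeholder: aggregate per-group counts here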




Answer 2:


Okay, I misunderstood the chunksize parameter. I solved it by doing this:

from collections import Counter
import pandas as pd

frame = pd.DataFrame()
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"],
                     chunksize=1000000)
for df in chunks:
    # Aggregate each chunk exactly as before
    freq = Counter(df["group"])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1,
                    keys=["freq", "freqW1", "freqW2"])  # name the columns
    # Accumulate into the running totals; fill_value=0 treats groups
    # missing from either side as zero instead of producing NaN
    frame = frame.add(df1, fill_value=0)

# Groups that never matched a word in any chunk are still NaN; zero them
frame = frame.fillna(0)
frame.to_csv("csv_out.txt", sep=",", encoding="utf-8")
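
Because each chunk is reduced to a small per-group summary before being added to frame, peak memory use is bounded by the chunk size rather than by the size of the file.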


Source: https://stackoverflow.com/questions/43706333/operations-on-a-very-large-csv-with-pandas
