I am trying to manipulate a large CSV file using Pandas. When I wrote this:
df = pd.read_csv(strFileName, sep='\t', delimiter='\t')
it raises
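What @chrisaycock suggested is the preferred method if you need to sum or count.
If you need to average, it won't work directly, because an average of per-chunk averages does not in general equal the overall average: avg(a,b,c) does not equal avg(avg(a,b),avg(c)). The two agree only when every chunk holds the same number of values.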
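To illustrate why averages cannot simply be combined chunk by chunk, here is a minimal sketch with made-up numbers. The `chunks` list and the `mean` helper are hypothetical and only serve the example. It also shows that carrying a sum and a count per chunk does give the right answer.

```python
# Hypothetical values split into two chunks of unequal size.
chunks = [[1.0, 2.0, 3.0], [10.0]]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Averaging the per-chunk averages weights each chunk equally: (2.0 + 10.0) / 2
mean_of_means = mean([mean(c) for c in chunks])

# Combining per-chunk sums and counts weights each value equally: 16.0 / 4
total = sum(sum(c) for c in chunks)
count = sum(len(c) for c in chunks)
overall_mean = total / count

print(mean_of_means)  # 6.0
print(overall_mean)   # 4.0
```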
I suggest using a map-reduce-like approach, with streaming.
Create a file called map-col.py:
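import sys
# Zero-based column index, taken from the first argument (defaults to 0)
col = int(sys.argv[1]) if len(sys.argv) > 1 else 0
for line in sys.stdin:
    # Strip the newline so the last column does not produce a blank line
    print(line.rstrip('\n').split('\t')[col])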
And a file named reduce-avg.py
import sys
s = 0.0  # running sum of the values
n = 0    # number of values read so far
for line in sys.stdin:
    s = s + float(line)
    n = n + 1
print(s / n)
And in order to run the whole thing:
cat strFileName | python map-col.py | python reduce-avg.py > output.txt
This method will work regardless of the size of the file and will not run out of memory, because each script reads one line at a time and keeps only a running sum and a count.
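If you would rather keep everything in one script, the same idea can be sketched as a single function. This is only an illustration under stated assumptions: `streaming_mean` is a hypothetical name, the data is made up, and it assumes tab-separated rows with no header line.

```python
import io

def streaming_mean(lines, col, sep="\t"):
    """Return the mean of column `col`, reading one line at a time."""
    total = 0.0
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        total += float(line.split(sep)[col])
        count += 1
    if count == 0:
        raise ValueError("no data rows")
    return total / count

# Usage with an in-memory stand-in for the file (made-up data, no header row).
sample = io.StringIO("1\t10\n2\t20\n3\t30\n")
print(streaming_mean(sample, col=1))  # 20.0
```

For a real file you would pass an open file object instead of the `StringIO`. Iterating over a file object yields one line at a time, so memory use stays flat.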