Question
I know how to use pandas to read files with a CSV extension, but when reading a large file I get an out-of-memory error. The file has 3.8 million rows and 6.4 million columns, mostly genome data for large populations.
How can I overcome this problem, and what is the standard practice for selecting the appropriate tool? Can I process a file this big with pandas, or is there another tool?
Answer 1:
You can use Apache Spark to distribute in-memory processing of CSV files (https://github.com/databricks/spark-csv). Also take a look at ADAM's approach for distributed genomic data processing.
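For illustration, here is a minimal PySpark sketch of reading a large CSV in a distributed way; it assumes PySpark is installed, and the file path is a placeholder. In Spark 2.0+ the CSV reader from the spark-csv package is built in as spark.read.csv.

# Minimal sketch, assuming PySpark is available; "genomes.csv" is an illustrative path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-csv").getOrCreate()

# Spark partitions the file and processes it across executors,
# instead of loading the whole thing into one machine's memory.
df = spark.read.csv("genomes.csv", header=True)

# Example operation: count rows without collecting the dataset to the driver.
print(df.count())

spark.stop()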
Answer 2:
You can use the Python csv module to stream the file row by row:

import csv

with open(filename, "r") as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        # process each row here; you now hold only one row in memory,
        # instead of thousands of lines
        pass
Source: https://stackoverflow.com/questions/33689456/reading-huge-csv-files-efficiently