Question
I know how to use pandas to read files with a CSV extension, but when reading a large file I get an out-of-memory error. The file has 3.8 million rows and 6.4 million columns, mostly genome data for large populations.
How can I overcome this problem, and what is the standard practice for selecting the appropriate tool? Can I process a file this big with pandas, or is there another tool?
Answer 1:
You can use Apache Spark to distribute in-memory processing of CSV files (https://github.com/databricks/spark-csv). Also take a look at ADAM's approach for distributed genomic data processing.
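For illustration, here is a minimal PySpark sketch of reading a large CSV in a distributed way; it assumes PySpark is installed, and the file path is a placeholder. In Spark 2.0+ the CSV reader from the spark-csv package is built in as spark.read.csv.

# Minimal sketch, assuming PySpark is available; "genomes.csv" is an illustrative path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-csv").getOrCreate()

# Spark partitions the file and processes it across executors,
# instead of loading the whole thing into one machine's memory.
df = spark.read.csv("genomes.csv", header=True)

# Example operation: count rows without collecting the dataset to the driver.
print(df.count())

spark.stop()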
Answer 2:
You can use the Python csv module to stream the file row by row:

import csv

with open(filename, "r") as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        # process each row here; you now hold only one row in memory,
        # instead of thousands of lines
        pass
Source: https://stackoverflow.com/questions/33689456/reading-huge-csv-files-efficiently