Reading huge csv files efficiently?

五迷三道 submitted on 2019-12-24 14:38:57

Question


I know how to use pandas to read files with a CSV extension, but when reading a large file I get an out-of-memory error. The file has 3.8 million rows and 6.4 million columns, and it mostly contains genome data from large populations.

How can I overcome this problem, what is standard practice, and how do I select the appropriate tool for this? Can I process a file this big with pandas, or is there another tool?


Answer 1:


You can use Apache Spark to distribute in-memory processing of CSV files: https://github.com/databricks/spark-csv. Also take a look at ADAM's approach to distributed genomic data processing.
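As a rough illustration of the Spark route (not from the original answer; the session setup and file name below are assumptions), a minimal PySpark sketch might look like this:

from pyspark.sql import SparkSession

# Start a local Spark session; in practice you would point this at a cluster.
spark = SparkSession.builder.appName("genome-csv").getOrCreate()

# Spark reads the CSV lazily and splits it into partitions across executors,
# so the whole file never has to fit into a single machine's memory.
df = spark.read.csv("genotypes.csv", header=True)  # hypothetical file name

# Example: count the rows, then look at only the columns you actually need.
print(df.count())
df.select(df.columns[:10]).show(5)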




Answer 2:


You can use Python's built-in csv module to stream the file one row at a time:

import csv

with open(filename, "r", newline="") as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        # process each row here
        # only one row is held in memory at a time, instead of the whole file
        pass
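A hedged sketch of how that streaming loop might be used in practice (the file name and column indices are hypothetical): pull out just the columns you need, so memory stays proportional to a single row.

import csv

# Hypothetical example: keep only a few columns of interest from each row,
# so memory use is bounded by one row plus the values you extract.
wanted = [0, 5, 42]  # hypothetical indices of the columns you actually need

with open("genotypes.csv", "r", newline="") as csvfile:  # hypothetical file name
    reader = csv.reader(csvfile)
    header = next(reader)  # capture (or skip) the header row, if the file has one
    for row in reader:
        selected = [row[i] for i in wanted]
        # ... process `selected` here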


Source: https://stackoverflow.com/questions/33689456/reading-huge-csv-files-efficiently
