Group a huge CSV file in Python

Posted by ﹥>﹥吖頭↗ on 2019-12-24 20:16:28

Question


I have a huge .csv file (over 100 GB) of the form:

| Column1 | Column2 | Column3 | Column4 | Column5             | 
|---------|---------|---------|---------|---------------------| 
| A       | B       | 35      | X       | 2017-12-19 11:28:34 | 
| A       | C       | 22      | Z       | 2017-12-19 11:27:24 | 
| A       | B       | 678     | Y       | 2017-12-19 11:38:36 | 
| C       | A       | 93      | X       | 2017-12-19 11:44:42 | 

I want to summarize it

  • by the unique values in Column1 and Column2
  • with sum(Column3),
  • max(Column5)
  • the value of Column4, where Column5 was at its maximum.

Therefore the above extract should become:

| Column1 | Column2 | sum(Column3) | Column4 | max(Column5)        | 
|---------|---------|--------------|---------|---------------------| 
| A       | B       | 713          | Y       | 2017-12-19 11:38:36 |
| A       | C       | 22           | Z       | 2017-12-19 11:27:24 | 
| C       | A       | 93           | X       | 2017-12-19 11:44:42 |
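
For illustration, the summary described above can be computed in one streaming pass that keeps a single small record per (Column1, Column2) pair. This is only a minimal sketch under several assumptions: the file is comma-separated with a header row, Column3 holds integers, the Column5 timestamps always use the `YYYY-MM-DD HH:MM:SS` layout (so plain string comparison orders them correctly), and the number of distinct key pairs fits in RAM. The function name `summarize` is hypothetical.

```python
import csv
import io


def summarize(rows):
    """Aggregate an iterable of 5-field rows by (Column1, Column2).

    Returns a dict mapping (col1, col2) to
    [sum of Column3, Column4 at the latest Column5, latest Column5].
    """
    groups = {}
    for col1, col2, col3, col4, col5 in rows:
        key = (col1, col2)
        value = int(col3)  # assumption: Column3 is an integer
        state = groups.get(key)
        if state is None:
            groups[key] = [value, col4, col5]
        else:
            state[0] += value
            # Assumption: fixed-width timestamps, so string order == time order.
            if col5 > state[2]:
                state[1], state[2] = col4, col5
    return groups


# Small in-memory stand-in for the real file, using the rows from the extract.
sample = io.StringIO(
    "Column1,Column2,Column3,Column4,Column5\n"
    "A,B,35,X,2017-12-19 11:28:34\n"
    "A,C,22,Z,2017-12-19 11:27:24\n"
    "A,B,678,Y,2017-12-19 11:38:36\n"
    "C,A,93,X,2017-12-19 11:44:42\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
result = summarize(reader)

for (col1, col2), (total, col4, latest) in sorted(result.items()):
    print(col1, col2, total, col4, latest)
```

For the real file, the same function could be fed `csv.reader(open(path, newline=""))`, which reads one row at a time, so memory use depends on the number of distinct key pairs rather than on the file size.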

With these additional considerations:

  • The .csv is not sorted
  • I am running Python on Windows
  • The solution should run on a standalone PC (cloud instances are not acceptable)
  • I have tried Dask and the .compute() step (should it ever complete) will take about a week. Anything faster than this would be a good solution.
  • I am open to all kinds of solutions - splitting the file into chunks, multiprocessing... whatever would work
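
As one possible reading of "splitting the file into chunks", the file could be partitioned by a hash of (Column1, Column2), so that every row for a given key pair lands in the same smaller file and each file can then be summarized independently. The sketch below is an assumption-laden illustration, not a tested solution for a 100 GB input: the function name `partition_by_key`, the bucket file names, and the choice of `zlib.crc32` as the hash are all hypothetical, and the file is assumed to be comma-separated with a header row.

```python
import csv
import os
import tempfile
import zlib


def partition_by_key(src_path, out_dir, n_buckets):
    """Split src_path into n_buckets CSV files (without header) so that all
    rows sharing (Column1, Column2) end up in the same file."""
    paths = [os.path.join(out_dir, f"bucket_{i}.csv") for i in range(n_buckets)]
    files = [open(p, "w", newline="") for p in paths]
    try:
        writers = [csv.writer(f) for f in files]
        with open(src_path, newline="") as src:
            reader = csv.reader(src)
            next(reader)  # skip the header row
            for row in reader:
                # A separator byte keeps ("AB", "C") distinct from ("A", "BC").
                key = (row[0] + "\x1f" + row[1]).encode("utf-8")
                writers[zlib.crc32(key) % n_buckets].writerow(row)
    finally:
        for f in files:
            f.close()
    return paths


# Tiny demonstration on the rows from the extract, using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.csv")
    with open(src_path, "w", newline="") as f:
        f.write(
            "Column1,Column2,Column3,Column4,Column5\n"
            "A,B,35,X,2017-12-19 11:28:34\n"
            "A,C,22,Z,2017-12-19 11:27:24\n"
            "A,B,678,Y,2017-12-19 11:38:36\n"
            "C,A,93,X,2017-12-19 11:44:42\n"
        )

    bucket_paths = partition_by_key(src_path, tmp, n_buckets=2)

    row_counts = []
    keys_per_bucket = []
    for path in bucket_paths:
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        row_counts.append(len(rows))
        keys_per_bucket.append({(r[0], r[1]) for r in rows})

print(row_counts, keys_per_bucket)
```

Because no key pair spans two buckets, each bucket can be aggregated on its own (sequentially or in separate processes) and the per-bucket results simply concatenated. The number of buckets would need to be chosen so that each one fits comfortably in memory.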

Edit 1: I had not used multiprocessing in Dask. Adding it improves the speed significantly (as suggested in one of the comments), but 32 GB of RAM is not enough for this approach to complete.

Edit 2: Dask 0.16.0 is not a possible solution, as it is absolutely broken. After 5 hours of writing partitions to disk, it had written 8 out of 300 partitions; after reporting 7 written, it now reports 4 instead of 8 (without throwing an error).

Source: https://stackoverflow.com/questions/47884199/group-a-huge-csv-file-in-python
