Question
I have a huge .csv file (over 100 GB) of the form:
| Column1 | Column2 | Column3 | Column4 | Column5 |
|---------|---------|---------|---------|---------------------|
| A | B | 35 | X | 2017-12-19 11:28:34 |
| A | C | 22 | Z | 2017-12-19 11:27:24 |
| A | B | 678 | Y | 2017-12-19 11:38:36 |
| C | A | 93 | X | 2017-12-19 11:44:42 |
I want to summarize it:
- by the unique value pairs in Column1 and Column2,
- with sum(Column3),
- max(Column5),
- and the value of Column4 from the row where Column5 is at its maximum.
Therefore the above extract should become:
| Column1 | Column2 | sum(Column3) | Column4 | max(Column5) |
|---------|---------|--------------|---------|---------------------|
| A | B | 713 | Y | 2017-12-19 11:38:36 |
| A | C | 22 | Z | 2017-12-19 11:27:24 |
| C | A | 93 | X | 2017-12-19 11:44:42 |
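To make the aggregation rule concrete, here is a minimal in-memory sketch of the logic on the extract above (assuming pandas; this is only to illustrate the rule, not to scale to the full file):

```python
import pandas as pd

# Small in-memory example of the desired aggregation (not the 100 GB file).
df = pd.DataFrame({
    "Column1": ["A", "A", "A", "C"],
    "Column2": ["B", "C", "B", "A"],
    "Column3": [35, 22, 678, 93],
    "Column4": ["X", "Z", "Y", "X"],
    "Column5": pd.to_datetime([
        "2017-12-19 11:28:34",
        "2017-12-19 11:27:24",
        "2017-12-19 11:38:36",
        "2017-12-19 11:44:42",
    ]),
})

# For each (Column1, Column2) pair: sum Column3, and take Column4 and Column5
# from the row where Column5 is at its maximum.
idx_latest = df.groupby(["Column1", "Column2"])["Column5"].idxmax()
latest = df.loc[idx_latest, ["Column1", "Column2", "Column4", "Column5"]]
sums = df.groupby(["Column1", "Column2"], as_index=False)["Column3"].sum()
result = sums.merge(latest, on=["Column1", "Column2"])
print(result)
```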
With these additional considerations:
- The .csv is not sorted
- I have Python on Windows
- The solution has to run on a standalone PC (cloud instances are not acceptable)
- I have tried Dask, and the .compute() step (should it ever complete) would take about a week. Anything faster than that would be a good solution.
- I am open to all kinds of solutions: splitting the file into chunks, multiprocessing... whatever works (see the sketch after the edits below).
Edit 1: I had not used multiprocessing in Dask. Adding it improves the speed significantly (as suggested in one of the comments), but 32 GB of RAM is not enough for this approach to complete.

Edit 2: Dask 0.16.0 is not a viable solution, as it is completely broken. After 5 hours of writing partitions to disk it has written 8 out of 300 partitions; after reporting that it had written 7, it now reports having written 4 instead of 8 (without throwing an error).
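For reference, a minimal sketch of what the chunked pandas approach could look like (file path, chunk size, and the assumption that the per-group running summary fits in memory are mine):

```python
import pandas as pd

CSV_PATH = "huge.csv"     # placeholder path to the 100 GB file
CHUNK_SIZE = 5_000_000    # rows per chunk; tune to available RAM

def reduce_groups(frame):
    """Collapse a frame to one row per (Column1, Column2) group."""
    idx = frame.groupby(["Column1", "Column2"])["Column5"].idxmax()
    latest = frame.loc[idx, ["Column1", "Column2", "Column4", "Column5"]]
    sums = frame.groupby(["Column1", "Column2"], as_index=False)["Column3"].sum()
    return sums.merge(latest, on=["Column1", "Column2"])

partial = None  # running per-group aggregate; must fit in memory

for chunk in pd.read_csv(CSV_PATH, parse_dates=["Column5"], chunksize=CHUNK_SIZE):
    reduced = reduce_groups(chunk)
    if partial is None:
        partial = reduced
    else:
        # Sums add up across chunks, and re-running the max/argmax step on the
        # combined partial results preserves the Column4 value at max(Column5).
        partial = reduce_groups(pd.concat([partial, reduced], ignore_index=True))

partial.to_csv("summary.csv", index=False)
```

This keeps only one reduced row per (Column1, Column2) pair in memory, so RAM usage depends on the number of distinct pairs rather than on the file size.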
Source: https://stackoverflow.com/questions/47884199/group-a-huge-csv-file-in-python