What is the way to add an index column in Dask when reading from a CSV?

Submitted by 為{幸葍}努か on 2020-01-15 10:33:53

Question


I'm trying to process a fairly large dataset that doesn't fit into memory when loaded all at once with Pandas, so I'm using Dask. However, I'm having difficulty adding a unique ID column to the dataset after reading it with the read_csv method. I keep getting an error (see Code). I want to create an ID column so I can set it as the index for the data, but the error appears to be telling me to set the index first before creating the column.

CODE

import numpy as np
import dask.dataframe as dd

df = dd.read_csv(r'path\to\file\file.csv')  # File does not have a unique ID column, so I have to create one.
df['index_col'] = dd.from_array(np.arange(len(df)))  # Trying to add an index column and fill it
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

Update

Using range(1, len(df) + 1) changed the error to: TypeError: Column assignment doesn't support type range


Answer 1:


Right, it's hard to know the number of lines in each chunk of a CSV file without reading through it, so it's hard to produce an index like 0, 1, 2, 3, ... when the dataset spans multiple partitions.

One approach would be to create a column of ones:

df["idx"] = 1

and then call cumsum:

df["idx"] = df["idx"].cumsum()

But note that this does add a bunch of dependencies to the task graph that backs your dataframe, so some operations might not be as parallel as they were before.



Source: https://stackoverflow.com/questions/58525020/what-is-the-way-to-add-an-index-column-in-dask-when-reading-from-a-csv
