Dask Array from DataFrame

Posted by 不想你离开。 on 2020-02-13 09:20:42

Question


Is there a way to easily convert a dask DataFrame of numeric values into a dask Array, similar to .values on a pandas DataFrame? I can't find a way to do this with the provided API, but I'd assume it's a common operation.


Answer 1:


Edit: yes, now this is trivial

You can use the .values property:

x = df.values

Older, now incorrect answer

At the moment there is no trivial way to do this. dask.array needs to know the length of every one of its chunks, and dask.dataframe does not track the number of rows in each partition. Finding those lengths requires computation, so the conversion cannot be completely lazy.

That being said, you can accomplish it using dask.delayed as follows:

import dask.array as da
from dask import compute

def to_dask_array(df):
    # One delayed pandas object per partition
    partitions = df.to_delayed()
    shapes = [part.values.shape for part in partitions]
    # Use .values.dtype: a pandas DataFrame has no .dtype attribute
    dtype = partitions[0].values.dtype

    results = compute(dtype, *shapes)  # trigger computation to find shape
    dtype, shapes = results[0], results[1:]

    # Wrap each partition as a dask array chunk of known shape
    chunks = [da.from_delayed(part.values, shape, dtype)
              for part, shape in zip(partitions, shapes)]
    return da.concatenate(chunks, axis=0)



Answer 2:


I think there might be another, shorter way.

import dask.array as da
import dask.dataframe as dd

ruta = '...'  # path to the CSV file(s)
df = dd.read_csv(...)
x = df['column you want to transform in array']

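def transf(x):
    # One delayed object per partition
    xd = x.to_delayed()
    # Each partition is computed eagerly here to learn its shape and dtype
    full = [da.from_delayed(i, i.compute().shape, i.compute().dtype) for i in xd]
    return da.concatenate(full)

x_array = transf(x)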

In addition, if you want to convert a Dask DataFrame with N columns, so that each element of the array is itself an array, like this:

array((x1,x2,x3),(y1,y2,y3),....)

you have to change the attribute used in transf

from:

i.compute().dtype 

to

i.compute().dtypes

Thanks



Source: https://stackoverflow.com/questions/37444943/dask-array-from-dataframe
