Reshape a dask array (obtained from a dask dataframe column)

Submitted by 半腔热情 on 2019-12-11 07:56:48

Question


I am new to dask and am trying to reshape a dask array that I've obtained from a single column of a dask dataframe, but I am running into errors. Does anyone know of a fix that doesn't force a compute? Thanks!

Example:

import pandas as pd
import numpy as np
from dask import dataframe as dd, array as da
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
ddf = dd.from_pandas(df, npartitions=2)

# This does not work - error ValueError: cannot convert float NaN to integer
ddf['x'].values.reshape([-1,1])

# this works, but requires a compute
ddf['x'].values.compute().reshape([-1,1])

# this works, if the dask array is created directly from a np array
ar = np.array([1, 2, 3])
dar = da.from_array(ar, chunks=2)
dar.reshape([-1,1])

Answer 1:


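You can also have the partition lengths computed when converting the column to an array, which makes the chunk sizes known: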

ddf['x'].to_dask_array(lengths=True).reshape([-1,1])



Answer 2:


Unfortunately, the length of a dataframe and of its partitions is generally lazy in Dask, and is only computed on explicit request. That means that the array doesn't know its length or chunk sizes either, and so you can't reshape it. The following clunky code gets around this, but I feel there should be a simpler way.

Find the chunks:

chunks = tuple(ddf['x'].map_partitions(len).compute())
size = sum(chunks)

Create a new array object with the now-known chunks and size:

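a = ddf['x'].values
# dtype and shape are passed by keyword so the call does not depend on positional argument order
arr = da.Array(a.dask, a.name, chunks, dtype=a.dtype, shape=(size,))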


Source: https://stackoverflow.com/questions/52212827/reshape-a-dask-array-obtained-from-a-dask-dataframe-column
