Question
I am new to dask and am trying to reshape a dask array obtained from a single column of a dask dataframe, but I am running into errors. Does anyone know of a fix that does not force a compute? Thanks!
Example:
import pandas as pd
import numpy as np
from dask import dataframe as dd, array as da
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
ddf = dd.from_pandas(df, npartitions=2)
# This does not work - error ValueError: cannot convert float NaN to integer
ddf['x'].values.reshape([-1,1])
# this works, but requires a compute
ddf['x'].values.compute().reshape([-1,1])
# this works, if the dask array is created directly from a np array
ar = np.array([1, 2, 3])
dar = da.from_array(ar, chunks=2)
dar.reshape([-1,1])
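Answer 1:
You can also ask dask to compute the partition lengths when converting the column to an array, so that the chunk sizes are known. This computes the lengths, but not the full array: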
ddf['x'].to_dask_array(lengths=True).reshape([-1,1])
Answer 2:
Unfortunately, the length of a dataframe and of its partitions is generally lazy in Dask and is only computed on explicit request. That means the array doesn't know its length or chunk sizes either, so you can't reshape it. The following clunky code gets around this, but I feel there should be a simpler way.
Find the chunks:
chunks = tuple(ddf['x'].map_partitions(len).compute())
size = sum(chunks)
Create a new array object with the now-known chunks and size:
a = ddf['x'].values
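# pass dtype and shape by keyword: newer dask versions accept a meta argument before shape
arr = da.Array(a.dask, a.name, chunks, dtype=a.dtype, shape=(size,))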
Source: https://stackoverflow.com/questions/52212827/reshape-a-dask-array-obtained-from-a-dask-dataframe-column