DASK: TypeError: Column assignment doesn't support type numpy.ndarray whereas Pandas works fine

Question

I'm using Dask to read in a CSV of 10m+ rows and perform some calculations. So far it's proving to be 10x faster than pandas.

I have a piece of code, below, that works fine with pandas but throws a TypeError with Dask, and I am unsure how to overcome it. It seems that np.select hands an array back to the dataframe column when using Dask, but not when using pandas? I don't want to switch the whole thing back to pandas and lose the 10x performance benefit.

This code is the result of help from others on Stack Overflow; however, I think this problem has deviated far enough from the initial question that it is altogether different. Code below.

PANDAS: works. Time taken excluding AndHeathSolRadFact: 40 seconds

import pandas as pd
import numpy as np

from timeit import default_timer as timer
start = timer()
df = pd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
df['DateTime'] = pd.to_datetime(df['Date'], format='%Y-%d-%m %H:%M')
df['Month'] = df['DateTime'].dt.month
df['Grass_FMC'] = (97.7+4.06*df['RH'])/(df['Temperature']+6)-0.00854*df['RH']+3000/df['Curing']-30


df["AndHeathSolRadFact"] = np.select(
    [
    (df['Month'].between(8,12)),
    (df['Month'].between(1,2) & (df['CloudCover']>30))  #parentheses needed: & binds tighter than >
    ],  #list of conditions
    [1, 1],     #list of results
    default=0)    #default if no match



print(df.head())
#print(ddf.tail())
end = timer()
print(end - start)

DASK: broken. Time taken excluding AndHeathSolRadFact: 4 seconds

import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
import pandas as pd
import numpy as np

# Dataframes implement the Pandas API
import dask.dataframe as dd



from timeit import default_timer as timer
start = timer()
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%Y-%d-%m %H:%M')
ddf['Month'] = ddf['DateTime'].dt.month
ddf['Grass_FMC'] = (97.7+4.06*ddf['RH'])/(ddf['Temperature']+6)-0.00854*ddf['RH']+3000/ddf['Curing']-30



ddf["AndHeathSolRadFact"] = np.select(
    [
    (ddf['Month'].between(8,12)),
    (ddf['Month'].between(1,2) & (ddf['CloudCover']>30))  #parentheses needed: & binds tighter than >
    ],  #list of conditions
    [1, 1],     #list of results
    default=0)    #default if no match



print(ddf.head())
#print(ddf.tail())
end = timer()
print(end - start)

Error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-86c08f38bce6> in <module>
     29     ],  #list of conditions
     30     [1, 1],     #list of results
---> 31     default=0)    #default if no match
     32 
     33 

~\Anaconda3\lib\site-packages\dask\dataframe\core.py in __setitem__(self, key, value)
   3276             df = self.assign(**{k: value for k in key})
   3277         else:
-> 3278             df = self.assign(**{key: value})
   3279 
   3280         self.dask = df.dask

~\Anaconda3\lib\site-packages\dask\dataframe\core.py in assign(self, **kwargs)
   3510                 raise TypeError(
   3511                     "Column assignment doesn't support type "
-> 3512                     "{0}".format(typename(type(v)))
   3513                 )
   3514             if callable(v):

TypeError: Column assignment doesn't support type numpy.ndarray

Sample Weathergrids CSV

Location,Date,Temperature,RH,WindDir,WindSpeed,DroughtFactor,Curing,CloudCover
1075,2019-20-09 04:00,6.8,99.3,143.9,5.6,10.0,93.0,1.0 
1075,2019-20-09 05:00,6.4,100.0,93.6,7.2,10.0,93.0,1.0
1075,2019-20-09 06:00,6.7,99.3,130.3,6.9,10.0,93.0,1.0
1075,2019-20-09 07:00,8.6,95.4,68.5,6.3,10.0,93.0,1.0
1075,2019-20-09 08:00,12.2,76.0,86.4,6.1,10.0,93.0,1.0

Answer 1:

This answer isn't elegant but is functional.

I found that np.select was about 20 seconds quicker in pandas on an 11m-row dataset. I also found that even when I called it on Dask columns, the result came back as a NumPy array, which Dask's column assignment does not accept. It is, however, possible to convert dataframes between Dask and pandas.

So, I get the benefit of loading and date transforms in Dask (4 seconds compared to 40 seconds in pandas) and the benefit of np.select in pandas (40 seconds compared to 60 seconds in Dask), and just need to accept that I'll be using more memory.

Little time is lost converting between the two dataframe types.

Finally, I had to delete the dataframes explicitly, as Python wasn't releasing memory between test runs and usage just kept accumulating.

import dask.dataframe as dd  # Dask dataframes implement much of the pandas API
import numpy as np

from timeit import default_timer as timer
start = timer()
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
#print(ddf.describe(include='all'))

#Wrangle the dates so we can interrogate them
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%Y-%d-%m %H:%M')
ddf['Month'] = ddf['DateTime'].dt.month

#Grass Fuel Moisture Content
ddf['Grass_FMC'] = (97.7+4.06*ddf['RH'])/(ddf['Temperature']+6)-0.00854*ddf['RH']+3000/ddf['Curing']-30

#Convert to a pandas DataFrame because dask was being slow with the select logic below
df = ddf.compute()
del ddf

#Solar Radiation Factor - this seems to take 32 seconds. Why?
df["AndHeathSolRadFact"] = np.select(
    [
    (df['Month'].between(8,12)),
    (df['Month'].between(1,2) & (df['CloudCover']>30))  #parentheses needed: & binds tighter than >
    ],  #list of conditions
    [1, 1],     #list of results
    default=0)    #default if no match

#Convert back to a Dask dataframe because we want that juicy parallelism
ddf2 = dd.from_pandas(df, npartitions=4)
del df

print(ddf2.head())
#print(ddf.tail())
end = timer()
print(end - start)

#Clean up remaining dataframes
del ddf2
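
The round trip through pandas is not the only option. As a minimal sketch (not part of the original answer), the same np.select logic could be applied to each partition with map_partitions, so the frame stays a Dask dataframe and the ndarray is only ever assigned to a pandas partition. The helper names add_solar_factor and add_solar_factor_dask and the sample values are hypothetical, and the timings quoted above were not measured for this variant.

```python
import numpy as np
import pandas as pd


def add_solar_factor(pdf: pd.DataFrame) -> pd.DataFrame:
    """Add AndHeathSolRadFact to a single pandas partition."""
    conditions = [
        pdf["Month"].between(8, 12),
        # Parentheses matter: & binds tighter than >
        pdf["Month"].between(1, 2) & (pdf["CloudCover"] > 30),
    ]
    out = pdf.copy()  # leave the incoming partition untouched
    out["AndHeathSolRadFact"] = np.select(conditions, [1, 1], default=0)
    return out


def add_solar_factor_dask(ddf):
    """Lazily apply add_solar_factor to every partition of a Dask dataframe."""
    # With no explicit meta argument, Dask tries to infer the output columns
    # by running the function on a small stand-in frame.
    return ddf.map_partitions(add_solar_factor)


# Check the per-partition logic on a tiny pandas frame (made-up values).
sample = pd.DataFrame(
    {"Month": [9, 1, 1, 5], "CloudCover": [1.0, 45.0, 10.0, 80.0]}
)
result = add_solar_factor(sample)
print(result["AndHeathSolRadFact"].tolist())  # [1, 1, 0, 0]
```

With the frame from dd.read_csv, the call would be ddf = add_solar_factor_dask(ddf); nothing is computed until head() or compute() is called.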

Answer 2:

Can you please try adding .any() or .all() at the end of your np.select() statement?

df["AndHeathSolRadFact"] = np.select(
    [
    (df['Month'].between(8,12)),
    (df['Month'].between(1,2) & (df['CloudCover']>30))
    ],  #list of conditions
    [1, 1],     #list of results
    default=0).all()    #default if no match
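
For anyone trying this suggestion, a small NumPy-only sketch of what the reduction returns may help. The input values are made up; the point is only that .all() (like .any()) turns the per-row array from np.select into a single boolean.

```python
import numpy as np

# Hypothetical condition for three rows
month_in_range = np.array([True, False, False])

# np.select returns one value per row
flags = np.select([month_in_range], [1], default=0)
print(flags.shape)

# .all() reduces that array to one boolean for the whole column
collapsed = flags.all()
print(np.ndim(collapsed))
```

The value assigned to the column would therefore be one scalar rather than a per-row flag.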

Answer 3:

I have an elegant solution for your problem:

pdf = df.compute()  # materialize as a pandas DataFrame and keep a reference to it
pdf['Name of your column'] = the_list_you_want_to_assign_as_column

Source: https://stackoverflow.com/questions/58254236/dask-typerrror-column-assignment-doesnt-support-type-numpy-ndarray-whereas-pa

Tags

python

pandas

numpy

dask