dask

s3fs gzip compression on pandas dataframe

Submitted by 梦想与她 on 2020-01-03 16:48:12
Question: I'm trying to write a dataframe as a CSV file on S3 using the s3fs library and pandas. Despite the documentation, I'm afraid the gzip compression parameter is not working with s3fs. def DfTos3Csv(df, file): with fs.open(file, 'wb') as f: df.to_csv(f, compression='gzip', index=False) This code saves the dataframe as a new object in S3, but as plain CSV, not in gzip format. On the other hand, the read functionality works fine with this compression parameter. def s3CsvToDf(file):
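A possible workaround, not taken from the question: older pandas versions silently ignore compression='gzip' when handed an already-open file object, so the compression can be applied explicitly before writing. This is only a sketch; the function name df_to_s3_csv_gz and the bucket/key are made up, and it assumes an s3fs.S3FileSystem with credentials already configured.

import gzip
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()  # assumes AWS credentials are already configured

def df_to_s3_csv_gz(df, path):
    # pandas ignores compression='gzip' when given an open file object,
    # so serialize to CSV text first and gzip the bytes explicitly.
    csv_bytes = df.to_csv(index=False).encode('utf-8')
    with fs.open(path, 'wb') as f:
        f.write(gzip.compress(csv_bytes))

# Usage (hypothetical bucket/key):
df = pd.DataFrame({'a': [1, 2, 3]})
df_to_s3_csv_gz(df, 'my-bucket/data.csv.gz')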

How to read a compressed (gz) CSV file into a dask Dataframe?

Submitted by 橙三吉。 on 2020-01-03 08:41:34
Question: Is there a way to read a .csv file that is compressed via gz into a dask dataframe? I've tried it directly with import dask.dataframe as dd; df = dd.read_csv("Data.gz"), but I get a unicode error (probably because it is interpreting the compressed bytes). There is a "compression" parameter, but compression="gz" won't work and I can't find any documentation so far. With pandas I can read the file directly without a problem, other than the result blowing up my memory ;-) but if I restrict the
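A minimal sketch of the usual answer: gzip is not a splittable format, so dask needs compression='gzip' together with blocksize=None so the whole file becomes a single partition.

import dask.dataframe as dd

# gzip cannot be split into byte ranges, so disable chunking with
# blocksize=None and tell dask the compression explicitly.
df = dd.read_csv("Data.gz", compression="gzip", blocksize=None)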

multiplication of large arrays in python

Submitted by  ̄綄美尐妖づ on 2020-01-02 23:12:53
Question: I have big arrays to multiply over a large number of iterations. I am training a model with arrays around 1500 elements long, and I will perform 3 multiplications about 1,000,000 times, which takes a long time, almost a week. I found Dask and tried to compare it with the normal numpy way, but I found numpy faster: x = np.arange(2000) start = time.time() y = da.from_array(x, chunks=(100)) for i in range(0, 100): p = y.dot(y) #print(p) print(time.time() - start) print('------------------------------') start
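For context, a hedged sketch of a fairer comparison: the snippet above never calls .compute(), so it mostly times graph construction, and at this array size dask's per-task overhead dominates anyway. The timings below are illustrative only.

import time
import numpy as np
import dask.array as da

x = np.arange(2000)

# Plain NumPy: the whole loop runs in compiled code with no scheduling overhead.
start = time.time()
for _ in range(100):
    p = x.dot(x)
print("numpy:", time.time() - start)

# Dask: compute() is needed to actually execute the graph; graph building and
# scheduling overhead dominates for arrays this small, so it will be slower.
y = da.from_array(x, chunks=100)
start = time.time()
for _ in range(100):
    p = y.dot(y).compute()
print("dask :", time.time() - start)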

Slicing out a few rows from a `dask.DataFrame`

Submitted by 好久不见. on 2020-01-02 20:03:18
Question: Often, when working with a large dask.DataFrame, it would be useful to grab only a few rows on which to test all subsequent operations. Currently, according to Slicing a Dask Dataframe, this is unsupported. I was hoping to then use head to achieve the same (since that command is supported), but that returns a regular pandas DataFrame. I also tried df[:1000], which executes, but generates an output different from what you'd expect from pandas. Is there any way to grab the first 1000 rows
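A sketch of one workaround, assuming a reasonably recent dask version: head() accepts compute=False, which keeps the first rows as a lazy dask DataFrame instead of materializing a pandas one. The example data is made up.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": range(10_000)})
ddf = dd.from_pandas(pdf, npartitions=10)

# compute=False returns a lazy dask DataFrame; npartitions=-1 lets head()
# look across all partitions if the first one has fewer than 1000 rows.
small = ddf.head(1000, npartitions=-1, compute=False)
print(type(small))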

Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement

Submitted by 我只是一个虾纸丫 on 2020-01-02 03:15:10
Question: I would like to add a new column to an existing dask dataframe based on the values of 2 existing columns, which involves a conditional statement for checking nulls: DataFrame definition: import pandas as pd import dask.dataframe as dd df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]}) ddf = dd.from_pandas(df, npartitions=2) Method 1 tried: def funcUpdate(row): if row['y'].isnull(): return row['y'] else: return round((1 + row['x'])/(1 + 1/row['y']), 4) ddf = ddf.assign
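A hedged sketch of one alternative that avoids a row-wise apply: use the vectorised where method. Note that the empty string in the original 'y' column is replaced with np.nan here as an assumption, since "" is not treated as null by isnull().

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [0.2, np.nan, 0.345, 0.40, 0.15]})  # np.nan instead of ""
ddf = dd.from_pandas(df, npartitions=2)

# Keep y where it is null, otherwise apply the formula; the operation is
# fully vectorised, so no meta argument or row-wise apply is needed.
ddf['z'] = ddf['y'].where(
    ddf['y'].isnull(),
    ((1 + ddf['x']) / (1 + 1 / ddf['y'])).round(4)
)
print(ddf.compute())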

Default pip installation of Dask gives “ImportError: No module named toolz”

Submitted by  ̄綄美尐妖づ on 2020-01-02 00:51:26
Question: I installed Dask using pip like this: pip install dask, and when I try to do import dask.dataframe as dd I get the following error message: >>> import dask.dataframe as dd Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/path/to/venv/lib/python2.7/site-packages/dask/__init__.py", line 5, in <module> from .async import get_sync as get File "/path/to/venv/lib/python2.7/site-packages/dask/async.py", line 120, in <module> from toolz import identity ImportError: No
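The usual fix (not quoted from the question) is to install dask with an extras specifier, since the bare package intentionally ships without optional dependencies such as toolz and pandas:

pip install "dask[dataframe]"    # pulls in toolz, partd, pandas, etc.
# or, for everything (array, bag, distributed, diagnostics):
pip install "dask[complete]"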

On Dask DataFrame.apply(), receiving n rows of value 1 before actual rows processed

Submitted by 风格不统一 on 2019-12-30 18:59:49
Question: In the code snippet below, I would expect the logs to print the numbers 0 - 4. I understand that the numbers may not be in that order, as the task would be broken up into a number of parallel operations. Code snippet: from dask import dataframe as dd import numpy as np import pandas as pd df = pd.DataFrame({'A': np.arange(5), 'B': np.arange(5), 'C': np.arange(5)}) ddf = dd.from_pandas(df, npartitions=1) def aggregate(x): print('B val received: ' + str(x.B)) return x ddf.apply(aggregate, axis
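A sketch of the usual explanation: without a meta argument, dask calls the function once on a tiny dummy frame filled with 1s to infer the output schema, which is where the extra rows come from. Supplying meta (here as a column-to-dtype mapping, an assumption about the desired output) skips that probing step.

from dask import dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(5), 'B': np.arange(5), 'C': np.arange(5)})
ddf = dd.from_pandas(df, npartitions=1)

def aggregate(x):
    print('B val received: ' + str(x.B))
    return x

# Passing meta explicitly stops dask from probing the function with a dummy
# frame of 1s, so only the real rows are printed.
ddf.apply(aggregate, axis=1, meta={'A': 'int64', 'B': 'int64', 'C': 'int64'}).compute()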

Dask rolling function by group syntax

Submitted by 这一生的挚爱 on 2019-12-30 11:31:54
Question: I struggled for a while to get the syntax to work for calculating a rolling function by group for a dask dataframe. The documentation is excellent, but in this case does not have an example. The working version I have is as follows, from a csv that contains a text field with User ids and x, y, and z columns: ddf = read_csv('./*.csv') ddf.groupby(ddf.User).x.apply(lambda x: x.rolling(5).mean(), meta=('x', 'f8')).compute() Is this the recommended syntax for rolling functions applied by group
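For reference, a lightly cleaned-up sketch of the same pattern, assuming the files really do contain User, x, y, and z columns; groupby-apply with an explicit meta is one supported way to run per-group rolling windows in dask:

import dask.dataframe as dd

# Assumed layout: CSV files with columns User, x, y, z.
ddf = dd.read_csv('./*.csv')

rolling_x = (
    ddf.groupby('User')
       .x.apply(lambda s: s.rolling(5).mean(), meta=('x', 'f8'))  # meta: (name, dtype)
       .compute()
)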

dask dataframe how to convert column to to_datetime

Submitted by 天涯浪子 on 2019-12-29 04:18:09
Question: I am trying to convert one column of my dataframe to datetime. Following the discussion here https://github.com/dask/dask/issues/863 I tried the following code: import dask.dataframe as dd df['time'].map_partitions(pd.to_datetime, columns='time').compute() But I am getting the following error message: ValueError: Metadata inference failed, please provide `meta` keyword. What exactly should I put under meta? Should I put a dictionary of ALL the columns in df or only of the 'time' column? and
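A hedged sketch of the usual answer: meta only needs to describe the output of the mapped function, a single datetime series, not every column of the frame. The example data below is made up.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'time': ['2019-12-29 04:18:09', '2020-01-03 16:48:12']})
ddf = dd.from_pandas(pdf, npartitions=2)

# meta describes only the result of the mapped call: a series named 'time'
# with datetime64[ns] dtype, not a dictionary of every column.
ddf['time'] = ddf['time'].map_partitions(
    pd.to_datetime, meta=('time', 'datetime64[ns]')
)

# Recent dask versions also provide a direct equivalent:
# ddf['time'] = dd.to_datetime(ddf['time'])
print(ddf.dtypes)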