dask

s3fs gzip compression on pandas dataframe

Submitted by 梦想与她 on 2020-01-03 16:48:12
Question: I'm trying to write a dataframe as a CSV file on S3 using the s3fs library and pandas. Despite the documentation, I'm afraid the gzip compression parameter is not working with s3fs. def DfTos3Csv(df, file): with fs.open(file, 'wb') as f: df.to_csv(f, compression='gzip', index=False) This code saves the dataframe as a new object in S3, but as plain CSV, not in gzip format. On the other hand, the read functionality works fine with this compression parameter. def s3CsvToDf(file):
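A possible workaround, not taken from the question: older pandas versions silently ignore compression='gzip' when handed an already-open file object, so the compression can be applied explicitly before writing. This is only a sketch; the function name df_to_s3_csv_gz and the bucket/key are made up, and it assumes an s3fs.S3FileSystem with credentials already configured.

import gzip
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()  # assumes AWS credentials are already configured

def df_to_s3_csv_gz(df, path):
    # pandas ignores compression='gzip' when given an open file object,
    # so serialize to CSV text first and gzip the bytes explicitly.
    csv_bytes = df.to_csv(index=False).encode('utf-8')
    with fs.open(path, 'wb') as f:
        f.write(gzip.compress(csv_bytes))

# Usage (hypothetical bucket/key):
df = pd.DataFrame({'a': [1, 2, 3]})
df_to_s3_csv_gz(df, 'my-bucket/data.csv.gz')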

How to read a compressed (gz) CSV file into a dask Dataframe?

Submitted by 橙三吉。 on 2020-01-03 08:41:34
Question: Is there a way to read a .csv file that is compressed via gz into a dask dataframe? I've tried it directly with import dask.dataframe as dd; df = dd.read_csv("Data.gz"), but I get a unicode error (probably because it is interpreting the compressed bytes). There is a "compression" parameter, but compression="gz" won't work and I can't find any documentation so far. With pandas I can read the file directly without a problem, other than the result blowing up my memory ;-) but if I restrict the
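A minimal sketch of the usual answer: gzip is not a splittable format, so dask needs compression='gzip' together with blocksize=None so the whole file becomes a single partition.

import dask.dataframe as dd

# gzip cannot be split into byte ranges, so disable chunking with
# blocksize=None and tell dask the compression explicitly.
df = dd.read_csv("Data.gz", compression="gzip", blocksize=None)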

multiplication of large arrays in python

Submitted by  ̄綄美尐妖づ on 2020-01-02 23:12:53
Question: I have big arrays to multiply over a large number of iterations. I am training a model with arrays around 1500 elements long, and I will perform 3 multiplications about 1,000,000 times, which takes a long time, almost a week. I found Dask and tried to compare it with the normal numpy way, but I found numpy faster: x = np.arange(2000) start = time.time() y = da.from_array(x, chunks=(100)) for i in range(0, 100): p = y.dot(y) #print(p) print(time.time() - start) print('------------------------------') start
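For context, a hedged sketch of a fairer comparison: the snippet above never calls .compute(), so it mostly times graph construction, and at this array size dask's per-task overhead dominates anyway. The timings below are illustrative only.

import time
import numpy as np
import dask.array as da

x = np.arange(2000)

# Plain NumPy: the whole loop runs in compiled code with no scheduling overhead.
start = time.time()
for _ in range(100):
    p = x.dot(x)
print("numpy:", time.time() - start)

# Dask: compute() is needed to actually execute the graph; graph building and
# scheduling overhead dominates for arrays this small, so it will be slower.
y = da.from_array(x, chunks=100)
start = time.time()
for _ in range(100):
    p = y.dot(y).compute()
print("dask :", time.time() - start)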

Slicing out a few rows from a `dask.DataFrame`

Submitted by 好久不见. on 2020-01-02 20:03:18
Question: Often, when working with a large dask.DataFrame, it would be useful to grab only a few rows on which to test all subsequent operations. Currently, according to Slicing a Dask Dataframe, this is unsupported. I was hoping to then use head to achieve the same (since that command is supported), but that returns a regular pandas DataFrame. I also tried df[:1000], which executes, but generates an output different from what you'd expect from pandas. Is there any way to grab the first 1000 rows
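A sketch of one workaround, assuming a reasonably recent dask version: head() accepts compute=False, which keeps the first rows as a lazy dask DataFrame instead of materializing a pandas one. The example data is made up.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": range(10_000)})
ddf = dd.from_pandas(pdf, npartitions=10)

# compute=False returns a lazy dask DataFrame; npartitions=-1 lets head()
# look across all partitions if the first one has fewer than 1000 rows.
small = ddf.head(1000, npartitions=-1, compute=False)
print(type(small))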

Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement

Submitted by 我只是一个虾纸丫 on 2020-01-02 03:15:10
Question: I would like to add a new column to an existing dask dataframe based on the values of 2 existing columns, which involves a conditional statement for checking nulls: DataFrame definition: import pandas as pd import dask.dataframe as dd df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]}) ddf = dd.from_pandas(df, npartitions=2) Method 1 tried: def funcUpdate(row): if row['y'].isnull(): return row['y'] else: return round((1 + row['x'])/(1 + 1/row['y']), 4) ddf = ddf.assign
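A hedged sketch of one alternative that avoids a row-wise apply: use the vectorised where method. Note that the empty string in the original 'y' column is replaced with np.nan here as an assumption, since "" is not treated as null by isnull().

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [0.2, np.nan, 0.345, 0.40, 0.15]})  # np.nan instead of ""
ddf = dd.from_pandas(df, npartitions=2)

# Keep y where it is null, otherwise apply the formula; the operation is
# fully vectorised, so no meta argument or row-wise apply is needed.
ddf['z'] = ddf['y'].where(
    ddf['y'].isnull(),
    ((1 + ddf['x']) / (1 + 1 / ddf['y'])).round(4)
)
print(ddf.compute())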

Default pip installation of Dask gives “ImportError: No module named toolz”

Submitted by  ̄綄美尐妖づ on 2020-01-02 00:51:26
Question: I installed Dask using pip like this: pip install dask, and when I try to do import dask.dataframe as dd I get the following error message: >>> import dask.dataframe as dd Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/path/to/venv/lib/python2.7/site-packages/dask/__init__.py", line 5, in <module> from .async import get_sync as get File "/path/to/venv/lib/python2.7/site-packages/dask/async.py", line 120, in <module> from toolz import identity ImportError: No
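The usual fix (not quoted from the question) is to install dask with an extras specifier, since the bare package intentionally ships without optional dependencies such as toolz and pandas:

pip install "dask[dataframe]"    # pulls in toolz, partd, pandas, etc.
# or, for everything (array, bag, distributed, diagnostics):
pip install "dask[complete]"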

On Dask DataFrame.apply(), receiving n rows of value 1 before actual rows processed

Submitted by 风格不统一 on 2019-12-30 18:59:49
Question: In the code snippet below, I would expect the logs to print the numbers 0 - 4. I understand that the numbers may not be in that order, as the task would be broken up into a number of parallel operations. Code snippet: from dask import dataframe as dd import numpy as np import pandas as pd df = pd.DataFrame({'A': np.arange(5), 'B': np.arange(5), 'C': np.arange(5)}) ddf = dd.from_pandas(df, npartitions=1) def aggregate(x): print('B val received: ' + str(x.B)) return x ddf.apply(aggregate, axis
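A sketch of the usual explanation: without a meta argument, dask calls the function once on a tiny dummy frame filled with 1s to infer the output schema, which is where the extra rows come from. Supplying meta (here as a column-to-dtype mapping, an assumption about the desired output) skips that probing step.

from dask import dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(5), 'B': np.arange(5), 'C': np.arange(5)})
ddf = dd.from_pandas(df, npartitions=1)

def aggregate(x):
    print('B val received: ' + str(x.B))
    return x

# Passing meta explicitly stops dask from probing the function with a dummy
# frame of 1s, so only the real rows are printed.
ddf.apply(aggregate, axis=1, meta={'A': 'int64', 'B': 'int64', 'C': 'int64'}).compute()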

Dask rolling function by group syntax

Submitted by 这一生的挚爱 on 2019-12-30 11:31:54
Question: I struggled for a while to get the syntax to work for calculating a rolling function by group for a dask dataframe. The documentation is excellent, but in this case does not have an example. The working version I have is as follows, from a csv that contains a text field with User ids and x, y, and z columns: ddf = read_csv('./*.csv') ddf.groupby(ddf.User).x.apply(lambda x: x.rolling(5).mean(), meta=('x', 'f8')).compute() Is this the recommended syntax for rolling functions applied by group
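For reference, a lightly cleaned-up sketch of the same pattern, assuming the files really do contain User, x, y, and z columns; groupby-apply with an explicit meta is one supported way to run per-group rolling windows in dask:

import dask.dataframe as dd

# Assumed layout: CSV files with columns User, x, y, z.
ddf = dd.read_csv('./*.csv')

rolling_x = (
    ddf.groupby('User')
       .x.apply(lambda s: s.rolling(5).mean(), meta=('x', 'f8'))  # meta: (name, dtype)
       .compute()
)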

dask dataframe how to convert column to to_datetime

Submitted by 天涯浪子 on 2019-12-29 04:18:09
Question: I am trying to convert one column of my dataframe to datetime. Following the discussion here https://github.com/dask/dask/issues/863 I tried the following code: import dask.dataframe as dd df['time'].map_partitions(pd.to_datetime, columns='time').compute() But I am getting the following error message: ValueError: Metadata inference failed, please provide `meta` keyword. What exactly should I put under meta? Should I put a dictionary of ALL the columns in df or only of the 'time' column? and
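A hedged sketch of the usual answer: meta only needs to describe the output of the mapped function, a single datetime series, not every column of the frame. The example data below is made up.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'time': ['2019-12-29 04:18:09', '2020-01-03 16:48:12']})
ddf = dd.from_pandas(pdf, npartitions=2)

# meta describes only the result of the mapped call: a series named 'time'
# with datetime64[ns] dtype, not a dictionary of every column.
ddf['time'] = ddf['time'].map_partitions(
    pd.to_datetime, meta=('time', 'datetime64[ns]')
)

# Recent dask versions also provide a direct equivalent:
# ddf['time'] = dd.to_datetime(ddf['time'])
print(ddf.dtypes)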