dask

How should I get the shape of a dask dataframe?

Submitted by 六眼飞鱼酱① on 2019-12-01 14:27:58
Question: Performing .shape gives me the following error: AttributeError: 'DataFrame' object has no attribute 'shape'. How should I get the shape instead?

Answer 1: You can get the number of columns directly:

len(df.columns)  # this is fast

You can also call len on the dataframe itself, though beware that this will trigger a computation:

len(df)  # this requires a full scan of the data

Dask.dataframe doesn't know how many records are in your data without first reading through all of it.

Answer 2: To get the
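A minimal sketch of the approach from Answer 1; the file name data.csv is a placeholder:

import dask.dataframe as dd

ddf = dd.read_csv("data.csv")   # placeholder file; nothing is read yet (lazy)

n_cols = len(ddf.columns)       # cheap: column names come from the metadata
n_rows = len(ddf)               # expensive: forces a full pass over every partition

print((n_rows, n_cols))         # the equivalent of a pandas df.shape tuple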

Add a value to a column of DASK data-frames imported using csv_read

Submitted by 倖福魔咒の on 2019-12-01 12:38:01
Question: Suppose that five files are imported into Dask using read_csv. To do this, I use this code:

import dask.dataframe as dd
data = dd.read_csv(final_file_list_msg, header=None)

Every file has ten columns. I want to add 1 to the first column of file 1, 2 to the first column of file 2, 3 to the first column of file 3, etc.

Answer: Let's assume that you have several files following this scheme:

dummy/
├── file01.csv
├── file02.csv
├── file03.csv

First we create them via

import os
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask import delayed

fldr = "dummy"
if not os.path
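One possible way to get a per-file offset, sketched under the assumption that the individual file paths are known (the names below are placeholders): read each file as its own dask dataframe, shift its first column with map_partitions, then concatenate.

import dask.dataframe as dd

# Hypothetical file names; one dask dataframe per file so each gets its own offset
files = ["dummy/file01.csv", "dummy/file02.csv", "dummy/file03.csv"]

def add_offset(df, offset):
    # Shift the first column of this partition by a file-specific amount
    df = df.copy()
    df.iloc[:, 0] = df.iloc[:, 0] + offset
    return df

parts = [
    dd.read_csv(path, header=None).map_partitions(add_offset, offset=i)
    for i, path in enumerate(files, start=1)   # 1 for file 1, 2 for file 2, ...
]

data = dd.concat(parts)   # single dask dataframe with the offsets applied lazily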

Dask rolling function by group syntax

Submitted by 别来无恙 on 2019-12-01 10:28:27
Question: I struggled for a while to get the syntax right for calculating a rolling function by group on a dask dataframe. The documentation is excellent, but does not have an example for this case. The working version I have is as follows, from a CSV that contains a text field with user IDs and x, y, and z columns:

ddf = read_csv('./*.csv')
ddf.groupby(ddf.User).x.apply(lambda x: x.rolling(5).mean(), meta=('x', 'f8')).compute()

Is this the recommended syntax for rolling functions applied by group within dask DataFrames, or is there a recommended alternative?

Answer: In order to retain the groups in the
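A self-contained sketch of the pattern from the question, using a small synthetic frame instead of the CSVs (the column names are assumptions):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "User": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# Rolling mean of x within each User group; meta tells dask the output name and dtype
rolled = ddf.groupby("User").x.apply(
    lambda s: s.rolling(2).mean(), meta=("x", "f8")
)
print(rolled.compute())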

duplicate key value violates unique constraint - postgres error when trying to create sql table from dask dataframe

Submitted by 爱⌒轻易说出口 on 2019-12-01 10:12:16
Question: Following on from this question, when I try to create a PostgreSQL table from a dask.dataframe with more than one partition I get the following error:

IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL: Key (typname, typnamespace)=(test1, 2200) already exists.
[SQL: '\nCREATE TABLE test1 (\n\t"A" BIGINT, \n\t"B" BIGINT, \n\t"C" BIGINT, \n\t"D" BIGINT, \n\t"E" BIGINT, \n\t"F" BIGINT, \n\t"G" BIGINT, \n\t"H" BIGINT, \n\t"I" BIGINT, \n\t"J" BIGINT, \n\tidx BIGINT\n)\n\n']

You can recreate the error with the following code:
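The failure happens because every partition issues its own CREATE TABLE when written in parallel. One commonly suggested workaround, sketched below with a placeholder connection string: create the table once from an empty frame, then append the partitions as delayed tasks.

import dask
import dask.dataframe as dd
import pandas as pd
from sqlalchemy import create_engine

uri = "postgresql://user:password@localhost/mydb"   # placeholder connection string
engine = create_engine(uri)

df = pd.DataFrame({"A": range(10), "B": range(10)})
ddf = dd.from_pandas(df, npartitions=4)

# 1. Create the table once, from zero rows, so only a single CREATE TABLE is issued
df.head(0).to_sql("test1", engine, if_exists="replace", index=False)

# 2. Append each partition separately; the appends no longer race to create the table
to_sql = dask.delayed(pd.DataFrame.to_sql)
tasks = [to_sql(part, "test1", engine, if_exists="append", index=False)
         for part in ddf.to_delayed()]
dask.compute(*tasks)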

Avoiding Memory Issues For GroupBy on Large Pandas DataFrame

Submitted by 风流意气都作罢 on 2019-12-01 09:50:30
Update: The pandas df was created like this:

df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])

Creating a dask df from this df looks like this:

df = dd.from_pandas(encoded, 50)

Performing the operation with dask results in no visible progress being made (checking with dask diagnostics):

result = df.groupby('journal_entry').max().reset_index().compute()

Original: I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8. The uint8 columns only hold values of 1 or 0. I am attempting to perform this operation on the
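A small sketch of the same pipeline on synthetic data, assuming the goal is the per-journal_entry maximum of the dummy columns (the column names and values are placeholders):

import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

pdf = pd.DataFrame({
    "journal_entry": [1, 1, 2, 2, 3],
    "account": ["a", "b", "a", "c", "b"],
    "amount": [10.0, 5.0, 7.0, 1.0, 3.0],
})
encoded = pd.get_dummies(pdf, columns=["account"])   # uint8-style indicator columns

ddf = dd.from_pandas(encoded, npartitions=2)

with ProgressBar():   # simple built-in diagnostics for the local schedulers
    result = ddf.groupby("journal_entry").max().reset_index().compute()

print(result)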

out of core 4D image tif storage as hdf5 python

Submitted by 笑着哭i on 2019-12-01 09:36:18
Question: I have 27 GB of 2D tiff files that represent slices of a movie of 3D images. I want to be able to slice this data as if it were a simple 4D numpy array. It looks like dask.array is a good tool for cleanly manipulating the array once it's stored as an HDF5 file. How can I store these files as an HDF5 file in the first place if they do not all fit into memory? I am new to h5py and databases in general. Thanks.

Answer: Edit: use dask.array's imread function. As of dask 0.7.0 you don't need to store your images in HDF5. Use the imread function directly instead:

In [1]: from skimage.io import
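A sketch of the imread route named in the answer, assuming the slices match a glob like "frames/*.tif" (a placeholder path) and that scikit-image and h5py are installed:

import dask.array as da
from dask.array.image import imread

# Lazily stack every 2D tiff slice into one dask array; nothing is read until needed
stack = imread("frames/*.tif")        # placeholder glob; shape ~ (n_files, height, width)

print(stack.shape, stack.dtype)

frame0 = stack[0].compute()           # pull a single slice into memory as a numpy array

# Optionally spill the whole stack to HDF5 for later out-of-core access
da.to_hdf5("movie.h5", "/stack", stack)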

why is dot product in dask slower than in numpy

Submitted by 落爺英雄遲暮 on 2019-12-01 06:51:04
Question: A dot product in dask seems to run much slower than in numpy:

import numpy as np
x_np = np.random.normal(10, 0.1, size=(1000, 100))
y_np = x_np.transpose()
%timeit x_np.dot(y_np)  # 100 loops, best of 3: 7.17 ms per loop

import dask.array as da
x_dask = da.random.normal(10, 0.1, size=(1000, 100), chunks=(5, 5))
y_dask = x_dask.transpose()
%timeit x_dask.dot(y_dask)  # 1 loops, best of 3: 6.56 s per loop

Does anybody know what might be the reason for that? Is there anything I'm missing here?

Answer: Adjust chunk sizes. The answer by @isternberg is correct that you should adjust chunk sizes. A good choice of
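A sketch of the chunk-size fix the answer points to: chunks=(5, 5) splits the 1000x100 array into thousands of tiny blocks, so per-task scheduling overhead dwarfs the arithmetic; larger chunks remove most of that overhead.

import dask.array as da

# chunks=(5, 5) turns the 1000x100 array into 4000 tiny blocks -> overhead dominates
x_small = da.random.normal(10, 0.1, size=(1000, 100), chunks=(5, 5))

# Far fewer, larger blocks; for an array this small a single chunk is reasonable
x_big = da.random.normal(10, 0.1, size=(1000, 100), chunks=(1000, 100))

result = x_big.dot(x_big.transpose()).compute()
print(result.shape)   # (1000, 1000)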
