dask

How should I get the shape of a dask dataframe?

Submitted by 六眼飞鱼酱① on 2019-12-01 14:27:58
Question: Performing .shape gives me the following error: AttributeError: 'DataFrame' object has no attribute 'shape'. How should I get the shape instead?

Answer 1: You can get the number of columns directly:

len(df.columns)  # this is fast

You can also call len on the dataframe itself, though beware that this will trigger a computation:

len(df)  # this requires a full scan of the data

Dask.dataframe doesn't know how many records are in your data without first reading through all of it.

Answer 2: To get the
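A minimal sketch of the approach from Answer 1; the file name data.csv is a placeholder:

import dask.dataframe as dd

ddf = dd.read_csv("data.csv")   # placeholder file; nothing is read yet (lazy)

n_cols = len(ddf.columns)       # cheap: column names come from the metadata
n_rows = len(ddf)               # expensive: forces a full pass over every partition

print((n_rows, n_cols))         # the equivalent of a pandas df.shape tuple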

Add a value to a column of DASK data-frames imported using csv_read

Submitted by 倖福魔咒の on 2019-12-01 12:38:01
Question: Suppose that five files are imported into Dask using read_csv. To do this, I use this code:

import dask.dataframe as dd
data = dd.read_csv(final_file_list_msg, header=None)

Every file has ten columns. I want to add 1 to the first column of file 1, 2 to the first column of file 2, 3 to the first column of file 3, etc.

Answer: Let's assume that you have several files following this scheme:

dummy/
├── file01.csv
├── file02.csv
├── file03.csv

First we create them via

import os
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask import delayed

fldr = "dummy"
if not os.path
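One possible way to get a per-file offset, sketched under the assumption that the individual file paths are known (the names below are placeholders): read each file as its own dask dataframe, shift its first column with map_partitions, then concatenate.

import dask.dataframe as dd

# Hypothetical file names; one dask dataframe per file so each gets its own offset
files = ["dummy/file01.csv", "dummy/file02.csv", "dummy/file03.csv"]

def add_offset(df, offset):
    # Shift the first column of this partition by a file-specific amount
    df = df.copy()
    df.iloc[:, 0] = df.iloc[:, 0] + offset
    return df

parts = [
    dd.read_csv(path, header=None).map_partitions(add_offset, offset=i)
    for i, path in enumerate(files, start=1)   # 1 for file 1, 2 for file 2, ...
]

data = dd.concat(parts)   # single dask dataframe with the offsets applied lazily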

Dask rolling function by group syntax

Submitted by 别来无恙 on 2019-12-01 10:28:27
Question: I struggled for a while to get the syntax right for calculating a rolling function by group on a dask dataframe. The documentation is excellent, but does not have an example for this case. The working version I have is as follows, from a CSV that contains a text field with user IDs and x, y, and z columns:

ddf = read_csv('./*.csv')
ddf.groupby(ddf.User).x.apply(lambda x: x.rolling(5).mean(), meta=('x', 'f8')).compute()

Is this the recommended syntax for rolling functions applied by group within dask DataFrames, or is there a recommended alternative?

Answer: In order to retain the groups in the
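A self-contained sketch of the pattern from the question, using a small synthetic frame instead of the CSVs (the column names are assumptions):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "User": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# Rolling mean of x within each User group; meta tells dask the output name and dtype
rolled = ddf.groupby("User").x.apply(
    lambda s: s.rolling(2).mean(), meta=("x", "f8")
)
print(rolled.compute())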

duplicate key value violates unique constraint - postgres error when trying to create sql table from dask dataframe

Submitted by 爱⌒轻易说出口 on 2019-12-01 10:12:16
Question: Following on from this question, when I try to create a PostgreSQL table from a dask.dataframe with more than one partition I get the following error:

IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL: Key (typname, typnamespace)=(test1, 2200) already exists.
[SQL: '\nCREATE TABLE test1 (\n\t"A" BIGINT, \n\t"B" BIGINT, \n\t"C" BIGINT, \n\t"D" BIGINT, \n\t"E" BIGINT, \n\t"F" BIGINT, \n\t"G" BIGINT, \n\t"H" BIGINT, \n\t"I" BIGINT, \n\t"J" BIGINT, \n\tidx BIGINT\n)\n\n']

You can recreate the error with the following code:
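The failure happens because every partition issues its own CREATE TABLE when written in parallel. One commonly suggested workaround, sketched below with a placeholder connection string: create the table once from an empty frame, then append the partitions as delayed tasks.

import dask
import dask.dataframe as dd
import pandas as pd
from sqlalchemy import create_engine

uri = "postgresql://user:password@localhost/mydb"   # placeholder connection string
engine = create_engine(uri)

df = pd.DataFrame({"A": range(10), "B": range(10)})
ddf = dd.from_pandas(df, npartitions=4)

# 1. Create the table once, from zero rows, so only a single CREATE TABLE is issued
df.head(0).to_sql("test1", engine, if_exists="replace", index=False)

# 2. Append each partition separately; the appends no longer race to create the table
to_sql = dask.delayed(pd.DataFrame.to_sql)
tasks = [to_sql(part, "test1", engine, if_exists="append", index=False)
         for part in ddf.to_delayed()]
dask.compute(*tasks)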

Avoiding Memory Issues For GroupBy on Large Pandas DataFrame

Submitted by 风流意气都作罢 on 2019-12-01 09:50:30
Update: The pandas df was created like this:

df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])

Creating a dask df from this df looks like this:

df = dd.from_pandas(encoded, 50)

Performing the operation with dask results in no visible progress being made (checking with dask diagnostics):

result = df.groupby('journal_entry').max().reset_index().compute()

Original: I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8. The uint8 columns only hold values of 1 or 0. I am attempting to perform this operation on the
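A small sketch of the same pipeline on synthetic data, assuming the goal is the per-journal_entry maximum of the dummy columns (the column names and values are placeholders):

import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

pdf = pd.DataFrame({
    "journal_entry": [1, 1, 2, 2, 3],
    "account": ["a", "b", "a", "c", "b"],
    "amount": [10.0, 5.0, 7.0, 1.0, 3.0],
})
encoded = pd.get_dummies(pdf, columns=["account"])   # uint8-style indicator columns

ddf = dd.from_pandas(encoded, npartitions=2)

with ProgressBar():   # simple built-in diagnostics for the local schedulers
    result = ddf.groupby("journal_entry").max().reset_index().compute()

print(result)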

out of core 4D image tif storage as hdf5 python

Submitted by 笑着哭i on 2019-12-01 09:36:18
Question: I have 27 GB of 2D tiff files that represent slices of a movie of 3D images. I want to be able to slice this data as if it were a simple 4D numpy array. It looks like dask.array is a good tool for cleanly manipulating the array once it's stored as an HDF5 file. How can I store these files as an HDF5 file in the first place if they do not all fit into memory? I am new to h5py and databases in general. Thanks.

Answer: Edit: use dask.array's imread function. As of dask 0.7.0 you don't need to store your images in HDF5. Use the imread function directly instead:

In [1]: from skimage.io import
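A sketch of the imread route named in the answer, assuming the slices match a glob like "frames/*.tif" (a placeholder path) and that scikit-image and h5py are installed:

import dask.array as da
from dask.array.image import imread

# Lazily stack every 2D tiff slice into one dask array; nothing is read until needed
stack = imread("frames/*.tif")        # placeholder glob; shape ~ (n_files, height, width)

print(stack.shape, stack.dtype)

frame0 = stack[0].compute()           # pull a single slice into memory as a numpy array

# Optionally spill the whole stack to HDF5 for later out-of-core access
da.to_hdf5("movie.h5", "/stack", stack)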

why is dot product in dask slower than in numpy

Submitted by 落爺英雄遲暮 on 2019-12-01 06:51:04
Question: A dot product in dask seems to run much slower than in numpy:

import numpy as np
x_np = np.random.normal(10, 0.1, size=(1000, 100))
y_np = x_np.transpose()
%timeit x_np.dot(y_np)  # 100 loops, best of 3: 7.17 ms per loop

import dask.array as da
x_dask = da.random.normal(10, 0.1, size=(1000, 100), chunks=(5, 5))
y_dask = x_dask.transpose()
%timeit x_dask.dot(y_dask)  # 1 loops, best of 3: 6.56 s per loop

Does anybody know what might be the reason for that? Is there anything I'm missing here?

Answer: Adjust chunk sizes. The answer by @isternberg is correct that you should adjust chunk sizes. A good choice of
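A sketch of the chunk-size fix the answer points to: chunks=(5, 5) splits the 1000x100 array into thousands of tiny blocks, so per-task scheduling overhead dwarfs the arithmetic; larger chunks remove most of that overhead.

import dask.array as da

# chunks=(5, 5) turns the 1000x100 array into 4000 tiny blocks -> overhead dominates
x_small = da.random.normal(10, 0.1, size=(1000, 100), chunks=(5, 5))

# Far fewer, larger blocks; for an array this small a single chunk is reasonable
x_big = da.random.normal(10, 0.1, size=(1000, 100), chunks=(1000, 100))

result = x_big.dot(x_big.transpose()).compute()
print(result.shape)   # (1000, 1000)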
