dask

Why do pandas and dask perform better when importing from CSV compared to HDF5?

我只是一个虾纸丫 submitted on 2019-12-03 17:14:47
I am working with a system that currently operates with large (>5 GB) .csv files. To increase performance, I am testing (A) different methods to create dataframes from disk (pandas vs. dask) as well as (B) different ways to store results to disk (.csv vs. HDF5 files). To benchmark performance, I did the following:

def dask_read_from_hdf():
    results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_dd_hdf = results_dd_hdf.Security.unique()
    hdf.close()

def pandas_read_from_hdf():
    results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns=['Security'])
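The benchmark functions in the excerpt are cut off, so here is a minimal, self-contained timing sketch in the same spirit. The file names ('results.csv', 'store.h5'), the key 'period1', and the 'Security' column are assumptions taken from the snippet above, not a reproduction of the original code. Note that dask is lazy, so each dask variant needs an explicit .compute() for the comparison to measure real work.

import time
import pandas as pd
import dask.dataframe as dd

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def pandas_read_from_csv():
    df = pd.read_csv('results.csv', usecols=['Security'])
    df.Security.unique()

def dask_read_from_csv():
    df = dd.read_csv('results.csv', usecols=['Security'])
    df.Security.unique().compute()          # force the lazy graph to run

def pandas_read_from_hdf():
    df = pd.read_hdf('store.h5', key='period1', columns=['Security'])
    df.Security.unique()

def dask_read_from_hdf():
    df = dd.read_hdf('store.h5', key='period1', columns=['Security'])
    df.Security.unique().compute()

for fn in (pandas_read_from_csv, dask_read_from_csv,
           pandas_read_from_hdf, dask_read_from_hdf):
    print(fn.__name__, timed(fn), 'seconds')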

move from pandas to dask to utilize all local cpu cores

核能气质少年 submitted on 2019-12-03 16:41:57
Recently I stumbled upon http://dask.pydata.org/en/latest/ . Since I have some pandas code which only runs on a single core, I wonder how to make use of my other CPU cores. Would dask work well to use all (local) CPU cores? If so, how compatible is it with pandas? Could I use multiple CPUs with pandas? So far I have read about releasing the GIL, but that all seems rather complicated.

Would dask work well to use all (local) CPU cores? Yes.
How compatible is it with pandas? Pretty compatible. Not 100%. You can mix in Pandas and NumPy and even pure Python stuff with Dask if needed.
Could I use multiple CPUs with pandas?
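As a minimal sketch of what "moving pandas code to dask" can look like on a single machine: the file pattern and column names below are hypothetical, and the choice of the process-based scheduler is just one option (the default threaded scheduler is often enough for numeric pandas/NumPy work).

import dask.dataframe as dd

# read a (hypothetical) large CSV into many partitions; each partition is a
# regular pandas DataFrame, so most pandas-style code carries over directly
df = dd.read_csv('data/big-*.csv')

# a typical pandas-style aggregation; dask builds a task graph instead of
# computing immediately
result = df.groupby('customer_id').amount.sum()

# run the graph on all local cores; 'processes' sidesteps the GIL for
# pure-Python-heavy work
print(result.compute(scheduler='processes'))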

Slicing a Dask Dataframe

我是研究僧i submitted on 2019-12-03 16:10:20
I have the following code where I would like to do a train/test split on a Dask dataframe:

df = dd.read_csv(csv_filename, sep=',', encoding="latin-1", names=cols, header=0, dtype='str')

But when I try to do slices like

for train, test in cv.split(X, y):
    df.fit(X[train], y[train])

it fails with the error

KeyError: '[11639 11641 11642 ..., 34997 34998 34999] not in index'

Any ideas?

Dask.dataframe doesn't support row-wise slicing. It does support the loc operation if you have a sensible index. However, in your case of train/test splitting you will probably be better served by the random_split method.
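A minimal sketch of random_split, the method mentioned in the answer; the file name and split fractions are arbitrary placeholders standing in for the question's setup.

import dask.dataframe as dd

# hypothetical input standing in for the question's read_csv call
df = dd.read_csv('data/input.csv', dtype='str')

# split row-wise by random assignment; the fractions must sum to 1 and
# random_state makes the split reproducible
train, test = df.random_split([0.8, 0.2], random_state=42)

# both halves are dask dataframes; compute() materializes them as pandas
train_pd = train.compute()
test_pd = test.compute()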

Item assignment to Python dask array objects

天大地大妈咪最大 submitted on 2019-12-03 13:46:50
Question: I've created a Python dask array and I'm trying to modify a slice of the array as follows:

import numpy as np
import dask.array as da

x = np.random.random((20000, 100, 100))              # create a NumPy array
dx = da.from_array(x, chunks=(x.shape[0], 10, 10))   # create a dask array from the NumPy array
dx[:50, :, :] = 0                                    # modify a slice of the dask array

This attempt to modify the dask array raises the exception:

TypeError: 'Array' object does not support item assignment

Is there a way to modify a dask array slice without raising an exception?
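One hedged workaround, assuming the goal is simply to zero the first 50 rows: build a new array from pieces instead of mutating the existing one.

import numpy as np
import dask.array as da

x = np.random.random((20000, 100, 100))
dx = da.from_array(x, chunks=(x.shape[0], 10, 10))

# instead of assigning into dx, construct a replacement: a block of zeros
# for the first 50 rows, followed by the untouched remainder
zeros_block = da.zeros((50,) + dx.shape[1:], chunks=(50, 10, 10), dtype=dx.dtype)
dx_new = da.concatenate([zeros_block, dx[50:]], axis=0)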

Strategy for partitioning dask dataframes efficiently

眉间皱痕 submitted on 2019-12-03 11:48:56
Question: The documentation for Dask talks about repartitioning to reduce overhead here. However, it seems to indicate that you need some knowledge of what your dataframe will look like beforehand (i.e., that there will be 1/100th of the data expected). Is there a good way to repartition sensibly without making assumptions? At the moment I just repartition with npartitions = ncores * magic_number, and set force to True to expand partitions if need be. This one-size-fits-all approach works but is definitely suboptimal, as my dataset varies in size.
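A minimal sketch of the two styles of repartitioning described here; the file pattern and size target are hypothetical, and the partition_size keyword is only available in newer dask releases, so treat it as version-dependent.

import dask.dataframe as dd

df = dd.read_csv('data/big-*.csv')        # hypothetical input

# option 1 (newer dask): target an approximate in-memory size per partition
# instead of guessing a partition count up front
df = df.repartition(partition_size='100MB')

# option 2: the explicit form from the question, with the core count taken
# from the machine rather than hard-coded, e.g.
# df = df.repartition(npartitions=os.cpu_count() * 4, force=True)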

How to efficiently send a large numpy array to the cluster with Dask.array

纵然是瞬间 submitted on 2019-12-03 08:54:35
I have a large NumPy array on my local machine that I want to parallelize with Dask.array on a cluster:

import numpy as np
x = np.random.random((1000, 1000, 1000))

However, when I use dask.array I find that my scheduler starts taking up a lot of RAM. Why is this? Shouldn't this data go to the workers?

import dask.array as da
x = da.from_array(x, chunks=(100, 100, 100))

from dask.distributed import Client
client = Client(...)

x = x.persist()

Whenever you persist or compute a Dask collection, that data goes to the scheduler, and from there to the workers. If you want to bypass storing data on the scheduler
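The answer is cut off here. One commonly suggested pattern is to scatter the array to the workers first and then wrap the remote data as a dask array; this is a hedged sketch with a hypothetical scheduler address, and it assumes from_delayed accepts the scattered future, which may vary across dask versions.

import numpy as np
import dask.array as da
from dask.distributed import Client

client = Client('scheduler-address:8786')        # hypothetical address

x = np.random.random((1000, 1000, 1000))

# ship the array straight to a worker instead of embedding it in a task
# graph that passes through the scheduler
future = client.scatter(x)

# wrap the remote data as a dask array, then rechunk it on the cluster
dx = da.from_delayed(future, shape=x.shape, dtype=x.dtype)
dx = dx.rechunk((100, 100, 100)).persist()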

Item assignment to Python dask array objects

末鹿安然 submitted on 2019-12-03 03:38:48
I've created a Python dask array and I'm trying to modify a slice of the array as follows:

import numpy as np
import dask.array as da

x = np.random.random((20000, 100, 100))              # create a NumPy array
dx = da.from_array(x, chunks=(x.shape[0], 10, 10))   # create a dask array from the NumPy array
dx[:50, :, :] = 0                                    # modify a slice of the dask array

This attempt to modify the dask array raises the exception:

TypeError: 'Array' object does not support item assignment

Is there a way to modify a dask array slice without raising an exception?

Currently dask.array does not support item assignment or any other form of mutation.
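A second hedged workaround, different from the concatenation sketch earlier: express the "assignment" functionally with a mask and da.where, which keeps everything lazy and chunked.

import numpy as np
import dask.array as da

x = np.random.random((20000, 100, 100))
dx = da.from_array(x, chunks=(x.shape[0], 10, 10))

# boolean mask that is True for the rows we want to overwrite
rows = da.arange(dx.shape[0], chunks=dx.chunksize[0])
mask = (rows < 50)[:, None, None]      # broadcast over the trailing two axes

# where the mask is True take 0, elsewhere keep the original values
dx_zeroed = da.where(mask, 0, dx)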

Strategy for partitioning dask dataframes efficiently

心已入冬 submitted on 2019-12-03 03:17:34
The documentation for Dask talks about repartitioning to reduce overhead here. However, it seems to indicate that you need some knowledge of what your dataframe will look like beforehand (i.e., that there will be 1/100th of the data expected). Is there a good way to repartition sensibly without making assumptions? At the moment I just repartition with npartitions = ncores * magic_number, and set force to True to expand partitions if need be. This one-size-fits-all approach works but is definitely suboptimal, as my dataset varies in size. The data is time series data, but unfortunately not at regular intervals.
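Since this version of the question mentions time series data, here is a hedged sketch of repartitioning by calendar frequency on a datetime index; the file pattern, column name, and frequency are hypothetical, and sorted=True assumes the input is already ordered by timestamp.

import dask.dataframe as dd

df = dd.read_csv('data/events-*.csv', parse_dates=['timestamp'])   # hypothetical

# give the frame a datetime index so dask knows the partition divisions
df = df.set_index('timestamp', sorted=True)

# repartition by calendar frequency instead of guessing a partition count;
# partition boundaries stay meaningful even when event density varies
df = df.repartition(freq='7D')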

Can dask parallelize reading from a CSV file?

╄→尐↘猪︶ㄣ submitted on 2019-12-02 17:43:27
I'm converting a large text file to HDF storage in hopes of faster data access. The conversion works fine; however, reading from the CSV file is not done in parallel. It is really slow (it takes about 30 minutes for a 1 GB text file on an SSD, so my guess is that it is not IO-bound). Is there a way to have it read in multiple threads in parallel? Since it might be important: I'm currently forced to run under Windows, just in case that makes any difference.

from dask import dataframe as ddf
df = ddf.read_csv("data/Measurements*.csv",
                  sep=';',
                  parse_dates=["DATETIME"],
                  blocksize=1000000,
                  )
df
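The snippet ends before the write step, so here is a hedged sketch of one way to keep the whole CSV-to-HDF conversion parallel; the output path is hypothetical, and the '*' in it gives one HDF5 file per partition so the writes are not serialized behind a single file.

import dask
from dask import dataframe as ddf

df = ddf.read_csv('data/Measurements*.csv',
                  sep=';',
                  parse_dates=['DATETIME'],
                  blocksize=64000000)          # ~64 MB per partition

# '*' in the target path writes one file per partition; compute=False
# returns the work as a delayed object instead of running it immediately
out = df.to_hdf('data/measurements-*.h5', '/data', compute=False)

# the process-based scheduler avoids the GIL, which can help on Windows
dask.compute(out, scheduler='processes')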

Dask read_sql_table errors out when using an SQLAlchemy expression

∥☆過路亽.° submitted on 2019-12-02 12:24:53
Question: I'm trying to use an SQLAlchemy expression with dask's read_sql_table in order to bring down a dataset that is created by joining and filtering a few different tables. The documentation indicates that this should be possible. (The example below does not include any joins, as they are not needed to replicate the problem.) I build my connection string, create an SQLAlchemy engine, and a table corresponding to a table in my database. (I'm using PostgreSQL.)

import dask.dataframe as dd
import pandas
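The excerpt cuts off before the actual code, so here is only a hedged sketch of the general shape of passing an SQLAlchemy core expression to dd.read_sql_table. The connection string, table name, and column names are hypothetical, and the labeling and expression details may differ across dask and SQLAlchemy versions.

import dask.dataframe as dd
from sqlalchemy import sql

uri = 'postgresql://user:password@localhost:5432/mydb'   # hypothetical

# build a plain SQLAlchemy core expression; the column used as the index
# should carry an explicit label so dask can refer to it by name
id_col = sql.column('id')
value_col = sql.column('value')
expr = (sql.select([id_col.label('id'), value_col.label('value')])
           .select_from(sql.table('measurements')))       # hypothetical table

ddf = dd.read_sql_table(expr, uri, index_col='id', npartitions=8)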