dask

Can I use functions imported from .py files in Dask/Distributed?

删除回忆录丶 submitted on 2019-12-21 08:07:19
Question: I have a question about serialization and imports. Should functions have their own imports, as I've seen done with PySpark? Is the following just plain wrong? Does mod.py need to be a conda/pip package? mod.py was written to a shared filesystem.

In [1]: from distributed import Executor
In [2]: e = Executor('127.0.0.1:8786')
In [3]: e
Out[3]: <Executor: scheduler="127.0.0.1:8786" processes=2 cores=2>
In [4]: import socket
In [5]: e.run(socket.gethostname)
Out[5]: {'172.20.12.7:53405': 'n1015'
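A minimal sketch of one way to make a local module callable on the workers, assuming a running scheduler at 127.0.0.1:8786 and a hypothetical mod.py that defines a function work(x). Client.upload_file ships the file to every current worker, and importing inside the task (PySpark-style) keeps the submitted function easy to serialize. Note that distributed's Executor was later renamed Client.

from dask.distributed import Client

client = Client('127.0.0.1:8786')   # connect to the scheduler
client.upload_file('mod.py')        # copy mod.py to every current worker

def call_work(x):
    import mod                      # import inside the task, PySpark-style
    return mod.work(x)

future = client.submit(call_work, 42)
print(future.result())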

How to specify the number of threads/processes for the default dask scheduler

本秂侑毒 submitted on 2019-12-21 07:56:04
Question: Is there a way to limit the number of cores used by the default threaded scheduler (the default when using dask dataframes)? With compute, you can specify it by using:

df.compute(get=dask.threaded.get, num_workers=20)

But I was wondering if there is a way to set this as the default, so you don't need to specify it for each compute call? This would e.g. be interesting in the case of a small cluster (e.g. of 64 cores) which is shared with other people (without a job system), and I don't want to
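A sketch of one way to make such a limit the default: the threaded scheduler takes its pool from dask's global configuration, so registering a fixed-size ThreadPool once applies to every later compute call. The value 20 is just an example.

import dask
from multiprocessing.pool import ThreadPool

# Use at most 20 threads for every threaded-scheduler compute() call.
dask.config.set(pool=ThreadPool(20))

# Newer dask versions also accept this per call, without the old get= keyword:
# df.compute(scheduler='threads', num_workers=20)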

Best practices in setting number of dask workers

折月煮酒 submitted on 2019-12-21 07:13:43
Question: I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster. The terms I came across are: thread, process, processor, node, worker, scheduler. My question is how to set the number of each, and whether there is a strict or recommended relationship between any of them. For example:

- 1 worker per node, with n processes for the n cores on the node?
- Are threads and processes the same concept?
- In dask-mpi I have to set nthreads but they show up as
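A minimal local sketch of how the knobs line up, under the usual reading that a node runs one or more worker processes and each worker process runs some number of threads. n_workers and threads_per_worker are real LocalCluster keywords; the 2/4 split below is just an example.

from dask.distributed import Client, LocalCluster

# One node, 2 worker processes, 4 threads each -> 8 cores used in total.
cluster = LocalCluster(n_workers=2, threads_per_worker=4)
client = Client(cluster)
print(client)   # shows the resulting worker/thread/memory layout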

Python Dask - vertical concatenation of 2 DataFrames

99封情书 submitted on 2019-12-21 04:26:59
Question: I am trying to vertically concatenate two Dask DataFrames. I have the following Dask DataFrame:

import pandas as pd
import dask.dataframe as dd

d = [
    ['A', 'B', 'C', 'D', 'E', 'F'],
    [1, 4, 8, 1, 3, 5],
    [6, 6, 2, 2, 0, 0],
    [9, 4, 5, 0, 6, 35],
    [0, 1, 7, 10, 9, 4],
    [0, 7, 2, 6, 1, 2]
]
df = pd.DataFrame(d[1:], columns=d[0])
ddf = dd.from_pandas(df, npartitions=5)

Here is the data as a Pandas DataFrame:

   A  B  C   D  E   F
0  1  4  8   1  3   5
1  6  6  2   2  0   0
2  9  4  5   0  6  35
3  0  1  7  10  9   4
4  0  7  2   6  1   2

Here is the Dask DataFrame:

Dask DataFrame Structure:
A B C D E F
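A short sketch of the vertical concatenation itself, assuming a second frame with the same columns (here just a second copy of df, purely for illustration): dask.dataframe.concat stacks frames row-wise by default.

ddf2 = dd.from_pandas(df, npartitions=2)   # stand-in second frame

# interleave_partitions=True is needed here because both frames cover the
# same index range, so their divisions overlap.
combined = dd.concat([ddf, ddf2], interleave_partitions=True)
print(combined.compute())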

How to efficiently send a large numpy array to the cluster with Dask.array

穿精又带淫゛_ submitted on 2019-12-21 02:48:14
Question: I have a large NumPy array on my local machine that I want to parallelize with Dask.array on a cluster:

import numpy as np
x = np.random.random((1000, 1000, 1000))

However, when I use dask.array I find that my scheduler starts taking up a lot of RAM. Why is this? Shouldn't this data go to the workers?

import dask.array as da
x = da.from_array(x, chunks=(100, 100, 100))

from dask.distributed import Client
client = Client(...)
x = x.persist()

Answer 1: Whenever you persist or compute a Dask collection
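A hedged sketch of one way to avoid routing the data through the scheduler (the answer above is cut off, so this is not necessarily its full content): scatter the array to a worker first, then build the dask array on top of the resulting future, under the assumption that da.from_delayed accepts the scattered future. The scheduler address is an assumption.

import numpy as np
import dask.array as da
from dask.distributed import Client

client = Client('127.0.0.1:8786')          # assumed scheduler address
x_np = np.random.random((1000, 1000, 1000))

future = client.scatter(x_np)              # ships the data straight to a worker
x = da.from_delayed(future, shape=x_np.shape, dtype=x_np.dtype)
x = x.rechunk((100, 100, 100)).persist()   # split into chunks on the cluster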

Can dask parallelize reading from a CSV file?

谁都会走 submitted on 2019-12-20 09:04:07
Question: I'm converting a large text file to HDF storage in hopes of faster data access. The conversion works all right; however, reading from the CSV file is not done in parallel. It is really slow (it takes about 30 min for a 1 GB text file on an SSD, so my guess is that it is not IO-bound). Is there a way to have it read in multiple threads in parallel? Since it might be important, I'm currently forced to run under Windows -- just in case that makes any difference.

from dask import dataframe as ddf
df =
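A sketch under assumed file names ('data.csv', 'data.hdf' and the key '/data' are hypothetical): dd.read_csv splits the input into byte blocks that can be parsed concurrently, and running the conversion on the process-based scheduler sidesteps the GIL, which is often what keeps a threaded CSV parse on one core.

import dask
import dask.dataframe as ddf

df = ddf.read_csv('data.csv', blocksize=64 * 1024 * 1024)  # parse in 64 MB blocks

# Process-based workers avoid the GIL; note that the write to a single HDF
# file may still be serialized by a lock even when parsing runs in parallel.
with dask.config.set(scheduler='processes'):
    df.to_hdf('data.hdf', '/data')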

Workaround for Item assignment not supported in dask

谁都会走 submitted on 2019-12-20 07:27:11
Question: I am trying to convert my code base from NumPy arrays to dask because my NumPy arrays are exceeding memory and raising MemoryError. But I have come to learn that item assignment (mutable arrays) is not yet implemented for dask arrays, so I am getting:

NotImplementedError: Item assignment with <class 'tuple'> not supported

Is there any workaround for my code below?

for i, mask in enumerate(masks):
    bounds = find_boundaries(mask, mode='inner')
    X2, Y2 = np.nonzero(bounds)
    X2 = da.from_array(X2, 'auto')
    Y2 = da.from
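A minimal sketch of the common workaround (illustrative arrays, not the asker's data): instead of mutating elements in place, express the update with da.where, which builds a new array from a boolean condition.

import numpy as np
import dask.array as da

x = da.from_array(np.zeros((1000, 1000)), chunks=(250, 250))
hit = x == 0                      # boolean mask instead of indexed assignment
x = da.where(hit, 1.0, x)         # the equivalent of "x[hit] = 1.0" without mutation
print(x.sum().compute())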

Export dask groups to csv

风流意气都作罢 submitted on 2019-12-20 06:39:37
Question: I have a single, large file. It has 40,955,924 lines and is >13 GB. I need to be able to separate this file into individual files based on a single field. If I were using a pd.DataFrame I would use this:

for k, v in df.groupby(['id']):
    v.to_csv(k, sep='\t', header=True, index=False)

However, I get the error KeyError: 'Column not found: 0'. There is a solution to this specific error on Iterate over GroupBy object in dask, but this requires using pandas to store a copy of the dataframe,
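A sketch of one possible approach (not necessarily the posted answer; the input file name and blocksize are assumptions): writing each group from inside groupby().apply keeps the data on the workers instead of collecting a pandas copy on the client.

import dask.dataframe as dd

def dump_group(g):
    # g is a complete pandas group after dask's groupby shuffle
    key = g['id'].iloc[0]
    g.to_csv(f'{key}.tsv', sep='\t', header=True, index=False)
    return 0

ddf = dd.read_csv('big_file.tsv', sep='\t', blocksize='256MB')
ddf.groupby('id').apply(dump_group, meta=('written', 'int64')).compute()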

Converting Dask Scalar to integer value (or save it to text file)

丶灬走出姿态 submitted on 2019-12-19 11:33:42
Question: I have calculated a sum using dask:

from dask import dataframe
all_data = dataframe.read_csv(path)
total_sum = all_data.account_balance.sum()

The csv file has a column named account_balance. total_sum is a dd.Scalar object, which seems difficult to convert to an integer. How do I get the integer version of it? Saving it to a .txt file containing the number is also OK. I have also tried total_sum.compute(). Thanks.

Answer 1: .compute() does indeed bring you a real number, as you can see in
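A minimal sketch continuing that answer, assuming a CSV at 'accounts.csv' (hypothetical path): calling .compute() on the Scalar yields a plain number, which can then be cast and written out.

from dask import dataframe

all_data = dataframe.read_csv('accounts.csv')
total_sum = all_data.account_balance.sum()

value = int(total_sum.compute())       # concrete Python int
with open('total.txt', 'w') as f:      # or save it to a text file
    f.write(str(value))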

Add a value to a column of Dask DataFrames imported using read_csv

时光总嘲笑我的痴心妄想 submitted on 2019-12-19 11:14:12
Question: Suppose that five files are imported into Dask using read_csv. To do this, I use this code:

import dask.dataframe as dd
data = dd.read_csv(final_file_list_msg, header=None)

Every file has ten columns. I want to add 1 to the first column of file 1, 2 to the first column of file 2, 3 to the first column of file 3, etc.

Answer 1: Let's assume that you have several files following this scheme:

dummy/
├── file01.csv
├── file02.csv
├── file03.csv

First we create them via

import os
import pandas as pd
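A sketch of one possible approach, separate from the (truncated) answer above; the file paths are assumptions. Reading each file on its own makes the per-file offset straightforward, and dd.concat stitches the pieces back into a single Dask DataFrame.

import dask.dataframe as dd

files = ['dummy/file01.csv', 'dummy/file02.csv', 'dummy/file03.csv']
parts = []
for i, path in enumerate(files, start=1):
    part = dd.read_csv(path, header=None)
    part[0] = part[0] + i              # with header=None the first column is named 0
    parts.append(part)

data = dd.concat(parts)
print(data.head())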