dask

Airflow + celery or dask. For what, when?

Submitted by 為{幸葍}努か on 2019-12-04 17:44:17
Question: I read in the official Airflow documentation the following: What does this mean exactly? What do the authors mean by "scaling out"? That is, when is it not enough to use Airflow alone, and when would anyone use Airflow in combination with something like Celery? (The same question applies to Dask.) Answer 1: In Airflow terminology, an "Executor" is the component responsible for running your tasks. The LocalExecutor does this by spawning threads on the computer Airflow runs on and letting each thread execute a task. Naturally your
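The answer above is cut off, but the distinction it draws is between running tasks in local threads and farming them out to remote workers. The sketch below uses dask.distributed directly rather than Airflow, and is only meant to illustrate what "scaling out" buys; the scheduler address and the task function are made up for illustration.

from dask.distributed import Client

# Connect to a (hypothetical) scheduler whose workers live on other machines.
client = Client('tcp://scheduler-host:8786')

def task(x):
    return x * 2

# Each call runs on whichever remote worker is free, not in a local thread.
futures = client.map(task, range(100))
results = client.gather(futures)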

Row by row processing of a Dask DataFrame

Submitted by 白昼怎懂夜的黑 on 2019-12-04 15:08:58
I need to process a large file and change some values. I would like to do something like this:

for index, row in dataFrame.iterrows():
    foo = doSomeStuffWith(row)
    lol = doOtherStuffWith(row)
    dataFrame['colx'][index] = foo
    dataFrame['coly'][index] = lol

Unfortunately for me, I cannot do dataFrame['colx'][index] = foo! My number of rows is quite large and I need to process a large number of columns, so I'm afraid that Dask may read the file several times if I do one dataFrame.apply(...) for each column. Other solutions are to manually break my data into chunks and use pandas, or to just throw anything
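The question is truncated above. As a hedged sketch of how this is commonly answered, the per-row work can be pushed into a single map_partitions call so each partition is read and processed only once. The input file name is hypothetical, and the two helpers below are placeholders standing in for the question's doSomeStuffWith/doOtherStuffWith.

import pandas as pd
import dask.dataframe as dd

def doSomeStuffWith(row):
    return len(row)              # placeholder logic

def doOtherStuffWith(row):
    return str(row.iloc[0])      # placeholder logic

def process_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on one in-memory pandas partition at a time.
    pdf = pdf.copy()
    pdf['colx'] = pdf.apply(doSomeStuffWith, axis=1)
    pdf['coly'] = pdf.apply(doOtherStuffWith, axis=1)
    return pdf

ddf = dd.read_csv('large_file.csv')           # hypothetical input file
ddf = ddf.map_partitions(process_partition)
ddf.to_csv('processed-*.csv')                 # one pass over the data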

Read a large csv into a sparse pandas dataframe in a memory efficient way

Submitted by 百般思念 on 2019-12-04 09:57:44
Question: The pandas read_csv function doesn't seem to have a sparse option. I have CSV data with a ton of zeros in it (it compresses very well, and stripping out any 0 value reduces it to almost half the original size). I've tried loading it into a dense matrix first with read_csv and then calling to_sparse, but it takes a long time and chokes on text fields, although most of the data is floating point. If I call pandas.get_dummies(df) first to convert the categorical columns to ones & zeros, then
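The question cuts off above. As a hedged sketch, one memory-conscious approach is to read the CSV in chunks and convert each chunk's numeric columns to pandas' sparse dtype before concatenating, so a fully dense copy of the file is never held in memory at once; the file name, chunk size, and fill value are assumptions, not values from the question.

import pandas as pd

chunks = []
for chunk in pd.read_csv('big_sparse.csv', chunksize=100_000):
    numeric = chunk.select_dtypes(include='number')
    other = chunk.select_dtypes(exclude='number')       # e.g. text columns stay dense
    sparse = numeric.astype(pd.SparseDtype('float', fill_value=0.0))
    chunks.append(pd.concat([sparse, other], axis=1))

df = pd.concat(chunks)
print(df.memory_usage(deep=True).sum())                 # check the savings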

dask DataFrame equivalent of pandas DataFrame sort_values

Submitted by 怎甘沉沦 on 2019-12-04 08:31:41
What would be the equivalent of sort_values in pandas for a dask DataFrame? I am trying to scale some pandas code which has memory issues to use a dask DataFrame instead. Would the equivalent be ddf.set_index([col1, col2], sorted=True)?

Sorting in parallel is hard. You have two options in dask.dataframe:

set_index

As of now, you can call set_index with a single-column index:

In [1]: import pandas as pd
In [2]: import dask.dataframe as dd
In [3]: df = pd.DataFrame({'x': [3, 2, 1], 'y': ['a', 'b', 'c']})
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.set_index('x').compute()
Out[5]:
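The answer is cut off after Out[5]:. As a hedged sketch of the pattern it demonstrates, set_index on a column behaves roughly like sort_values on that column, while nlargest/nsmallest cover "top N" needs without a full sort; whether the truncated answer went on to mention nlargest is an assumption on my part.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [3, 2, 1], 'y': ['a', 'b', 'c']})
ddf = dd.from_pandas(df, npartitions=2)

sorted_by_x = ddf.set_index('x')      # roughly pandas' df.sort_values('x')
top_two = ddf.nlargest(2, 'x')        # "top N" without sorting everything

print(sorted_by_x.compute())
print(top_two.compute())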

Python PANDAS: Converting from pandas/numpy to dask dataframe/array

Submitted by 旧城冷巷雨未停 on 2019-12-04 07:36:22
I am working on converting a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting: Python PANDAS: Stack by Enumerated Date to Create Records Vectorized

import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO

test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''

df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
df_test
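The excerpt stops mid-program. As a hedged sketch of the basic conversions the title asks about, these are the standard entry points for moving between pandas/numpy and their dask counterparts; the partition and chunk sizes below are arbitrary choices, not values from the question.

import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

pdf = pd.DataFrame({'id': [1, 1, 2, 2], 'units': [4, 4, 3, 3]})
arr = np.arange(12).reshape(3, 4)

ddf = dd.from_pandas(pdf, npartitions=2)    # pandas DataFrame -> dask DataFrame
darr = da.from_array(arr, chunks=(3, 2))    # numpy array      -> dask array

back_to_pandas = ddf.compute()              # dask -> pandas again
back_to_numpy = darr.compute()              # dask -> numpy again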

How to do row processing and item assignment in Dask

Submitted by 自闭症网瘾萝莉.ら on 2019-12-04 05:27:22
Question: Similar unanswered question: Row by row processing of a Dask DataFrame. I'm working with dataframes that are millions of rows long, so now I'm trying to have all dataframe operations performed in parallel. One such operation I need converted to Dask is:

for row in df.itertuples():
    ratio = row.ratio
    tmpratio = row.tmpratio
    tmplabel = row.tmplabel
    if tmpratio > ratio:
        df.loc[row.Index, 'ratio'] = tmpratio
        df.loc[row.Index, 'label'] = tmplabel

What is the appropriate way to set a value by index
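The question is truncated above. As a hedged sketch (not necessarily the accepted answer), the row-wise update can be expressed as a vectorized, partition-local operation, which Dask can evaluate in parallel without itertuples. Column names follow the question; the data is made up.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'ratio':    [0.2, 0.9, 0.4],
    'label':    ['a', 'b', 'c'],
    'tmpratio': [0.5, 0.1, 0.7],
    'tmplabel': ['x', 'y', 'z'],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# Where tmpratio beats ratio, take the tmp values; otherwise keep the originals.
better = ddf['tmpratio'] > ddf['ratio']
ddf['label'] = ddf['tmplabel'].where(better, ddf['label'])
ddf['ratio'] = ddf['tmpratio'].where(better, ddf['ratio'])

print(ddf.compute())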

What is the role of npartitions in a Dask dataframe?

Submitted by 一世执手 on 2019-12-04 03:47:20
I see the parameter npartitions in many functions, but I don't understand what it is good for / used for. http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

head(...): Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

repartition(...): Number of partitions of output, must be less than npartitions of input. Only used if divisions isn't specified.

Is the number of partitions probably 5 in this case? (Image source
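The referenced image did not survive the excerpt. As a hedged illustration of what npartitions means: a dask DataFrame is a sequence of pandas DataFrames ("partitions"), and npartitions is simply how many of them there are. The numbers below are arbitrary.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': range(10)})
ddf = dd.from_pandas(pdf, npartitions=5)

print(ddf.npartitions)                    # 5
print(ddf.map_partitions(len).compute())  # rows per partition, e.g. 2, 2, 2, 2, 2

ddf2 = ddf.repartition(npartitions=2)     # coarser partitioning of the same data
print(ddf2.npartitions)                   # 2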

Can I use functions imported from .py files in Dask/Distributed?

Submitted by 自古美人都是妖i on 2019-12-04 01:55:40
I have a question about serialization and imports. Should functions have their own imports, like I've seen done with PySpark? Is the following just plain wrong? Does mod.py need to be a conda/pip package? mod.py was written to a shared filesystem.

In [1]: from distributed import Executor
In [2]: e = Executor('127.0.0.1:8786')
In [3]: e
Out[3]: <Executor: scheduler="127.0.0.1:8786" processes=2 cores=2>
In [4]: import socket
In [5]: e.run(socket.gethostname)
Out[5]: {'172.20.12.7:53405': 'n1015', '172.20.12.8:53779': 'n1016'}
In [6]: %%file mod.py
   ...: def hostname():
   ...:     return 'the hostname'
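The session above is cut off. As a hedged sketch of one documented option, Client.upload_file ships a local .py file to every worker so its functions can be imported inside tasks; the scheduler address and module contents below are assumptions, not taken from the question.

from dask.distributed import Client

client = Client('127.0.0.1:8786')      # hypothetical scheduler address
client.upload_file('mod.py')           # distribute the module to all workers

def use_mod(x):
    import mod                         # imported inside the task, on the worker
    return mod.hostname()

future = client.submit(use_mod, 1)
print(future.result())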

How do I run a dask.distributed cluster in a single thread?

Submitted by 半腔热情 on 2019-12-04 01:21:33
How can I run a complete dask.distributed cluster in a single thread? I want to use this for debugging or profiling. Note: this is a frequently asked question; I'm adding the question and answer here on Stack Overflow just for future reuse.

Local Scheduler

If you can get by with the single-machine scheduler's API (just compute), then you can use the single-threaded scheduler:

x.compute(scheduler='single-threaded')

Distributed Scheduler - Single Machine

If you want to run a dask.distributed cluster on a single machine, you can start the client with no arguments:

from dask.distributed import Client
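The answer is cut off after the import line. As a hedged sketch of the single-machine pattern it starts to describe, a Client created with processes=False keeps all workers in the current process, which makes it easier to step through with a debugger or profiler; whether the original answer used exactly these arguments is an assumption.

import dask
from dask.distributed import Client

client = Client(processes=False, n_workers=1, threads_per_worker=1)

result = dask.delayed(sum)([1, 2, 3]).compute()
print(result)      # 6, computed inside this process

client.close()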

Best practices in setting number of dask workers

Submitted by 半城伤御伤魂 on 2019-12-04 00:09:41
I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster. The terms I came across are: thread, process, processor, node, worker, scheduler. My question is how to set the number of each, and whether there is a strict or recommended relationship between any of these. For example:

- 1 worker per node, with n processes for the n cores on the node?
- Are threads and processes the same concept? In dask-mpi I have to set nthreads, but they show up as processes in the client.

Any other suggestions?

By "node" people typically mean a physical or virtual machine.
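The answer is cut off above. As a hedged illustration of how the terms map onto code: on one node (machine) you start some number of worker processes, each running some number of threads, with one scheduler coordinating them. The 4-process by 2-thread split below is an arbitrary example, not a recommendation from the answer.

from dask.distributed import LocalCluster, Client

# 4 worker processes * 2 threads each = 8 cores used on this node.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)

print(client.scheduler_info()['workers'].keys())  # one entry per worker process

client.close()
cluster.close()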