dask

Airflow + celery or dask. For what, when?

Submitted by 為{幸葍}努か on 2019-12-04 17:44:17
Question: I read in the official Airflow documentation the following: What does this mean exactly? What do the authors mean by "scaling out"? That is, when is it not enough to use Airflow alone, and when would anyone use Airflow in combination with something like Celery? (The same question applies to Dask.) Answer 1: In Airflow terminology, an "Executor" is the component responsible for running your tasks. The LocalExecutor does this by spawning threads on the computer Airflow runs on and letting each thread execute a task. Naturally your
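The answer above is cut off, but the distinction it draws is between running tasks in local threads and farming them out to remote workers. The sketch below uses dask.distributed directly rather than Airflow, and is only meant to illustrate what "scaling out" buys; the scheduler address and the task function are made up for illustration.

from dask.distributed import Client

# Connect to a (hypothetical) scheduler whose workers live on other machines.
client = Client('tcp://scheduler-host:8786')

def task(x):
    return x * 2

# Each call runs on whichever remote worker is free, not in a local thread.
futures = client.map(task, range(100))
results = client.gather(futures)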

Row by row processing of a Dask DataFrame

Submitted by 白昼怎懂夜的黑 on 2019-12-04 15:08:58
I need to process a large file and change some values. I would like to do something like this:

for index, row in dataFrame.iterrows():
    foo = doSomeStuffWith(row)
    lol = doOtherStuffWith(row)
    dataFrame['colx'][index] = foo
    dataFrame['coly'][index] = lol

Unfortunately for me, I cannot do dataFrame['colx'][index] = foo! My number of rows is quite large and I need to process a large number of columns, so I'm afraid that Dask may read the file several times if I do one dataFrame.apply(...) for each column. Other solutions are to manually break my data into chunks and use pandas, or to just throw anything
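The question is truncated above. As a hedged sketch of how this is commonly answered, the per-row work can be pushed into a single map_partitions call so each partition is read and processed only once. The input file name is hypothetical, and the two helpers below are placeholders standing in for the question's doSomeStuffWith/doOtherStuffWith.

import pandas as pd
import dask.dataframe as dd

def doSomeStuffWith(row):
    return len(row)              # placeholder logic

def doOtherStuffWith(row):
    return str(row.iloc[0])      # placeholder logic

def process_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs on one in-memory pandas partition at a time.
    pdf = pdf.copy()
    pdf['colx'] = pdf.apply(doSomeStuffWith, axis=1)
    pdf['coly'] = pdf.apply(doOtherStuffWith, axis=1)
    return pdf

ddf = dd.read_csv('large_file.csv')           # hypothetical input file
ddf = ddf.map_partitions(process_partition)
ddf.to_csv('processed-*.csv')                 # one pass over the data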

Read a large csv into a sparse pandas dataframe in a memory efficient way

Submitted by 百般思念 on 2019-12-04 09:57:44
Question: The pandas read_csv function doesn't seem to have a sparse option. I have CSV data with a ton of zeros in it (it compresses very well, and stripping out any 0 value reduces it to almost half the original size). I've tried loading it into a dense matrix first with read_csv and then calling to_sparse, but it takes a long time and chokes on text fields, although most of the data is floating point. If I call pandas.get_dummies(df) first to convert the categorical columns to ones & zeros, then
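The question cuts off above. As a hedged sketch, one memory-conscious approach is to read the CSV in chunks and convert each chunk's numeric columns to pandas' sparse dtype before concatenating, so a fully dense copy of the file is never held in memory at once; the file name, chunk size, and fill value are assumptions, not values from the question.

import pandas as pd

chunks = []
for chunk in pd.read_csv('big_sparse.csv', chunksize=100_000):
    numeric = chunk.select_dtypes(include='number')
    other = chunk.select_dtypes(exclude='number')       # e.g. text columns stay dense
    sparse = numeric.astype(pd.SparseDtype('float', fill_value=0.0))
    chunks.append(pd.concat([sparse, other], axis=1))

df = pd.concat(chunks)
print(df.memory_usage(deep=True).sum())                 # check the savings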

dask DataFrame equivalent of pandas DataFrame sort_values

Submitted by 怎甘沉沦 on 2019-12-04 08:31:41
What would be the equivalent of sort_values in pandas for a dask DataFrame? I am trying to scale some pandas code which has memory issues to use a dask DataFrame instead. Would the equivalent be ddf.set_index([col1, col2], sorted=True)?

Sorting in parallel is hard. You have two options in dask.dataframe:

set_index

As of now, you can call set_index with a single-column index:

In [1]: import pandas as pd
In [2]: import dask.dataframe as dd
In [3]: df = pd.DataFrame({'x': [3, 2, 1], 'y': ['a', 'b', 'c']})
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.set_index('x').compute()
Out[5]:
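The answer is cut off after Out[5]:. As a hedged sketch of the pattern it demonstrates, set_index on a column behaves roughly like sort_values on that column, while nlargest/nsmallest cover "top N" needs without a full sort; whether the truncated answer went on to mention nlargest is an assumption on my part.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [3, 2, 1], 'y': ['a', 'b', 'c']})
ddf = dd.from_pandas(df, npartitions=2)

sorted_by_x = ddf.set_index('x')      # roughly pandas' df.sort_values('x')
top_two = ddf.nlargest(2, 'x')        # "top N" without sorting everything

print(sorted_by_x.compute())
print(top_two.compute())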

Python PANDAS: Converting from pandas/numpy to dask dataframe/array

Submitted by 旧城冷巷雨未停 on 2019-12-04 07:36:22
I am working on converting a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting: Python PANDAS: Stack by Enumerated Date to Create Records Vectorized

import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO

test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''

df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
df_test
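The excerpt stops mid-program. As a hedged sketch of the basic conversions the title asks about, these are the standard entry points for moving between pandas/numpy and their dask counterparts; the partition and chunk sizes below are arbitrary choices, not values from the question.

import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

pdf = pd.DataFrame({'id': [1, 1, 2, 2], 'units': [4, 4, 3, 3]})
arr = np.arange(12).reshape(3, 4)

ddf = dd.from_pandas(pdf, npartitions=2)    # pandas DataFrame -> dask DataFrame
darr = da.from_array(arr, chunks=(3, 2))    # numpy array      -> dask array

back_to_pandas = ddf.compute()              # dask -> pandas again
back_to_numpy = darr.compute()              # dask -> numpy again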

How to do row processing and item assignment in Dask

Submitted by 自闭症网瘾萝莉.ら on 2019-12-04 05:27:22
Question: Similar unanswered question: Row by row processing of a Dask DataFrame. I'm working with dataframes that are millions of rows long, so now I'm trying to have all dataframe operations performed in parallel. One such operation I need converted to Dask is:

for row in df.itertuples():
    ratio = row.ratio
    tmpratio = row.tmpratio
    tmplabel = row.tmplabel
    if tmpratio > ratio:
        df.loc[row.Index, 'ratio'] = tmpratio
        df.loc[row.Index, 'label'] = tmplabel

What is the appropriate way to set a value by index
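The question is truncated above. As a hedged sketch (not necessarily the accepted answer), the row-wise update can be expressed as a vectorized, partition-local operation, which Dask can evaluate in parallel without itertuples. Column names follow the question; the data is made up.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'ratio':    [0.2, 0.9, 0.4],
    'label':    ['a', 'b', 'c'],
    'tmpratio': [0.5, 0.1, 0.7],
    'tmplabel': ['x', 'y', 'z'],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# Where tmpratio beats ratio, take the tmp values; otherwise keep the originals.
better = ddf['tmpratio'] > ddf['ratio']
ddf['label'] = ddf['tmplabel'].where(better, ddf['label'])
ddf['ratio'] = ddf['tmpratio'].where(better, ddf['ratio'])

print(ddf.compute())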

What is the role of npartitions in a Dask dataframe?

Submitted by 一世执手 on 2019-12-04 03:47:20
I see the parameter npartitions in many functions, but I don't understand what it is good for / used for. http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

head(...): Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

repartition(...): Number of partitions of output, must be less than npartitions of input. Only used if divisions isn't specified.

Is the number of partitions probably 5 in this case? (Image source
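The referenced image did not survive the excerpt. As a hedged illustration of what npartitions means: a dask DataFrame is a sequence of pandas DataFrames ("partitions"), and npartitions is simply how many of them there are. The numbers below are arbitrary.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': range(10)})
ddf = dd.from_pandas(pdf, npartitions=5)

print(ddf.npartitions)                    # 5
print(ddf.map_partitions(len).compute())  # rows per partition, e.g. 2, 2, 2, 2, 2

ddf2 = ddf.repartition(npartitions=2)     # coarser partitioning of the same data
print(ddf2.npartitions)                   # 2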

Can I use functions imported from .py files in Dask/Distributed?

Submitted by 自古美人都是妖i on 2019-12-04 01:55:40
I have a question about serialization and imports. Should functions have their own imports, like I've seen done with PySpark? Is the following just plain wrong? Does mod.py need to be a conda/pip package? mod.py was written to a shared filesystem.

In [1]: from distributed import Executor
In [2]: e = Executor('127.0.0.1:8786')
In [3]: e
Out[3]: <Executor: scheduler="127.0.0.1:8786" processes=2 cores=2>
In [4]: import socket
In [5]: e.run(socket.gethostname)
Out[5]: {'172.20.12.7:53405': 'n1015', '172.20.12.8:53779': 'n1016'}
In [6]: %%file mod.py
   ...: def hostname():
   ...:     return 'the hostname'
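The session above is cut off. As a hedged sketch of one documented option, Client.upload_file ships a local .py file to every worker so its functions can be imported inside tasks; the scheduler address and module contents below are assumptions, not taken from the question.

from dask.distributed import Client

client = Client('127.0.0.1:8786')      # hypothetical scheduler address
client.upload_file('mod.py')           # distribute the module to all workers

def use_mod(x):
    import mod                         # imported inside the task, on the worker
    return mod.hostname()

future = client.submit(use_mod, 1)
print(future.result())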

How do I run a dask.distributed cluster in a single thread?

Submitted by 半腔热情 on 2019-12-04 01:21:33
How can I run a complete dask.distributed cluster in a single thread? I want to use this for debugging or profiling. Note: this is a frequently asked question; I'm adding the question and answer here on Stack Overflow just for future reuse.

Local Scheduler

If you can get by with the single-machine scheduler's API (just compute), then you can use the single-threaded scheduler:

x.compute(scheduler='single-threaded')

Distributed Scheduler - Single Machine

If you want to run a dask.distributed cluster on a single machine, you can start the client with no arguments:

from dask.distributed import Client
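The answer is cut off after the import line. As a hedged sketch of the single-machine pattern it starts to describe, a Client created with processes=False keeps all workers in the current process, which makes it easier to step through with a debugger or profiler; whether the original answer used exactly these arguments is an assumption.

import dask
from dask.distributed import Client

client = Client(processes=False, n_workers=1, threads_per_worker=1)

result = dask.delayed(sum)([1, 2, 3]).compute()
print(result)      # 6, computed inside this process

client.close()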

Best practices in setting number of dask workers

Submitted by 半城伤御伤魂 on 2019-12-04 00:09:41
I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster. The terms I came across are: thread, process, processor, node, worker, scheduler. My question is how to set the number of each, and whether there is a strict or recommended relationship between any of these. For example:

- 1 worker per node, with n processes for the n cores on the node?
- Are threads and processes the same concept? In dask-mpi I have to set nthreads, but they show up as processes in the client.

Any other suggestions?

By "node" people typically mean a physical or virtual machine.
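The answer is cut off above. As a hedged illustration of how the terms map onto code: on one node (machine) you start some number of worker processes, each running some number of threads, with one scheduler coordinating them. The 4-process by 2-thread split below is an arbitrary example, not a recommendation from the answer.

from dask.distributed import LocalCluster, Client

# 4 worker processes * 2 threads each = 8 cores used on this node.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)

print(client.scheduler_info()['workers'].keys())  # one entry per worker process

client.close()
cluster.close()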