dask

Using Dask compute causes execution to hang

Submitted by 大兔子大兔子 on 2019-12-12 04:57:15
Question: This is a follow-up to a potential answer to one of my previous questions, about using Dask compute to access one element in a large array. Why does using Dask compute cause the execution below to hang? Here's the working code snippet:

# Suppose you created a scheduler at the IP address 111.111.11.11:8786
from dask.distributed import Client
import dask.array as da

# client1
client1 = Client("111.111.11.11:8786")
x = da.ones(10000000, chunks=(100000,))  # 1e7 size array cut into 1e5
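The snippet above is truncated by the aggregator. As a point of reference only, here is a minimal sketch of how a single element is usually pulled out of such an array with compute, assuming the scheduler address from the question and at least one attached worker (a common cause of an apparent hang is a scheduler with no workers connected):

```python
from dask.distributed import Client
import dask.array as da

client = Client("111.111.11.11:8786")      # assumes this scheduler is reachable and has workers
x = da.ones(10000000, chunks=(100000,))    # same array as in the question

# Pulling a single element back to the local process; compute() blocks until
# the cluster finishes, so it will appear to hang if no workers ever join.
first = x[0].compute()
print(first)  # 1.0
```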

Dask Bag read_text() line order

Submitted by 老子叫甜甜 on 2019-12-12 03:43:30
Question: Does dask.bag.read_text() preserve the line order? Is it still preserved when reading from multiple files?

bag = db.read_text('program.log')
bag = db.read_text(['program.log', 'program.log.1'])

Answer 1: Informally, yes, most Dask.bag operations do preserve order. This behavior is not strictly guaranteed; however, I don't see any reason to anticipate a change in the near future.

Source: https://stackoverflow.com/questions/39652733/dask-bag-read-text-line-order
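A quick way to check this informally is to read a couple of files and compare the first elements of the bag with the first lines of the first file. A minimal sketch, with hypothetical file names:

```python
import dask.bag as db

# Hypothetical log files; any small text files will do.
bag = db.read_text(['program.log', 'program.log.1'])

# take() pulls elements from the start of the bag, so comparing them with the
# first lines of program.log is a quick informal check that order is preserved.
print(bag.take(5))
```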

Send SIGTERM to the running task, dask distributed

Submitted by 别等时光非礼了梦想. on 2019-12-12 01:13:13
Question: When I submit a small TensorFlow training job as a single task, it launches additional threads. When I press Ctrl+C and raise KeyboardInterrupt, my task is closed, but the underlying threads are not cleaned up and training continues. Initially I thought this was a TensorFlow problem (not cleaning up its threads), but after testing I understand the problem comes from the Dask side, which probably doesn't propagate the SIGTERM signal on to the task function. My question: how can I set Dask to
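The question is cut off above. As general background (not necessarily the asker's eventual solution), a common workaround for this class of problem is cooperative cancellation: the long-running work polls a flag instead of relying on a signal reaching worker threads, since Python delivers signals only to the main thread of a process. A minimal sketch with the training loop stubbed out:

```python
import threading
import time

stop_event = threading.Event()

def train(stop_event):
    """Stand-in for the TensorFlow training loop."""
    while not stop_event.is_set():
        time.sleep(0.1)          # one "training step"
    # clean up resources here before returning

worker_thread = threading.Thread(target=train, args=(stop_event,))
worker_thread.start()

try:
    while worker_thread.is_alive():
        worker_thread.join(timeout=0.5)   # short joins so Ctrl+C is handled promptly
except KeyboardInterrupt:
    stop_event.set()                      # ask the training thread to exit
    worker_thread.join()
```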

Parallel processing in pandas using Dask

Submitted by ∥☆過路亽.° on 2019-12-11 17:35:46
Question: I want to reduce the processing time in pandas. I tried shrinking the pandas memory footprint with the .cat method and tried multiprocessing, but there is no change in the run time.

import multiprocessing
import time
import pandas as pd

start = time.time()

def square(df1):
    df1['M_threading'] = df1['M_Invoice_type']

def multiply(df4):
    df4['M_threading'] = df4['M_Invoice_type']

if __name__ == '__main__':
    df = pd.read_excel("C:/Users/Admin/Desktop/schindler purchase Apr-19.xlsx")
    df1 = df.loc[df['M_Invoice_type'] ==
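The snippet is truncated. For reference only, a sketch of how the same column assignment could be expressed with dask.dataframe instead of hand-rolled multiprocessing, reusing the column names from the question and a hypothetical file path:

```python
import dask.dataframe as dd
import pandas as pd

df = pd.read_excel("purchases.xlsx")          # hypothetical path; pandas loads the file eagerly

ddf = dd.from_pandas(df, npartitions=4)       # split into partitions Dask can process in parallel
ddf['M_threading'] = ddf['M_Invoice_type']    # same column copy as in the question, done lazily

result = ddf.compute()                        # materialize back into a pandas DataFrame
```

For a trivial column copy like this the scheduling overhead usually outweighs any speedup; parallelism pays off when the per-partition work is substantial.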

How to capture logs from workers from a Dask-Yarn job?

Submitted by 走远了吗. on 2019-12-11 17:14:27
Question: I have tried using the following in ~/.config/dask/distributed.yaml and ~/.config/dask/yarn.yaml:

logging-file-config: "/path/to/config.ini"

or

logging:
  version: 1
  disable_existing_loggers: false
  root:
    level: INFO
    handlers: [consoleHandler]
  handlers:
    consoleHandler:
      class: logging.StreamHandler
      level: INFO
      formatter: sample_formatter
      stream: ext://sys.stderr
  formatters:
    sample_formatter:
      format: '%(asctime)s - %(name)s - %(levelname)s - %(message)s'

and then in my function that gets
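The question is cut off. For context, a minimal sketch of the pattern the asker appears to be heading toward: obtaining a logger inside the function that runs on the workers. Loggers under the distributed namespace are routed through Dask's logging configuration, so their messages end up in the worker logs rather than the client's output (the exact logger name used here is an assumption):

```python
import logging

def process_partition(part):
    # This runs on a worker, so the messages land in that worker's log stream,
    # not in the client process.
    logger = logging.getLogger("distributed.worker")
    logger.info("processing %d rows", len(part))
    return part
```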

Cross read parquet files between R and Python

Submitted by 自闭症网瘾萝莉.ら on 2019-12-11 17:06:23
Question: We have generated parquet files, one in Dask (Python) and another with R Drill (using the sergeant package). They use different implementations of Parquet (see my other parquet question). We are not able to cross-read the files (Python can't read the R file and vice versa). When reading the Python parquet file in the R environment we receive the following error: system error: IllegalStateException: UTF8 can only annotate binary fields. When reading the R/Drill parquet file in Dask we get
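The question is truncated. As a reference point only (not the asker's accepted fix), Parquet written from Dask through the pyarrow engine is generally the easiest to exchange with other tools; a minimal sketch:

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}),
    npartitions=1,
)

# pyarrow and fastparquet are separate Parquet implementations and have
# historically differed in how they annotate string (UTF8/binary) columns.
ddf.to_parquet("interop_parquet", engine="pyarrow")

# Read it back, again via pyarrow.
back = dd.read_parquet("interop_parquet", engine="pyarrow").compute()
print(back)
```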

Python Dask map_partitions

Submitted by 天涯浪子 on 2019-12-11 16:18:12
Question: Probably a continuation of this question, working from the Dask docs examples for map_partitions.

import pandas as pd
import dask.dataframe as dd
from random import randint

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)

def myadd(df):
    new_value = df.x + randint(1, 4)
    return new_value

res = ddf.map_partitions(lambda df: df.assign(z=myadd)).compute()
res

In the above code, randint is only being called once, not once per row as I would expect. How come?
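For context: df.assign(z=myadd) calls myadd once per partition and broadcasts whatever it returns, and randint(1, 4) produces a single scalar per call, so every row in a partition shares the same offset. A sketch of one way to get a distinct value per row, assuming NumPy is acceptable:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)

def myadd(part):
    # one random integer per row of this partition, not one per partition
    return part.x + np.random.randint(1, 4, size=len(part))

res = ddf.map_partitions(lambda part: part.assign(z=myadd)).compute()
print(res)
```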

Distributing graphs across cluster nodes

Submitted by 南楼画角 on 2019-12-11 13:44:56
Question: I'm making good progress with Dask.delayed. As a group, we've decided to put more time into working with graphs using Dask. I have a question about distribution. I'm seeing the following behaviour on our cluster. I start up, e.g., 8 workers on each of 8 nodes, each with 4 threads, say. I then client.compute 8 graphs to create the simulated data for subsequent processing. I want to have the 8 data sets generated one per node. However, what seems to happen, not unreasonably, is that the eight functions
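The question is truncated. One relevant, documented knob (whether or not it is what the asker ended up using) is the workers= argument accepted by Client.compute and Client.submit, which restricts where a computation may run. A sketch; the scheduler and worker addresses below are placeholders:

```python
from dask import delayed
from dask.distributed import Client

client = Client("111.111.11.11:8786")          # placeholder scheduler address

@delayed
def make_dataset(seed):
    return [seed] * 1000                       # stand-in for the simulation step

graphs = [make_dataset(i) for i in range(8)]
node_addresses = [f"tcp://10.0.0.{i}:40000" for i in range(1, 9)]  # placeholder workers

# Pin each graph to a specific worker so the eight data sets land one per node.
futures = [client.compute(g, workers=[addr])
           for g, addr in zip(graphs, node_addresses)]
```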

Dask Dataframe sum of column always returning scalar [duplicate]

Submitted by 一曲冷凌霜 on 2019-12-11 10:16:13
Question: This question already has an answer here: Converting Dask Scalar to integer value (or save it to text file) (1 answer). Closed last year.

I've created a Dask DataFrame (called "df") and the column with index 11 has integer values:

In [62]: df[11]
Out[62]:
Dask Series Structure:
npartitions=42
    int64
      ...
     ...
      ...
      ...
Name: 11, dtype: int64
Dask Name: getitem, 168 tasks

I'm trying to sum these with df[11].sum(). I get dd.Scalar<series-..., dtype=int64> returned. Despite researching what this
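The text is cut short, but the linked duplicate gives the gist: dd.Scalar is a lazy placeholder, and calling .compute() on it materializes the actual integer. A minimal sketch with a small synthetic frame (the column name 11 follows the question):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({11: [1, 2, 3, 4]})
df = dd.from_pandas(pdf, npartitions=2)

lazy_total = df[11].sum()      # dd.Scalar, nothing computed yet
total = lazy_total.compute()   # plain Python int: 10
print(total)
```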

Workaround for Item assignment not supported in dask

Submitted by 穿精又带淫゛_ on 2019-12-11 08:13:03
Question: I am trying to convert my code base from NumPy arrays to Dask because my NumPy arrays no longer fit in memory (MemoryError). However, I've since learned that mutable arrays are not yet implemented in Dask arrays, so I am getting NotImplementedError: Item assignment with <class 'tuple'> not supported. Is there any workaround for my code below?

for i, mask in enumerate(masks):
    bounds = find_boundaries(mask, mode='inner')
    X2, Y2 = np.nonzero(bounds)
    X2 = da.from_array(X2, 'auto')
    Y2 = da.from
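The snippet ends mid-line above. As general background (not a drop-in fix for the boundary-finding loop), item assignment such as x[i, j] = value was unsupported on Dask arrays at the time; the usual workarounds are to express the update functionally, e.g. with da.where, or to do the assignment on NumPy blocks via map_blocks. A minimal sketch of the da.where pattern:

```python
import dask.array as da

x = da.ones((1000, 1000), chunks=(250, 250))

# Instead of x[x > 0.5] = 0 (item assignment), build a new array lazily:
mask = x > 0.5
x_updated = da.where(mask, 0, x)

print(x_updated.sum().compute())   # 0.0, since every element matched the mask
```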