dask

Playing with GPUs Using Python

旧街凉风 submitted on 2020-05-04 09:32:31
Question: As machine learning drives an ever stronger demand for faster model computation, I have long wanted to do GPU programming, but for a long time that has been the preserve of C++. The thought of all the pitfalls in C++ saps my enthusiasm; with the cost of falling into and climbing out of those pits over and over, productivity takes a big hit.

Solution: The good news is that as the Python ecosystem keeps growing, GPU programming from Python has become more and more convenient. So which packages are available, and what can they do with the GPU? A few concrete code samples make it roughly clear.

First, pycuda. Here is one of its examples:

    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
        const int i = threadIdx.x;
        dest[i] = a[i] * b[i];
    }
    """)

From the code above we can see that pycuda wraps the C++ code that drives the GPU, so it can be used directly from Python.

Next, numba:

    @cuda.jit
    def increment_by_one(an_array):
        pos = cuda.grid(1)
        if pos < an_array.size:
            an_array[pos] += 1

We can see that numba goes a step further: a decorator makes the whole process of calling the GPU even more concise and convenient.

Now look at cupy: import numpy as np
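
The excerpt is cut off before the CuPy snippet, so as context only, here is a minimal sketch of typical CuPy usage (my own illustration, not the article's example), showing that CuPy mirrors the NumPy API on the GPU:

    # Illustrative sketch only, not the article's original CuPy example.
    # CuPy mirrors the NumPy API, with arrays living in GPU memory.
    import numpy as np
    import cupy as cp

    x_cpu = np.arange(10, dtype=np.float32)
    x_gpu = cp.asarray(x_cpu)        # copy the host array to the GPU
    y_gpu = cp.sqrt(x_gpu) * 2.0     # computed on the GPU
    y_cpu = cp.asnumpy(y_gpu)        # copy the result back to host memory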

Apply function along time dimension of XArray

放肆的年华 submitted on 2020-04-30 04:10:30
Question: I have an image stack stored in an XArray DataArray with dimensions time, x, y, on which I'd like to apply a custom function along the time axis of each pixel, such that the output is a single image of dimensions x, y. I have tried apply_ufunc, but the function fails, stating that I need to first load the data into RAM (i.e. it cannot use a Dask Array). Ideally, I'd like to keep the DataArray as Dask Arrays internally, as it isn't possible to load the entire stack into RAM. The exact error message is
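
For reference, a minimal sketch of one common way to do this (my own illustration, not an answer taken from the thread): declare time as a core dimension so apply_ufunc hands the custom function whole per-pixel time series while the data stays in Dask arrays. The dummy DataArray and the reduction pixel_func below are assumptions:

    # Sketch: reduce along 'time' with a custom function while keeping the data lazy.
    # The DataArray and 'pixel_func' are illustrative placeholders.
    import numpy as np
    import xarray as xr

    da = xr.DataArray(
        np.random.rand(20, 4, 4),
        dims=("time", "x", "y"),
    ).chunk({"time": -1, "x": 2, "y": 2})   # keep 'time' in a single chunk

    def pixel_func(arr, axis=-1):
        # custom per-pixel reduction over the time axis
        return np.nanmean(arr, axis=axis)

    result = xr.apply_ufunc(
        pixel_func,
        da,
        input_core_dims=[["time"]],   # dimension consumed by the function
        dask="parallelized",          # operate on the dask chunks lazily
        output_dtypes=[da.dtype],
        kwargs={"axis": -1},
    )
    image = result.compute()          # final x, y image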

Moving data from a database to Azure blob storage

懵懂的女人 submitted on 2020-04-18 05:41:12
Question: I'm able to use dask.dataframe.read_sql_table to read the data, e.g.

    df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N)

What would be the next (best) steps for saving it as a parquet file in Azure blob storage? From my small research there are a couple of options: save locally and use https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs?toc=/azure/storage/blobs/toc.json (not great for big data); I believe adlfs is to read from blob use dask
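
A sketch of the direct route via adlfs (my own illustration, not a confirmed answer from the thread): with adlfs installed, dask can write parquet straight to an abfs:// URL. The container name and credentials below are placeholders, and uri/N are as in the question:

    # Sketch: write the dask dataframe directly to Azure Blob Storage through adlfs.
    # 'mycontainer', 'myaccount', and 'mykey' are placeholders; uri and N as in the question.
    import dask.dataframe as dd

    df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N)
    df.to_parquet(
        'abfs://mycontainer/data/table.parquet',
        engine='pyarrow',
        storage_options={'account_name': 'myaccount', 'account_key': 'mykey'},
    )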

dask-kubernetes: Issue creating pod with uppercase username

拈花ヽ惹草 submitted on 2020-04-16 09:54:49
Question: I am learning dask-kubernetes on GKE. I stumbled across an asyncio error (ERROR:asyncio:Task exception was never retrieved). See the steps below for the issue. However, additional guidance on deploying dask-kubernetes against a remote Kubernetes cluster is appreciated (note: I used helm with a good experience here, but I want to try the native approach, as I can't scale the helm approach). Create the cluster:

    $ gcloud container clusters create --machine-type n1-standard-2 --num-nodes 2 --zone us
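
For context, a minimal sketch of the native (non-helm) approach the question is working toward, based on the dask_kubernetes API as of around 2020 (the worker image and resource figures are placeholders, not values from the question):

    # Sketch of the native dask-kubernetes approach (circa-2020 API); image and resources are placeholders.
    from dask.distributed import Client
    from dask_kubernetes import KubeCluster, make_pod_spec

    pod_spec = make_pod_spec(
        image='daskdev/dask:latest',
        memory_limit='2G', memory_request='2G',
        cpu_limit=1, cpu_request=1,
    )
    cluster = KubeCluster(pod_spec)   # default pod names include the local username, which relates to the uppercase-username issue in the title
    cluster.scale(2)                  # two worker pods
    client = Client(cluster)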

dask: difference between client.persist and client.compute

爱⌒轻易说出口 submitted on 2020-04-07 16:11:07
Question: I am confused about what the difference is between client.persist() and client.compute(); both seem (in some cases) to start my calculations, and both return asynchronous objects, however not in my simple example. In this example:

    from dask.distributed import Client
    from dask import delayed

    client = Client()

    def f(*args):
        return args

    result = [delayed(f)(x) for x in range(1000)]
    x1 = client.compute(result)
    x2 = client.persist(result)

Here x1 and x2 are different but in a less trivial calculation
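
A sketch extending the question's snippet to make the difference visible (my own illustration; the comments paraphrase how dask.distributed documents compute vs. persist):

    # Sketch: compare what compute() and persist() return for the question's list of delayed objects.
    from dask.distributed import Client
    from dask import delayed

    client = Client()

    def f(*args):
        return args

    result = [delayed(f)(x) for x in range(1000)]

    futures = client.compute(result)    # list of Future objects pointing at remote results
    persisted = client.persist(result)  # list of Delayed objects whose graphs now reference tasks running on the cluster

    print(type(futures[0]).__name__)    # Future
    print(type(persisted[0]).__name__)  # Delayed
    print(client.gather(futures)[:3])   # [(0,), (1,), (2,)]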

How to parallelize this for loop (or make it faster) using pandas or dask

别等时光非礼了梦想. submitted on 2020-03-23 23:56:18
Question: I want to make this loop significantly faster. It is calculating the move in a row for each feature. The function here is only applied to one column; later, I loop through each feature (df.columns) and apply this function.

    def move_iar(df, feature):
        lst = []
        prev_move_iar = 0
        for move in df[feature]:
            if np.isnan(move):
                move_iar = 0
                lst.append(move_iar)
                prev_move_iar = move_iar
            else:
                if move == 0:
                    move_iar = prev_move_iar
                    lst.append(move_iar)
                    prev_move_iar = move_iar
                elif (move >= 0 and
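
The excerpt cuts the function off mid-branch, but for the outer loop over df.columns, here is a sketch of one way to fan the per-column work out in parallel with dask.delayed (my own illustration, assuming move_iar is defined as in the question and returns the list it builds):

    # Sketch: run the per-column computation in parallel with dask.delayed.
    # Assumes move_iar(df, feature) is defined as in the question and returns its list.
    import pandas as pd
    import dask
    from dask import delayed

    def move_iar_all_columns(df):
        tasks = [delayed(move_iar)(df, feature) for feature in df.columns]
        results = dask.compute(*tasks)      # one result list per column
        return pd.DataFrame(dict(zip(df.columns, results)), index=df.index)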

Dask Memory Error when running df.to_csv()

假装没事ソ submitted on 2020-03-20 06:27:16
Question: I am trying to index and save large csvs that cannot be loaded into memory. My code to load the csv, perform a computation, and index by the new values works without issue. A simplified version is:

    cluster = LocalCluster(n_workers=6, threads_per_worker=1)
    client = Client(cluster, memory_limit='1GB')
    df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e7)
    df['new_col'] = df.map_partitions(lambda x: some_function(x))
    df = df.set_index(df.new_col, sorted=False)

However, when I use large
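
Not the thread's accepted answer, just a sketch of one commonly suggested mitigation: write one CSV per partition instead of asking to_csv to assemble a single file, so no worker has to hold the whole result at once. The output pattern is a placeholder:

    # Sketch: continuing from the question's df, write one CSV file per partition.
    # 'out-*.csv' is a placeholder pattern; '*' is replaced by the partition number.
    df.to_csv('out-*.csv', single_file=False)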
