dask

Playing with GPUs Using Python

旧街凉风 submitted on 2020-05-04 09:32:31
Question: As machine learning drives an ever stronger demand for faster model computation, I have long wanted to do GPU programming, but for a long time that has been the preserve of C++. The thought of all the pitfalls in C++ saps my enthusiasm; with the cost of falling into and climbing out of those pits over and over, productivity takes a big hit.

Solution: The good news is that as the Python ecosystem keeps growing, GPU programming from Python has become more and more convenient. So which packages are available, and what can they do with the GPU? A few concrete code samples make it roughly clear.

First, pycuda. Here is one of its examples:

    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
        const int i = threadIdx.x;
        dest[i] = a[i] * b[i];
    }
    """)

From the code above we can see that pycuda wraps the C++ code that drives the GPU, so it can be used directly from Python.

Next, numba:

    @cuda.jit
    def increment_by_one(an_array):
        pos = cuda.grid(1)
        if pos < an_array.size:
            an_array[pos] += 1

We can see that numba goes a step further: a decorator makes the whole process of calling the GPU even more concise and convenient.

Now look at cupy: import numpy as np
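
The excerpt is cut off before the CuPy snippet, so as context only, here is a minimal sketch of typical CuPy usage (my own illustration, not the article's example), showing that CuPy mirrors the NumPy API on the GPU:

    # Illustrative sketch only, not the article's original CuPy example.
    # CuPy mirrors the NumPy API, with arrays living in GPU memory.
    import numpy as np
    import cupy as cp

    x_cpu = np.arange(10, dtype=np.float32)
    x_gpu = cp.asarray(x_cpu)        # copy the host array to the GPU
    y_gpu = cp.sqrt(x_gpu) * 2.0     # computed on the GPU
    y_cpu = cp.asnumpy(y_gpu)        # copy the result back to host memory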

Apply function along time dimension of XArray

放肆的年华 submitted on 2020-04-30 04:10:30
Question: I have an image stack stored in an XArray DataArray with dimensions time, x, y, on which I'd like to apply a custom function along the time axis of each pixel, such that the output is a single image of dimensions x, y. I have tried apply_ufunc, but the function fails, stating that I need to first load the data into RAM (i.e. it cannot use a Dask Array). Ideally, I'd like to keep the DataArray as Dask Arrays internally, as it isn't possible to load the entire stack into RAM. The exact error message is
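
For reference, a minimal sketch of one common way to do this (my own illustration, not an answer taken from the thread): declare time as a core dimension so apply_ufunc hands the custom function whole per-pixel time series while the data stays in Dask arrays. The dummy DataArray and the reduction pixel_func below are assumptions:

    # Sketch: reduce along 'time' with a custom function while keeping the data lazy.
    # The DataArray and 'pixel_func' are illustrative placeholders.
    import numpy as np
    import xarray as xr

    da = xr.DataArray(
        np.random.rand(20, 4, 4),
        dims=("time", "x", "y"),
    ).chunk({"time": -1, "x": 2, "y": 2})   # keep 'time' in a single chunk

    def pixel_func(arr, axis=-1):
        # custom per-pixel reduction over the time axis
        return np.nanmean(arr, axis=axis)

    result = xr.apply_ufunc(
        pixel_func,
        da,
        input_core_dims=[["time"]],   # dimension consumed by the function
        dask="parallelized",          # operate on the dask chunks lazily
        output_dtypes=[da.dtype],
        kwargs={"axis": -1},
    )
    image = result.compute()          # final x, y image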

Moving data from a database to Azure blob storage

懵懂的女人 submitted on 2020-04-18 05:41:12
Question: I'm able to use dask.dataframe.read_sql_table to read the data, e.g.

    df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N)

What would be the next (best) steps for saving it as a parquet file in Azure blob storage? From my small research there are a couple of options: save locally and use https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs?toc=/azure/storage/blobs/toc.json (not great for big data); I believe adlfs is to read from blob use dask
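
A sketch of the direct route via adlfs (my own illustration, not a confirmed answer from the thread): with adlfs installed, dask can write parquet straight to an abfs:// URL. The container name and credentials below are placeholders, and uri/N are as in the question:

    # Sketch: write the dask dataframe directly to Azure Blob Storage through adlfs.
    # 'mycontainer', 'myaccount', and 'mykey' are placeholders; uri and N as in the question.
    import dask.dataframe as dd

    df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N)
    df.to_parquet(
        'abfs://mycontainer/data/table.parquet',
        engine='pyarrow',
        storage_options={'account_name': 'myaccount', 'account_key': 'mykey'},
    )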

dask-kubernetes: Issue creating pod with uppercase username

拈花ヽ惹草 submitted on 2020-04-16 09:54:49
Question: I am learning dask-kubernetes on GKE. I stumbled across an asyncio error (ERROR:asyncio:Task exception was never retrieved). See the steps below for the issue. However, additional guidance on deploying dask-kubernetes against a remote Kubernetes cluster is appreciated (note: I used helm with a good experience here, but I want to try the native approach, as I can't scale the helm approach). Create the cluster:

    $ gcloud container clusters create --machine-type n1-standard-2 --num-nodes 2 --zone us
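
For context, a minimal sketch of the native (non-helm) approach the question is working toward, based on the dask_kubernetes API as of around 2020 (the worker image and resource figures are placeholders, not values from the question):

    # Sketch of the native dask-kubernetes approach (circa-2020 API); image and resources are placeholders.
    from dask.distributed import Client
    from dask_kubernetes import KubeCluster, make_pod_spec

    pod_spec = make_pod_spec(
        image='daskdev/dask:latest',
        memory_limit='2G', memory_request='2G',
        cpu_limit=1, cpu_request=1,
    )
    cluster = KubeCluster(pod_spec)   # default pod names include the local username, which relates to the uppercase-username issue in the title
    cluster.scale(2)                  # two worker pods
    client = Client(cluster)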

dask: difference between client.persist and client.compute

爱⌒轻易说出口 submitted on 2020-04-07 16:11:07
Question: I am confused about what the difference is between client.persist() and client.compute(); both seem (in some cases) to start my calculations, and both return asynchronous objects, however not in my simple example. In this example:

    from dask.distributed import Client
    from dask import delayed

    client = Client()

    def f(*args):
        return args

    result = [delayed(f)(x) for x in range(1000)]
    x1 = client.compute(result)
    x2 = client.persist(result)

Here x1 and x2 are different but in a less trivial calculation
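
A sketch extending the question's snippet to make the difference visible (my own illustration; the comments paraphrase how dask.distributed documents compute vs. persist):

    # Sketch: compare what compute() and persist() return for the question's list of delayed objects.
    from dask.distributed import Client
    from dask import delayed

    client = Client()

    def f(*args):
        return args

    result = [delayed(f)(x) for x in range(1000)]

    futures = client.compute(result)    # list of Future objects pointing at remote results
    persisted = client.persist(result)  # list of Delayed objects whose graphs now reference tasks running on the cluster

    print(type(futures[0]).__name__)    # Future
    print(type(persisted[0]).__name__)  # Delayed
    print(client.gather(futures)[:3])   # [(0,), (1,), (2,)]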

How to parallelize this for loop (or make it faster) using pandas or dask

别等时光非礼了梦想. submitted on 2020-03-23 23:56:18
Question: I want to make this loop significantly faster. It is calculating the move in a row for each feature. The function here is only applied to one column; later, I loop through each feature (df.columns) and apply this function.

    def move_iar(df, feature):
        lst = []
        prev_move_iar = 0
        for move in df[feature]:
            if np.isnan(move):
                move_iar = 0
                lst.append(move_iar)
                prev_move_iar = move_iar
            else:
                if move == 0:
                    move_iar = prev_move_iar
                    lst.append(move_iar)
                    prev_move_iar = move_iar
                elif (move >= 0 and
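
The excerpt cuts the function off mid-branch, but for the outer loop over df.columns, here is a sketch of one way to fan the per-column work out in parallel with dask.delayed (my own illustration, assuming move_iar is defined as in the question and returns the list it builds):

    # Sketch: run the per-column computation in parallel with dask.delayed.
    # Assumes move_iar(df, feature) is defined as in the question and returns its list.
    import pandas as pd
    import dask
    from dask import delayed

    def move_iar_all_columns(df):
        tasks = [delayed(move_iar)(df, feature) for feature in df.columns]
        results = dask.compute(*tasks)      # one result list per column
        return pd.DataFrame(dict(zip(df.columns, results)), index=df.index)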

Dask Memory Error when running df.to_csv()

假装没事ソ submitted on 2020-03-20 06:27:16
Question: I am trying to index and save large csvs that cannot be loaded into memory. My code to load the csv, perform a computation, and index by the new values works without issue. A simplified version is:

    cluster = LocalCluster(n_workers=6, threads_per_worker=1)
    client = Client(cluster, memory_limit='1GB')
    df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e7)
    df['new_col'] = df.map_partitions(lambda x: some_function(x))
    df = df.set_index(df.new_col, sorted=False)

However, when I use large
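
Not the thread's accepted answer, just a sketch of one commonly suggested mitigation: write one CSV per partition instead of asking to_csv to assemble a single file, so no worker has to hold the whole result at once. The output pattern is a placeholder:

    # Sketch: continuing from the question's df, write one CSV file per partition.
    # 'out-*.csv' is a placeholder pattern; '*' is replaced by the partition number.
    df.to_csv('out-*.csv', single_file=False)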
