dask

dask dataframe head() returns empty df

点点圈 submitted on 2020-07-21 04:38:27
Question: I have a dask dataframe with an index on one of the columns. The issue is that if I do a df.head() it always returns an empty df, whereas df.tail() always returns the correct df. I checked: df.head() always looks at the first n entries of the first partition. So if I do df.reset_index() it should work, but that's not the case. Below is the code to reproduce this:

import dask.dataframe as dd
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'i64': np.arange(1000, dtype=np.int64),
    'Ii32': np.arange(1000, dtype
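For reference, a minimal sketch of the documented workaround, assuming the empty result comes from head() reading only the first partition (the frame below is illustrative, not the exact data from the question): passing npartitions=-1 lets head() scan as many partitions as it needs.

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Illustrative frame; any partitioned dask dataframe with an index will do.
pdf = pd.DataFrame({'i64': np.arange(1000, dtype=np.int64),
                    'f64': np.random.rand(1000)})
df = dd.from_pandas(pdf, npartitions=4).set_index('f64')

print(df.head())                   # by default reads only the first partition
print(df.head(5, npartitions=-1))  # scans further partitions until 5 rows are found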

dask set_index from large unordered csv file

别来无恙 submitted on 2020-07-18 18:40:27
Question: At the risk of being a bit off-topic, I want to show a simple solution for loading large CSV files into a dask dataframe so that the option sorted=True can be applied, saving a significant amount of processing time. I found doing set_index within dask unworkable for the size of the toy cluster I am using for learning and for the size of the files (33 GB). So if your problem is loading large unsorted CSV files (multiple tens of gigabytes) into a dask dataframe and quickly start performing
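A hedged sketch of the approach being described, assuming the CSV data has already been sorted on the index column and written back out in order (the file pattern and column name below are placeholders, not taken from the question):

import dask.dataframe as dd

# The files matching this pattern are assumed to be globally ordered on 'key'.
df = dd.read_csv('sorted_part_*.csv')

# sorted=True tells dask the data is already ordered on the index column,
# so no shuffle is performed when building the divisions.
df = df.set_index('key', sorted=True)

# Persisting to parquet keeps the index and divisions for later fast lookups.
df.to_parquet('indexed_data')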

Can I use dask.delayed on a function wrapped with ctypes?

最后都变了- submitted on 2020-07-07 11:45:45
Question: The goal is to use dask.delayed to parallelize some 'embarrassingly parallel' sections of my code. The code involves calling a Python function which wraps a C function using ctypes. To understand the errors I was getting, I wrote a very basic example. The C function:

double zippy_sum(double x, double y) { return x + y; }

The Python:

from dask.distributed import Client
client = Client(n_workers=4)
client

import os
import dask
import ctypes

current_dir = os.getcwd()  # os.path.abspath(os.path
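A hedged sketch of one common fix, assuming the usual cause (a ctypes handle created in the main process cannot be pickled and shipped to workers): open the shared library inside the delayed function itself. The library filename zippy.so and the argument/return types are assumptions, not taken from the truncated excerpt.

import os
import ctypes
import dask
from dask.distributed import Client

@dask.delayed
def zippy_sum_task(x, y):
    # Load the library inside the task so nothing un-picklable crosses processes.
    lib = ctypes.CDLL(os.path.join(os.getcwd(), "zippy.so"))
    lib.zippy_sum.restype = ctypes.c_double
    lib.zippy_sum.argtypes = [ctypes.c_double, ctypes.c_double]
    return lib.zippy_sum(x, y)

if __name__ == "__main__":
    client = Client(n_workers=4)
    results = dask.compute(*[zippy_sum_task(i, i + 1.0) for i in range(8)])
    print(results)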

Dask dataframes: reading multiple files & storing filename in column

我是研究僧i submitted on 2020-07-05 11:23:26
Question: I regularly use dask.dataframe to read multiple files, like so:

import dask.dataframe as dd
df = dd.read_csv('*.csv')

However, the origin of each row, i.e. which file the data was read from, seems to be forever lost. Is there a way to add this as a column, e.g. df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file and contains 100 rows? This would be applied to each "partition" / file that is read into the dataframe when compute is triggered as part of a workflow. The idea is
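A hedged sketch of one way to keep the origin of each row, assuming each file fits in memory when read on its own: read every file with pandas inside a delayed task, tag it with its path, and stitch the pieces together. The column name source_file is an arbitrary choice.

from glob import glob
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def read_and_label(path):
    pdf = pd.read_csv(path)
    pdf['source_file'] = path  # record which file this partition came from
    return pdf

files = sorted(glob('*.csv'))
df = dd.from_delayed([read_and_label(f) for f in files])

Depending on the dask version, dd.read_csv may also accept include_path_column=True, which adds the originating path as a column directly.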

Using Dask from script

别等时光非礼了梦想. submitted on 2020-06-29 05:12:16
Question: Is it possible to run dask from a Python script? In an interactive session I can just write

from dask.distributed import Client
client = Client()

as described in all tutorials. If I put these lines in a script.py file, however, and execute it with python script.py, it immediately crashes. Another option I found is to use MPI:

# script.py
from dask_mpi import initialize
initialize()

from dask.distributed import Client
client = Client()  # Connect this local process to remote workers

And then
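A hedged sketch of the usual fix for the crash, assuming it is the common process-spawning issue: Client() starts worker processes, so the script has to be import-safe and create the cluster behind an if __name__ == '__main__': guard.

# script.py
from dask.distributed import Client

def main():
    client = Client()  # starts a local cluster, as in an interactive session
    total = client.submit(sum, [1, 2, 3]).result()
    print(total)

if __name__ == '__main__':
    main()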

How to pass data bigger than the VRAM size into the GPU?

一世执手 submitted on 2020-06-26 15:53:31
Question: I am trying to pass more data into my GPU than I have VRAM, which results in the following error:

CudaAPIError: Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

I created this code to recreate the problem:

from numba import cuda
import numpy as np

@cuda.jit()
def addingNumbers(big_array, big_array2, save_array):
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range(big_array.shape[1]):
            save_array[i][j] = big_array[i][j] * big_array2[i][j]

big_array = np.random.random_sample(
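A hedged sketch of one way around the allocation failure, assuming the arrays can be processed block of rows by block of rows: copy only a slice of the data to the device at a time, run the kernel on that slice, and copy the result back before moving on. The chunk size is an arbitrary placeholder.

from numba import cuda
import numpy as np

@cuda.jit
def multiply_rows(a, b, out):
    i = cuda.grid(1)
    if i < a.shape[0]:
        for j in range(a.shape[1]):
            out[i, j] = a[i, j] * b[i, j]

def process_in_chunks(big_a, big_b, rows_per_chunk=100_000):
    result = np.empty_like(big_a)
    threads = 256
    for start in range(0, big_a.shape[0], rows_per_chunk):
        stop = min(start + rows_per_chunk, big_a.shape[0])
        d_a = cuda.to_device(big_a[start:stop])    # only this slice lives in VRAM
        d_b = cuda.to_device(big_b[start:stop])
        d_out = cuda.device_array_like(big_a[start:stop])
        blocks = (stop - start + threads - 1) // threads
        multiply_rows[blocks, threads](d_a, d_b, d_out)
        result[start:stop] = d_out.copy_to_host()  # frees the slice for the next round
    return result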

Force dask to_parquet to write single file

别等时光非礼了梦想. submitted on 2020-06-17 10:02:19
Question: When using dask.to_parquet(df, filename), a subfolder filename is created and several files are written to that folder, whereas pandas.to_parquet(df, filename) writes exactly one file. Can I use dask's to_parquet (without using compute() to create a pandas df) to just write a single file?

Answer 1: Writing to a single file is very hard within a parallel system. Sorry, such an option is not offered by Dask (nor probably any other parallel processing library). You could in theory perform the
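A hedged sketch of two commonly used workarounds (not necessarily the one the truncated answer goes on to describe), assuming the result is small enough where noted; file names are placeholders.

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'a': range(100)}), npartitions=4)

# Option 1: one partition, so the output directory contains a single part file.
df.repartition(npartitions=1).to_parquet('out_dir')

# Option 2: materialise to pandas (only if it fits in memory) and write one file.
df.compute().to_parquet('single_file.parquet')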