dask

dask dataframe head() returns empty df

点点圈 submitted on 2020-07-21 04:38:27
Question: I have a dask dataframe with an index on one of the columns. The issue is that if I do a df.head() it always returns an empty df, whereas df.tail() always returns the correct df. I checked: df.head() always looks at the first n entries of the first partition. So if I do df.reset_index() it should work, but that's not the case. Below is the code to reproduce this:

import dask.dataframe as dd
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'i64': np.arange(1000, dtype=np.int64),
    'Ii32': np.arange(1000, dtype
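For reference, a minimal sketch of the documented workaround, assuming the empty result comes from head() reading only the first partition (the frame below is illustrative, not the exact data from the question): passing npartitions=-1 lets head() scan as many partitions as it needs.

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Illustrative frame; any partitioned dask dataframe with an index will do.
pdf = pd.DataFrame({'i64': np.arange(1000, dtype=np.int64),
                    'f64': np.random.rand(1000)})
df = dd.from_pandas(pdf, npartitions=4).set_index('f64')

print(df.head())                   # by default reads only the first partition
print(df.head(5, npartitions=-1))  # scans further partitions until 5 rows are found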

dask set_index from large unordered csv file

别来无恙 submitted on 2020-07-18 18:40:27
Question: At the risk of being a bit off-topic, I want to show a simple solution for loading large CSV files into a dask dataframe so that the option sorted=True can be applied, saving a significant amount of processing time. I found doing set_index within dask unworkable for the size of the toy cluster I am using for learning and for the size of the files (33 GB). So if your problem is loading large unsorted CSV files (multiple tens of gigabytes) into a dask dataframe and quickly start performing
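A hedged sketch of the approach being described, assuming the CSV data has already been sorted on the index column and written back out in order (the file pattern and column name below are placeholders, not taken from the question):

import dask.dataframe as dd

# The files matching this pattern are assumed to be globally ordered on 'key'.
df = dd.read_csv('sorted_part_*.csv')

# sorted=True tells dask the data is already ordered on the index column,
# so no shuffle is performed when building the divisions.
df = df.set_index('key', sorted=True)

# Persisting to parquet keeps the index and divisions for later fast lookups.
df.to_parquet('indexed_data')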

Can I use dask.delayed on a function wrapped with ctypes?

最后都变了- submitted on 2020-07-07 11:45:45
Question: The goal is to use dask.delayed to parallelize some 'embarrassingly parallel' sections of my code. The code involves calling a Python function which wraps a C function using ctypes. To understand the errors I was getting, I wrote a very basic example. The C function:

double zippy_sum(double x, double y) { return x + y; }

The Python:

from dask.distributed import Client
client = Client(n_workers=4)
client

import os
import dask
import ctypes

current_dir = os.getcwd()  # os.path.abspath(os.path
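A hedged sketch of one common fix, assuming the usual cause (a ctypes handle created in the main process cannot be pickled and shipped to workers): open the shared library inside the delayed function itself. The library filename zippy.so and the argument/return types are assumptions, not taken from the truncated excerpt.

import os
import ctypes
import dask
from dask.distributed import Client

@dask.delayed
def zippy_sum_task(x, y):
    # Load the library inside the task so nothing un-picklable crosses processes.
    lib = ctypes.CDLL(os.path.join(os.getcwd(), "zippy.so"))
    lib.zippy_sum.restype = ctypes.c_double
    lib.zippy_sum.argtypes = [ctypes.c_double, ctypes.c_double]
    return lib.zippy_sum(x, y)

if __name__ == "__main__":
    client = Client(n_workers=4)
    results = dask.compute(*[zippy_sum_task(i, i + 1.0) for i in range(8)])
    print(results)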

Dask dataframes: reading multiple files & storing filename in column

我是研究僧i submitted on 2020-07-05 11:23:26
Question: I regularly use dask.dataframe to read multiple files, like so:

import dask.dataframe as dd
df = dd.read_csv('*.csv')

However, the origin of each row, i.e. which file the data was read from, seems to be forever lost. Is there a way to add this as a column, e.g. df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file and contains 100 rows? This would be applied to each "partition" / file that is read into the dataframe when compute is triggered as part of a workflow. The idea is
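A hedged sketch of one way to keep the origin of each row, assuming each file fits in memory when read on its own: read every file with pandas inside a delayed task, tag it with its path, and stitch the pieces together. The column name source_file is an arbitrary choice.

from glob import glob
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def read_and_label(path):
    pdf = pd.read_csv(path)
    pdf['source_file'] = path  # record which file this partition came from
    return pdf

files = sorted(glob('*.csv'))
df = dd.from_delayed([read_and_label(f) for f in files])

Depending on the dask version, dd.read_csv may also accept include_path_column=True, which adds the originating path as a column directly.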

Using Dask from script

别等时光非礼了梦想. submitted on 2020-06-29 05:12:16
Question: Is it possible to run dask from a Python script? In an interactive session I can just write

from dask.distributed import Client
client = Client()

as described in all tutorials. If I put these lines in a script.py file, however, and execute it with python script.py, it immediately crashes. Another option I found is to use MPI:

# script.py
from dask_mpi import initialize
initialize()

from dask.distributed import Client
client = Client()  # Connect this local process to remote workers

And then
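A hedged sketch of the usual fix for the crash, assuming it is the common process-spawning issue: Client() starts worker processes, so the script has to be import-safe and create the cluster behind an if __name__ == '__main__': guard.

# script.py
from dask.distributed import Client

def main():
    client = Client()  # starts a local cluster, as in an interactive session
    total = client.submit(sum, [1, 2, 3]).result()
    print(total)

if __name__ == '__main__':
    main()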

How to pass data bigger than the VRAM size into the GPU?

一世执手 submitted on 2020-06-26 15:53:31
Question: I am trying to pass more data into my GPU than I have VRAM, which results in the following error:

CudaAPIError: Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

I created this code to recreate the problem:

from numba import cuda
import numpy as np

@cuda.jit()
def addingNumbers(big_array, big_array2, save_array):
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range(big_array.shape[1]):
            save_array[i][j] = big_array[i][j] * big_array2[i][j]

big_array = np.random.random_sample(
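A hedged sketch of one way around the allocation failure, assuming the arrays can be processed block of rows by block of rows: copy only a slice of the data to the device at a time, run the kernel on that slice, and copy the result back before moving on. The chunk size is an arbitrary placeholder.

from numba import cuda
import numpy as np

@cuda.jit
def multiply_rows(a, b, out):
    i = cuda.grid(1)
    if i < a.shape[0]:
        for j in range(a.shape[1]):
            out[i, j] = a[i, j] * b[i, j]

def process_in_chunks(big_a, big_b, rows_per_chunk=100_000):
    result = np.empty_like(big_a)
    threads = 256
    for start in range(0, big_a.shape[0], rows_per_chunk):
        stop = min(start + rows_per_chunk, big_a.shape[0])
        d_a = cuda.to_device(big_a[start:stop])    # only this slice lives in VRAM
        d_b = cuda.to_device(big_b[start:stop])
        d_out = cuda.device_array_like(big_a[start:stop])
        blocks = (stop - start + threads - 1) // threads
        multiply_rows[blocks, threads](d_a, d_b, d_out)
        result[start:stop] = d_out.copy_to_host()  # frees the slice for the next round
    return result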

Force dask to_parquet to write single file

别等时光非礼了梦想. submitted on 2020-06-17 10:02:19
Question: When using dask.to_parquet(df, filename), a subfolder filename is created and several files are written to that folder, whereas pandas.to_parquet(df, filename) writes exactly one file. Can I use dask's to_parquet (without using compute() to create a pandas df) to just write a single file?

Answer 1: Writing to a single file is very hard within a parallel system. Sorry, such an option is not offered by Dask (nor probably any other parallel processing library). You could in theory perform the
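A hedged sketch of two commonly used workarounds (not necessarily the one the truncated answer goes on to describe), assuming the result is small enough where noted; file names are placeholders.

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'a': range(100)}), npartitions=4)

# Option 1: one partition, so the output directory contains a single part file.
df.repartition(npartitions=1).to_parquet('out_dir')

# Option 2: materialise to pandas (only if it fits in memory) and write one file.
df.compute().to_parquet('single_file.parquet')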