dask

Slicing a Dask Dataframe

时间秒杀一切 Submitted on 2019-12-09 13:05:03
Question: I have the following code, where I would like to do a train/test split on a Dask dataframe:

df = dd.read_csv(csv_filename, sep=',', encoding="latin-1", names=cols, header=0, dtype='str')

But when I try to take slices like

for train, test in cv.split(X, y):
    df.fit(X[train], y[train])

it fails with the error KeyError: '[11639 11641 11642 ..., 34997 34998 34999] not in index'. Any ideas?

Answer 1: Dask.dataframe doesn't support row-wise slicing. It does support the loc operation if you have a sensible index.
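A minimal sketch of what loc-based selection looks like once the dataframe has a meaningful index; the column and file names here are hypothetical, not from the question:

```python
import dask.dataframe as dd

# Hypothetical CSVs with an 'id' column that can serve as an index.
df = dd.read_csv("data-*.csv", dtype="str")
df = df.set_index("id")  # establishes divisions so .loc can locate rows

# Label-based slicing works once the index and its divisions are known;
# positional row-number slicing (as in the question) is not supported.
subset = df.loc["a000":"a999"]
print(subset.compute().head())
```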

ValueError: Not all divisions are known, can't align partitions error on dask dataframe

自作多情 Submitted on 2019-12-08 21:45:48
Question: I have a pandas dataframe with the columns user_id, user_agent_id, and requests. All columns contain integers. I want to perform some operations on them and run them using a Dask dataframe. This is what I do:

user_profile = cache_records_dataframe[['user_id', 'user_agent_id', 'requests']] \
    .groupby(['user_id', 'user_agent_id']) \
    .size().to_frame(name='appearances') \
    .reset_index()  # I am not sure I can run this on dask dataframe

user_profile_ddf = df.from_pandas(user_profile,
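The excerpt is cut off, but here is a minimal sketch of the usual pattern, under the assumption that the goal is simply to hand the aggregated pandas frame to Dask with known divisions (the sample data is made up):

```python
import pandas as pd
import dask.dataframe as dd

# Made-up stand-in for cache_records_dataframe.
cache_records_dataframe = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "user_agent_id": [10, 10, 20, 20, 30],
    "requests": [5, 3, 7, 1, 2],
})

user_profile = (
    cache_records_dataframe[["user_id", "user_agent_id", "requests"]]
    .groupby(["user_id", "user_agent_id"])
    .size()
    .to_frame(name="appearances")
    .reset_index()
)

# from_pandas on a plain RangeIndex produces known divisions, so later
# operations that align partitions do not hit the
# "Not all divisions are known" error.
user_profile_ddf = dd.from_pandas(user_profile, npartitions=4)
print(user_profile_ddf.divisions)
```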

Is there a way to get the nlargest items per group in dask?

橙三吉。 Submitted on 2019-12-08 18:30:18
Question: I have the following dataset:

location  category  percent
A         5         100.0
B         3         100.0
C         2          50.0
          4          13.0
D         2          75.0
          3          59.0
          4          13.0
          5           4.0

I'm trying to get the nlargest items of category in the dataframe grouped by location, i.e. if I want the top 2 largest percentages for each group, the output should be:

location  category  percent
A         5         100.0
B         3         100.0
C         2          50.0
          4          13.0
D         2          75.0
          3          59.0

It looks like in pandas this is relatively straightforward using pandas.core.groupby.SeriesGroupBy.nlargest, but dask doesn't
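One commonly used workaround, sketched here rather than taken from the (truncated) excerpt, is to run the pandas logic inside a groupby-apply:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "location": ["A", "B", "C", "C", "D", "D", "D", "D"],
    "category": [5, 3, 2, 4, 2, 3, 4, 5],
    "percent": [100.0, 100.0, 50.0, 13.0, 75.0, 59.0, 13.0, 4.0],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# Each group is handed to pandas, where nlargest is available; `meta`
# describes the columns and dtypes of the per-group result.
top2 = ddf.groupby("location").apply(
    lambda g: g.nlargest(2, "percent"),
    meta=pdf.head(0),
)
print(top2.compute())
```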

Shuffling data in dask

做~自己de王妃 Submitted on 2019-12-08 17:53:18
Question: This is a follow-on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to a ML algorithm. The answer in that question was to do the following:

for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()

However, even if I were to shuffle the contents of batch, I'm a bit worried that it might not be ideal. The data is a time series set, so datapoints would be highly correlated within each partition. What I would
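A sketch of one common recipe for a cluster-wide shuffle (an assumption about what is wanted, not the accepted answer): assign a random key per row and set_index on it, which forces Dask to redistribute rows across partitions:

```python
import numpy as np
import dask.dataframe as dd

df = dd.read_csv("timeseries-*.csv")  # hypothetical time-ordered input

# Give every row a random key, then set_index on that key. set_index performs
# a full shuffle, so each new partition mixes rows from across the series.
df = df.map_partitions(lambda pdf: pdf.assign(_rand=np.random.random(len(pdf))))
shuffled = df.set_index("_rand")

for part in shuffled.repartition(npartitions=100).to_delayed():
    batch = part.compute()  # rows in each batch are now decorrelated in time
```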

What do KilledWorker exceptions mean in Dask?

柔情痞子 Submitted on 2019-12-08 17:34:32
Question: My tasks are returning with KilledWorker exceptions when using Dask with the dask.distributed scheduler. What do these errors mean?

Answer 1: This error is generated when the Dask scheduler no longer trusts your task, because it was present too often when workers died unexpectedly. It is designed to protect the cluster against tasks that kill workers, for example through segfaults or memory errors. Whenever a worker dies unexpectedly, the scheduler notes which tasks were running on that worker when it
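If the tasks are believed to be innocent bystanders, the tolerance can be raised; a minimal sketch, assuming the distributed scheduler's allowed-failures setting is the relevant knob:

```python
import dask
from dask.distributed import Client

# Allow a task to be present on a dying worker more times before the
# scheduler marks it as suspicious and raises KilledWorker.
dask.config.set({"distributed.scheduler.allowed-failures": 10})

client = Client()  # local cluster, for illustration only
```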

Unpacking result of delayed function

早过忘川 Submitted on 2019-12-08 17:15:13
Question: While converting my program to use delayed, I stumbled upon a commonly used programming pattern that doesn't work with delayed. Example:

from dask import delayed

@delayed
def myFunction():
    return 1, 2

a, b = myFunction()
a.compute()

Raises: TypeError: Delayed objects of unspecified length are not iterable

The following workaround does work, but looks a lot clumsier:

from dask import delayed

@delayed
def myFunction():
    return 1, 2

dummy = myFunction()
a, b = dummy[0], dummy[1]
a.compute()
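A sketch of the nout option on delayed, which tells Dask how many outputs the function returns and makes the result unpackable directly:

```python
from dask import delayed

@delayed(nout=2)
def my_function():
    return 1, 2

a, b = my_function()            # works: the length of the result is known
print(a.compute(), b.compute())
```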

multiplication of large arrays in python

血红的双手。 Submitted on 2019-12-08 15:54:30
I have big arrays to multiply, over a large number of iterations as well. I am training a model with arrays around 1500 elements long, and I will perform 3 multiplications about 1,000,000 times, which takes a long time, almost a week. I found Dask and tried to compare it with the normal numpy way, but I found numpy faster:

x = np.arange(2000)

start = time.time()
y = da.from_array(x, chunks=(100))
for i in range(0, 100):
    p = y.dot(y)
    # print(p)
print(time.time() - start)
print('------------------------------')

start = time.time()
p = 0
for i in range(0, 100):
    p = np.dot(x, x)
print(time.time() - start)

0
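A side note, not part of the original excerpt: dask arrays are lazy, so y.dot(y) above only builds a task graph; a timing that includes .compute() is closer to apples-to-apples (Dask still carries scheduling overhead on arrays this small):

```python
import time
import numpy as np
import dask.array as da

x = np.arange(2000)
y = da.from_array(x, chunks=100)

start = time.time()
for _ in range(100):
    p = y.dot(y).compute()   # compute() actually executes the graph
print("dask:", time.time() - start)

start = time.time()
for _ in range(100):
    p = np.dot(x, x)
print("numpy:", time.time() - start)
```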

Passing a Paramiko connection SFTPFile as input to a dask.dataframe.read_parquet

徘徊边缘 Submitted on 2019-12-08 08:10:38
Question: I tried to pass a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet and it worked fine. But when I tried the same with Dask, it threw an error. Below is the code I tried to run and the error I get. How can I make this work?

import dask.dataframe as dd
import paramiko

ssh = paramiko.SSHClient()
sftp_client = ssh.open_sftp()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
source_file = sftp_client.open(str(parquet_file), 'rb')
full_df = dd.read_parquet(source_file
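One route worth sketching (an assumption, not an answer quoted here): Dask's readers go through fsspec, which has an SFTP backend built on paramiko, so the remote path can be given as an sftp:// URL instead of an open file handle. The host, credentials, and path below are placeholders:

```python
import dask.dataframe as dd

# Hypothetical remote parquet file, opened through fsspec's SFTP filesystem.
full_df = dd.read_parquet(
    "sftp://remote-host/data/myfile.parquet",
    storage_options={"username": "user", "password": "secret"},
)
print(full_df.head())
```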

Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?

这一生的挚爱 Submitted on 2019-12-08 06:45:39
Question: Today I began using the Dask and Paramiko packages, partly as a learning exercise and partly because I'm beginning a project that will require dealing with large datasets (tens of GB) that must be accessed from a remote VM only (i.e. they cannot be stored locally). I have login credentials and sudo rights on this VM. I have minimal data analytics experience and no experience working with datasets more than a few thousand rows in size. The following piece of code belongs to a short helper program that
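As with the parquet question above, a hedged sketch of the fsspec route for CSV, assuming the SFTP backend is installed; all names below are placeholders:

```python
import dask.dataframe as dd

# Read remote CSVs over SFTP without driving Paramiko by hand.
df = dd.read_csv(
    "sftp://remote-vm/home/user/data/large-*.csv",
    storage_options={"username": "user", "password": "secret"},
    blocksize=64_000_000,  # split each remote file into ~64 MB partitions
)
print(df.head())
```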

3D volume processing using dask

倖福魔咒の Submitted on 2019-12-08 06:13:48
Question: I'm exploring 3D interactive volume convolution with some simple stencils using dask right now. Let me explain what I mean. Assume that you have 3D data which you would like to process with a Sobel transform (for example, to get the L1 or L2 gradient). You then divide your input 3D image into subvolumes (with some overlapping boundaries; a 3x3x3-stencil Sobel demands +2 samples of overlap/padding). Now let's assume that you create a delayed computation of the Sobel 3D transform on entire
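A sketch of the overlapping-subvolume pattern described above using dask.array's map_overlap together with scipy's Sobel filter; the array size and chunking are made up:

```python
import numpy as np
import dask.array as da
from scipy import ndimage

# Random stand-in volume, split into 64^3-voxel chunks.
volume = da.random.random((256, 256, 256), chunks=(64, 64, 64))

def sobel_magnitude(block):
    # L2 gradient magnitude from the three axis-aligned Sobel responses.
    gx = ndimage.sobel(block, axis=0)
    gy = ndimage.sobel(block, axis=1)
    gz = ndimage.sobel(block, axis=2)
    return np.sqrt(gx**2 + gy**2 + gz**2)

# depth=1 adds a one-voxel halo on each side of every chunk (the "+2 samples"
# of padding for a 3x3x3 stencil) and trims it again after the computation.
grad = volume.map_overlap(sobel_magnitude, depth=1, boundary="reflect")
result = grad.compute()
```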