dask

variable column name in dask assign() or apply()

Submitted by 白昼怎懂夜的黑 on 2019-12-23 15:44:44

Question: I have code that works in pandas, but I'm having trouble converting it to use dask. There is a partial solution here, but it does not allow me to use a variable as the name of the column I am creating/assigning to. Here's the working pandas code:

    percent_cols = ['num_unique_words', 'num_words_over_6']

    def find_fraction(row, col):
        return row[col] / row['num_words']

    for c in percent_cols:
        df[c] = df.apply(find_fraction, col=c, axis=1)

Here's the dask code that doesn't do what I want:

    data =

Can a dask dataframe with a unordered index cause silent errors?

Submitted by 谁说胖子不能爱 on 2019-12-23 15:35:05

Question: Methods around dask.DataFrame all seem to make sure that the index column is sorted. However, by using from_delayed, it is possible to construct a dask dataframe that has an index column which is not sorted:

    pdf1 = delayed(pd.DataFrame(dict(A=[1,2,3], B=[1,1,1])).set_index('A'))
    pdf2 = delayed(pd.DataFrame(dict(A=[1,2,3], B=[1,1,1])).set_index('A'))
    ddf = dd.from_delayed([pdf1, pdf2])  # dask.DataFrame with unordered index

The combination [index is set, index is not sorted, divisions are

How to program a stencil with Dask

Submitted by 穿精又带淫゛_ on 2019-12-23 13:24:59

Question: On many occasions, scientists simulate a system's dynamics using a stencil, that is, by convolving a mathematical operator over a grid. Commonly, this operation consumes a lot of computational resources. Here is a good explanation of the idea. In numpy, the canonical way of programming a 2D 5-point stencil is as follows:

    for i in range(rows):
        for j in range(cols):
            grid[i, j] = (grid[i, j] + grid[i-1, j] + grid[i+1, j]
                          + grid[i, j-1] + grid[i, j+1]) / 5

Or, more efficiently, using slicing:

    grid[1:-1
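The slicing version is cut off above; as a hedged sketch of how the same 5-point stencil maps onto Dask, `dask.array.map_overlap` applies a function per chunk while exchanging a layer of ghost cells between neighbouring chunks (the 5x5 grid values here are illustrative):

```python
import numpy as np
import dask.array as da

grid_np = np.arange(25.0).reshape(5, 5)
grid = da.from_array(grid_np, chunks=(3, 3))

def five_point(block):
    # Average each interior cell of the (ghost-padded) block with its four
    # neighbours; map_overlap trims the ghost layer away afterwards.
    out = block.copy()
    out[1:-1, 1:-1] = (block[1:-1, 1:-1]
                       + block[:-2, 1:-1] + block[2:, 1:-1]
                       + block[1:-1, :-2] + block[1:-1, 2:]) / 5
    return out

# depth=1 shares one layer of ghost cells between neighbouring chunks, so
# the stencil sees real neighbour values across chunk boundaries;
# boundary="nearest" pads the outer edge of the whole array.
smoothed = grid.map_overlap(five_point, depth=1, boundary="nearest").compute()
```

Interior results are identical to the pure-numpy loop; only the treatment of the array's outer border depends on the chosen `boundary` mode.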

dask dataframe apply meta

Submitted by 守給你的承諾、 on 2019-12-23 07:04:36

Question: I want to do a frequency count on a single column of a dask dataframe. The code works, but I get a warning complaining that meta is not defined. If I try to define meta, I get the error AttributeError: 'DataFrame' object has no attribute 'name'. For this particular use case it doesn't look like I need to define meta, but I'd like to know how to do that for future reference. Dummy dataframe and the column frequencies:

    import pandas as pd
    from dask import dataframe as dd
    df = pd.DataFrame([[

Persistent dataflows with dask

Submitted by 梦想的初衷 on 2019-12-23 05:09:16

Question: I am interested in working with persistent distributed dataflows with features similar to those of the Pegasus project (https://pegasus.isi.edu/), for example. Do you think there is a way to do that with dask? I tried to implement something that works with a SLURM cluster and dask. Below I will describe my solution in broad strokes in order to better specify my use case. The idea is to execute medium-size tasks (that run between a few minutes and hours) which are specified with a graph which can

dask groupby without combining partitions

Submitted by 馋奶兔 on 2019-12-23 02:53:04

Question: I have a set of data on which I want to do a simple groupby/count operation, and I don't seem to be able to do it using dask. Most probably I don't understand the way the groupby/reduce is performed in dask, especially when the index is in the grouping key. So I'll illustrate my problem with toy data. First I create a dataframe with 3 columns:

    import pandas as pd
    import numpy as np
    np.random.seed(0)
    df = pd.DataFrame(
        {"A": np.random.randint(6, size=20),
         "B": np.random.randint(6, size=20),

Dask - How to concatenate Series into a DataFrame with apply?

Submitted by 二次信任 on 2019-12-23 01:16:07

Question: How do I return multiple values from a function applied to a Dask Series? I am trying to return a Series from each iteration of dask.Series.apply, so that the final result is a dask.DataFrame. The following code tells me that the meta is wrong. The all-pandas version, however, works. What's wrong here? Update: I think that I am not specifying the meta/schema correctly. How do I do it correctly? It works once I drop the meta argument; however, it raises a warning. I would like to use dask

Replace a dask dataframe partition

Submitted by 天涯浪子 on 2019-12-22 00:35:40

Question: Can I replace a dask dataframe partition with another dask dataframe partition that I've created separately, with the same number of rows and the same structure? If yes, how? Is it possible with a different number of rows?

Answer 1: You can add partitions to the beginning or end of a Dask dataframe using the dd.concat function. You can insert a new partition anywhere in the dataframe by switching to delayed objects, inserting a delayed object into the list, and then switching back to a dask dataframe

Row by row processing of a Dask DataFrame

Submitted by 喜欢而已 on 2019-12-21 20:38:20

Question: I need to process a large file and change some values. I would like to do something like this:

    for index, row in dataFrame.iterrows():
        foo = doSomeStuffWith(row)
        lol = doOtherStuffWith(row)
        dataFrame['colx'][index] = foo
        dataFrame['coly'][index] = lol

Unfortunately, I cannot do dataFrame['colx'][index] = foo! My number of rows is quite large and I need to process a large number of columns, so I'm afraid that dask may read the file several times if I do one dataFrame.apply(...) for each column.

Dask dataframe - split column into multiple rows based on delimiter

Submitted by 人盡茶涼 on 2019-12-21 17:36:18

Question: What is an efficient way of splitting a column into multiple rows using a dask dataframe? For example, let's say I have a csv file which I read using dask to produce the following dask dataframe:

    id  var1  var2
    1   A     Z,Y
    2   B     X
    3   C     W,U,V

I would like to convert it to:

    id  var1  var2
    1   A     Z
    1   A     Y
    2   B     X
    3   C     W
    3   C     U
    3   C     V

I have looked into the answers for "Split (explode) pandas dataframe string entry to separate rows" and "pandas: How do I split text in a column into multiple rows?". I tried applying the