dask

Loading Cassandra Data into Dask Dataframe

Submitted by 心已入冬 on 2020-01-05 12:58:35
Question: I am trying to load data from a Cassandra database into a Dask dataframe. I have tried querying the following with no success:

query = """SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df))

TypeError                                 Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))

TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'

Does anybody know an easy way to load data
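A commonly suggested pattern (a sketch, not necessarily the thread's accepted answer) is to materialise the query result in pandas first and only then hand it to Dask; dd.DataFrame is a low-level constructor and is not meant to be called directly on a list of rows. Here man.session is assumed to be a connected cassandra-driver Session, as in the question, and the partition count is arbitrary.

import pandas as pd
import dask.dataframe as dd

rows = man.session.execute("SELECT * FROM document_table")
pdf = pd.DataFrame(list(rows))            # cassandra-driver rows behave like namedtuples
ddf = dd.from_pandas(pdf, npartitions=8)  # wrap the in-memory frame as a Dask dataframe

For tables that do not fit in memory, the same idea can be split across several smaller queries wrapped in dask.delayed and combined with dd.from_delayed.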

apply a lambda function to a dask dataframe

Submitted by 微笑、不失礼 on 2020-01-05 08:36:52
Question: I am looking to apply a lambda function to a dask dataframe to change the labels in a column if they occur less than a certain percentage of the time. The method I am using works well for a pandas dataframe, but the same code does not work for a dask dataframe. The code is below.

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

df:
output:
   A    B      C
0  ant  cat    dog
1  ant  peach
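One way to express this in dask (a sketch of an assumed intent, not the thread's answer) is to compute the label frequencies eagerly and then relabel the rare values lazily with isin/mask rather than a row-wise lambda. The 'other' label and the 0.25 threshold are invented for illustration.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

threshold = 0.25
freqs = (ddf['A'].value_counts() / len(df)).compute()    # share of each label
rare = freqs[freqs < threshold].index.tolist()            # e.g. ['cherry', 'bee']
ddf['A'] = ddf['A'].mask(ddf['A'].isin(rare), 'other')    # lazy relabelling
print(ddf.compute())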

How to check if dask dataframe is empty if lazily evaluated?

Submitted by 有些话、适合烂在心里 on 2020-01-05 07:13:33
Question: I am aware of this question. But check the code (minimal working example) below:

import dask.dataframe as dd
import pandas as pd

# initialise data of lists
data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}

# create DataFrame
df = pd.DataFrame(data)
dask_df = dd.from_pandas(df, npartitions=1)
categoric_df = dask_df.select_dtypes(include="category")

When I try to print categoric_df I get the following error:

ValueError: No objects to concatenate

And when I check the
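Continuing from the snippet above, two cheap checks can often answer the question without computing the whole frame (a sketch of an assumed approach, not necessarily the linked answer): the column selection is pure metadata, and a one-row head() covers the case where columns exist but there are no rows.

# Column-level check: no matching columns means there is nothing to compute.
if len(categoric_df.columns) == 0:
    print("no categorical columns selected")
# Row-level check: columns exist, so peek at a single row cheaply.
elif len(categoric_df.head(1, npartitions=-1)) == 0:
    print("categorical columns exist, but the dataframe has no rows")
else:
    print("not empty")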

How does dask.delayed handle mutable inputs?

Submitted by 怎甘沉沦 on 2020-01-05 04:24:09
Question: If I have a mutable object, let's say for example a dict, how does dask handle passing that as an input to delayed functions? Specifically, what if I make updates to the dict between delayed calls? I tried the following example, which seems to suggest that some copying is going on, but can you elaborate on what exactly dask is doing?

In [3]: from dask import delayed
In [4]: x = {}
In [5]: foo = delayed(print)
In [6]: foo(x)
Out[6]: Delayed('print-73930550-94a6-43f9-80ab-072bc88c2b88')
In [7]: foo(x)
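A small experiment (mine, not from the thread) that makes the behaviour observable. My understanding, which should be checked against your dask version and scheduler, is that delayed traverses plain containers such as dicts when the graph is built, so each call captures a structural snapshot of the dict at call time, while any objects stored inside it remain shared by reference.

from dask import delayed

@delayed
def snapshot(d):
    return dict(d)          # return a plain copy of whatever this task receives

x = {}
first = snapshot(x)         # x is empty at this point
x['added_later'] = 1        # mutate between the two delayed calls
second = snapshot(x)

print(first.compute())      # expected {} if the dict was snapshotted at call time
print(second.compute())     # expected {'added_later': 1}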

Subsetting Dask DataFrames

Submitted by 两盒软妹~` on 2020-01-04 14:03:34
Question: Is this a valid way of loading subsets of a dask dataframe into memory?

while i < len_df:
    j = i + batch_size
    if j > len_df:
        j = len_df
    subset = df.loc[i:j, 'source_country_codes'].compute()

I read somewhere that this may not be correct because of how dask assigns index numbers when it divides the bigger dataframe into smaller pandas dfs. Also, I don't think dask dataframes have an iloc attribute. I am using version 0.15.2. In terms of use cases, this would be a way of loading batches of data
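A sketch of an alternative (assumed, not the thread's answer) that sidesteps the index question entirely: iterate over partitions, each of which arrives in memory as an ordinary pandas object. The data and partition count below are invented for illustration.

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({'source_country_codes': ['US', 'DE', 'FR', 'JP', 'BR', 'IN']}),
    npartitions=3,
)

for i in range(df.npartitions):
    batch = df.get_partition(i)['source_country_codes'].compute()
    print("partition {}: {} rows".format(i, len(batch)))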

Loading large datasets with dask

Submitted by 倖福魔咒の on 2020-01-04 13:44:57
Question: I am in an HPC environment with clusters, tightly coupled interconnects, and backing Lustre filesystems. We have been exploring how to leverage Dask not only to provide computation, but also to act as a distributed cache to speed up our workflows. Our proprietary data format is n-dimensional and regular, and we have coded a lazy reader to pass into the from_array/from_delayed methods. We have had some issues with loading and persisting larger-than-memory datasets across a Dask cluster.
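An illustrative sketch of the general pattern (all names, sizes, and the scheduler address are invented; read_block stands in for the proprietary reader): wrap the lazy block reader in dask.delayed, assemble a dask array from the delayed blocks, and persist it so the chunks live in distributed worker memory instead of being re-read from Lustre on every computation.

import numpy as np
import dask
import dask.array as da
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")   # placeholder scheduler address
chunk_shape = (1024, 1024)

def read_block(i, j):
    # Placeholder: the real reader would pull block (i, j) from the custom format.
    return np.zeros(chunk_shape, dtype="float32")

blocks = [
    [da.from_delayed(dask.delayed(read_block)(i, j),
                     shape=chunk_shape, dtype="float32")
     for j in range(8)]
    for i in range(8)
]
arr = da.block(blocks)   # lazy 8192 x 8192 array; nothing has been read yet
arr = arr.persist()      # materialise the chunks once, in worker memory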

Join two large files by column in python

Submitted by 穿精又带淫゛_ on 2020-01-04 05:48:05
Question: I have 2 files with 38374732 lines in each, and each is 3.3 GB in size. I am trying to join them on the first column. To do so I decided to use pandas with the following code, pulled from Stack Overflow:

import pandas as pd
import sys

a = pd.read_csv(sys.argv[1], sep='\t', encoding="utf-8-sig")
b = pd.read_csv(sys.argv[2], sep='\t', encoding="utf-8-sig")
chunksize = 10 ** 6
for chunk in a(chunksize=chunksize):
    merged = chunk.merge(b, on='Bin_ID')
    merged.to_csv("output.csv", index=False, sep='\t')
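Given the page's theme, here is a dask.dataframe sketch of the same join (an assumed alternative, not the thread's answer). The 'Bin_ID' column, tab separator, and encoding come from the question; the file paths and block size are placeholders.

import dask.dataframe as dd

a = dd.read_csv("file_a.tsv", sep="\t", encoding="utf-8-sig", blocksize="64MB")
b = dd.read_csv("file_b.tsv", sep="\t", encoding="utf-8-sig", blocksize="64MB")

merged = a.merge(b, on="Bin_ID")                       # lazy, out-of-core join
merged.to_csv("output-*.csv", index=False, sep="\t")   # one output file per partition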

Incompatibility of apply in dask and pandas dataframes

Submitted by 泪湿孤枕 on 2020-01-04 05:34:50
Question: A sample of the triggers column in my Dask dataframe looks like the following:

0    [Total Traffic, DNS, UDP]
1    [TCP RST]
2    [Total Traffic]
3    [IP Private]
4    [ICMP]
Name: triggers, dtype: object

I wish to create a one-hot encoded version of the above arrays (putting a 1 against the DNS column in row 1, for example) by doing the following. pop_triggers contains all possible values of triggers.

for trig in pop_triggers:
    df[trig] = df.triggers.apply(lambda x: 1 if trig in x else 0)

However, the
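A sketch of the usual fix (assumed, since the excerpt cuts off before the error): dask's Series.apply expects a meta hint describing the output column, which pandas silently ignores, and the loop variable should be bound inside the lambda so each new column tests the right trigger. The sample data below mirrors the question.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'triggers': [['Total Traffic', 'DNS', 'UDP'],
                                 ['TCP RST'],
                                 ['Total Traffic'],
                                 ['IP Private'],
                                 ['ICMP']]})
df = dd.from_pandas(pdf, npartitions=2)

pop_triggers = ['Total Traffic', 'DNS', 'UDP', 'TCP RST', 'IP Private', 'ICMP']
for trig in pop_triggers:
    df[trig] = df.triggers.apply(
        lambda x, trig=trig: 1 if trig in x else 0,   # bind trig per iteration
        meta=(trig, 'int64'),                          # tell dask the output schema
    )

print(df.compute())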