dask

Loading Cassandra Data into Dask Dataframe

Submitted by 心已入冬 on 2020-01-05 12:58:35
Question: I am trying to load data from a Cassandra database into a Dask dataframe. I have tried querying the following with no success:

query = """SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df))

TypeError                                 Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))

TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'

Does anybody know an easy way to load data
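A commonly suggested pattern (a sketch, not necessarily the thread's accepted answer) is to materialise the query result in pandas first and only then hand it to Dask; dd.DataFrame is a low-level constructor and is not meant to be called directly on a list of rows. Here man.session is assumed to be a connected cassandra-driver Session, as in the question, and the partition count is arbitrary.

import pandas as pd
import dask.dataframe as dd

rows = man.session.execute("SELECT * FROM document_table")
pdf = pd.DataFrame(list(rows))            # cassandra-driver rows behave like namedtuples
ddf = dd.from_pandas(pdf, npartitions=8)  # wrap the in-memory frame as a Dask dataframe

For tables that do not fit in memory, the same idea can be split across several smaller queries wrapped in dask.delayed and combined with dd.from_delayed.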

apply a lambda function to a dask dataframe

Submitted by 微笑、不失礼 on 2020-01-05 08:36:52
Question: I am looking to apply a lambda function to a dask dataframe to change the labels in a column if they occur less than a certain percentage of the time. The method I am using works well for a pandas dataframe, but the same code does not work for a dask dataframe. The code is below.

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

df:
output:
   A    B      C
0  ant  cat    dog
1  ant  peach
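One way to express this in dask (a sketch of an assumed intent, not the thread's answer) is to compute the label frequencies eagerly and then relabel the rare values lazily with isin/mask rather than a row-wise lambda. The 'other' label and the 0.25 threshold are invented for illustration.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

threshold = 0.25
freqs = (ddf['A'].value_counts() / len(df)).compute()    # share of each label
rare = freqs[freqs < threshold].index.tolist()            # e.g. ['cherry', 'bee']
ddf['A'] = ddf['A'].mask(ddf['A'].isin(rare), 'other')    # lazy relabelling
print(ddf.compute())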

How to check if dask dataframe is empty if lazily evaluated?

Submitted by 有些话、适合烂在心里 on 2020-01-05 07:13:33
Question: I am aware of this question. But check the code (minimal working example) below:

import dask.dataframe as dd
import pandas as pd

# initialise data of lists
data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}

# create DataFrame
df = pd.DataFrame(data)
dask_df = dd.from_pandas(df, npartitions=1)
categoric_df = dask_df.select_dtypes(include="category")

When I try to print categoric_df I get the following error:

ValueError: No objects to concatenate

And when I check the
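Continuing from the snippet above, two cheap checks can often answer the question without computing the whole frame (a sketch of an assumed approach, not necessarily the linked answer): the column selection is pure metadata, and a one-row head() covers the case where columns exist but there are no rows.

# Column-level check: no matching columns means there is nothing to compute.
if len(categoric_df.columns) == 0:
    print("no categorical columns selected")
# Row-level check: columns exist, so peek at a single row cheaply.
elif len(categoric_df.head(1, npartitions=-1)) == 0:
    print("categorical columns exist, but the dataframe has no rows")
else:
    print("not empty")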

How does dask.delayed handle mutable inputs?

Submitted by 怎甘沉沦 on 2020-01-05 04:24:09
Question: If I have a mutable object, let's say for example a dict, how does dask handle passing that as an input to delayed functions? Specifically, what if I make updates to the dict between delayed calls? I tried the following example, which seems to suggest that some copying is going on, but can you elaborate on what exactly dask is doing?

In [3]: from dask import delayed
In [4]: x = {}
In [5]: foo = delayed(print)
In [6]: foo(x)
Out[6]: Delayed('print-73930550-94a6-43f9-80ab-072bc88c2b88')
In [7]: foo(x)
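A small experiment (mine, not from the thread) that makes the behaviour observable. My understanding, which should be checked against your dask version and scheduler, is that delayed traverses plain containers such as dicts when the graph is built, so each call captures a structural snapshot of the dict at call time, while any objects stored inside it remain shared by reference.

from dask import delayed

@delayed
def snapshot(d):
    return dict(d)          # return a plain copy of whatever this task receives

x = {}
first = snapshot(x)         # x is empty at this point
x['added_later'] = 1        # mutate between the two delayed calls
second = snapshot(x)

print(first.compute())      # expected {} if the dict was snapshotted at call time
print(second.compute())     # expected {'added_later': 1}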

Subsetting Dask DataFrames

Submitted by 两盒软妹~` on 2020-01-04 14:03:34
Question: Is this a valid way of loading subsets of a dask dataframe into memory?

while i < len_df:
    j = i + batch_size
    if j > len_df:
        j = len_df
    subset = df.loc[i:j, 'source_country_codes'].compute()

I read somewhere that this may not be correct because of how dask assigns index numbers when it divides the bigger dataframe into smaller pandas dfs. Also, I don't think dask dataframes have an iloc attribute. I am using version 0.15.2. In terms of use cases, this would be a way of loading batches of data
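A sketch of an alternative (assumed, not the thread's answer) that sidesteps the index question entirely: iterate over partitions, each of which arrives in memory as an ordinary pandas object. The data and partition count below are invented for illustration.

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({'source_country_codes': ['US', 'DE', 'FR', 'JP', 'BR', 'IN']}),
    npartitions=3,
)

for i in range(df.npartitions):
    batch = df.get_partition(i)['source_country_codes'].compute()
    print("partition {}: {} rows".format(i, len(batch)))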

Loading large datasets with dask

Submitted by 倖福魔咒の on 2020-01-04 13:44:57
Question: I am in an HPC environment with clusters, tightly coupled interconnects, and backing Lustre filesystems. We have been exploring how to leverage Dask not only to provide computation, but also to act as a distributed cache to speed up our workflows. Our proprietary data format is n-dimensional and regular, and we have coded a lazy reader to pass into the from_array/from_delayed methods. We have had some issues with loading and persisting larger-than-memory datasets across a Dask cluster.
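An illustrative sketch of the general pattern (all names, sizes, and the scheduler address are invented; read_block stands in for the proprietary reader): wrap the lazy block reader in dask.delayed, assemble a dask array from the delayed blocks, and persist it so the chunks live in distributed worker memory instead of being re-read from Lustre on every computation.

import numpy as np
import dask
import dask.array as da
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")   # placeholder scheduler address
chunk_shape = (1024, 1024)

def read_block(i, j):
    # Placeholder: the real reader would pull block (i, j) from the custom format.
    return np.zeros(chunk_shape, dtype="float32")

blocks = [
    [da.from_delayed(dask.delayed(read_block)(i, j),
                     shape=chunk_shape, dtype="float32")
     for j in range(8)]
    for i in range(8)
]
arr = da.block(blocks)   # lazy 8192 x 8192 array; nothing has been read yet
arr = arr.persist()      # materialise the chunks once, in worker memory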

Join two large files by column in python

Submitted by 穿精又带淫゛_ on 2020-01-04 05:48:05
Question: I have 2 files with 38374732 lines in each, and each is 3.3 GB in size. I am trying to join them on the first column. To do so I decided to use pandas with the following code, pulled from Stack Overflow:

import pandas as pd
import sys

a = pd.read_csv(sys.argv[1], sep='\t', encoding="utf-8-sig")
b = pd.read_csv(sys.argv[2], sep='\t', encoding="utf-8-sig")
chunksize = 10 ** 6
for chunk in a(chunksize=chunksize):
    merged = chunk.merge(b, on='Bin_ID')
    merged.to_csv("output.csv", index=False, sep='\t')
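Given the page's theme, here is a dask.dataframe sketch of the same join (an assumed alternative, not the thread's answer). The 'Bin_ID' column, tab separator, and encoding come from the question; the file paths and block size are placeholders.

import dask.dataframe as dd

a = dd.read_csv("file_a.tsv", sep="\t", encoding="utf-8-sig", blocksize="64MB")
b = dd.read_csv("file_b.tsv", sep="\t", encoding="utf-8-sig", blocksize="64MB")

merged = a.merge(b, on="Bin_ID")                       # lazy, out-of-core join
merged.to_csv("output-*.csv", index=False, sep="\t")   # one output file per partition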

Incompatibility of apply in dask and pandas dataframes

Submitted by 泪湿孤枕 on 2020-01-04 05:34:50
Question: A sample of the triggers column in my Dask dataframe looks like the following:

0    [Total Traffic, DNS, UDP]
1    [TCP RST]
2    [Total Traffic]
3    [IP Private]
4    [ICMP]
Name: triggers, dtype: object

I wish to create a one-hot encoded version of the above arrays (putting a 1 against the DNS column in row 1, for example) by doing the following. pop_triggers contains all possible values of triggers.

for trig in pop_triggers:
    df[trig] = df.triggers.apply(lambda x: 1 if trig in x else 0)

However, the
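A sketch of the usual fix (assumed, since the excerpt cuts off before the error): dask's Series.apply expects a meta hint describing the output column, which pandas silently ignores, and the loop variable should be bound inside the lambda so each new column tests the right trigger. The sample data below mirrors the question.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'triggers': [['Total Traffic', 'DNS', 'UDP'],
                                 ['TCP RST'],
                                 ['Total Traffic'],
                                 ['IP Private'],
                                 ['ICMP']]})
df = dd.from_pandas(pdf, npartitions=2)

pop_triggers = ['Total Traffic', 'DNS', 'UDP', 'TCP RST', 'IP Private', 'ICMP']
for trig in pop_triggers:
    df[trig] = df.triggers.apply(
        lambda x, trig=trig: 1 if trig in x else 0,   # bind trig per iteration
        meta=(trig, 'int64'),                          # tell dask the output schema
    )

print(df.compute())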