castra

Dask DataFrame: Resample over groupby object with multiple rows

Submitted by 痴心易碎 on 2019-12-13 12:11:59
Question: I have the following dask dataframe created from Castra:

    import dask.dataframe as dd
    df = dd.from_castra('data.castra', columns=['user_id', 'ts', 'text'])

Yielding:

                         user_id                  ts  text
    ts
    2015-08-08 01:10:00     9235 2015-08-08 01:10:00     a
    2015-08-08 02:20:00     2353 2015-08-08 02:20:00     b
    2015-08-08 02:20:00     9235 2015-08-08 02:20:00     c
    2015-08-08 04:10:00     9235 2015-08-08 04:10:00     d
    2015-08-08 08:10:00     2353 2015-08-08 08:10:00     e

What I'm trying to do is:

1. Group by user_id and ts
2. Resample it over a 3-hour period
3. In the resampling step, any merged rows should concatenate the texts

Example output: text user …
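
A minimal pandas sketch of the groupby-plus-resample step described above (Castra and dd.from_castra are long unmaintained, so only the aggregation logic is shown; the frame construction and the choice of ' '.join as the concatenation rule are assumptions drawn from the question, not the asker's code):

    import pandas as pd

    # Rebuild the sample frame from the question; ts becomes the index,
    # which is what resample() operates on.
    df = pd.DataFrame(
        {'user_id': [9235, 2353, 9235, 9235, 2353],
         'ts': pd.to_datetime(['2015-08-08 01:10:00', '2015-08-08 02:20:00',
                               '2015-08-08 02:20:00', '2015-08-08 04:10:00',
                               '2015-08-08 08:10:00']),
         'text': list('abcde')}
    ).set_index('ts')

    # Group by user, bucket each user's rows into 3-hour windows on the
    # index, and join the texts that land in the same window.
    out = (df.groupby('user_id')
             .resample('3h')['text']
             .agg(lambda s: ' '.join(s)))

On a dask dataframe the same per-partition logic can be pushed down with map_partitions; this hedged sketch is only correct when no 3-hour window straddles a partition boundary (ddf stands for the dask frame from the question):

    # Hedged: assumes partition edges align with the 3-hour grid.
    result = ddf.map_partitions(
        lambda pdf: pdf.groupby('user_id')
                       .resample('3h')['text']
                       .agg(lambda s: ' '.join(s)),
        meta=('text', 'object'))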

dask computation not executing in parallel

Submitted by 混江龙づ霸主 on 2019-12-10 03:53:59
Question: I have a directory of JSON files that I am trying to convert to a dask DataFrame and save to Castra. There are 200 files containing O(10**7) JSON records between them. The code is very simple, largely following tutorial examples:

    import dask.dataframe as dd
    import dask.bag as db
    import json

    txt = db.from_filenames('part-*.json')
    js = txt.map(json.loads)
    df = js.to_dataframe()
    cs = df.to_castra("data.castra")

I am running it on a 32-core machine, but the code only utilizes one core at 100%. My understanding from the docs is that this code should execute in parallel. Why is it not? Did I misunderstand …
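
A point worth checking first is which scheduler actually runs the graph: json.loads is pure-Python work that holds the GIL, so thread-based execution shows up as one busy core even on a 32-core box. A hedged, modernized sketch (db.from_filenames and Castra are both gone from current dask; read_text and a Parquet write stand in for them below as assumptions, not the asker's original targets):

    import json
    import dask
    import dask.bag as db

    txt = db.read_text('part-*.json')   # current replacement for from_filenames
    records = txt.map(json.loads)
    df = records.to_dataframe()

    # Force the process-based scheduler so the GIL-bound json.loads
    # calls run on separate cores instead of serializing in threads.
    with dask.config.set(scheduler='processes'):
        df.to_parquet('data.parquet')   # stands in for the Castra write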
