dask

Dask Memory Error when running df.to_csv()

Submitted by 江枫思渺然 on 2020-03-20 06:26:05
Question: I am trying to index and save large CSVs that cannot be loaded into memory. My code to load the CSV, perform a computation, and index by the new values works without issue. A simplified version is:

    cluster = LocalCluster(n_workers=6, threads_per_worker=1)
    client = Client(cluster, memory_limit='1GB')
    df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e7)
    df['new_col'] = df.map_partitions(lambda x: some_function(x))
    df = df.set_index(df.new_col, sorted=False)

However, when I use large […]
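For reference, a minimal sketch of one common way to keep the write memory-bounded, assuming a setup like the snippet above ('big-input.csv' and the output name are placeholders): passing to_csv a filename containing '*' makes Dask write one CSV per partition rather than gathering the whole frame.

    import dask.dataframe as dd
    from dask.distributed import Client, LocalCluster

    if __name__ == '__main__':
        # memory_limit is a per-worker setting, normally configured on the
        # cluster/workers rather than passed to Client.
        cluster = LocalCluster(n_workers=6, threads_per_worker=1,
                               memory_limit='1GB')
        client = Client(cluster)

        df = dd.read_csv('big-input.csv', header=None, sep=' ',
                         blocksize=25e7)

        # A '*' in the target name yields one file per partition, so each
        # worker streams out only its own chunk.
        df.to_csv('output-*.csv')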

Convert Pandas dataframe to Dask dataframe

Submitted by 。_饼干妹妹 on 2020-02-26 07:13:54
Question: Suppose I have a pandas dataframe:

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

When I convert it into a dask dataframe, what should the name and divisions parameters consist of?

    from dask import dataframe as dd
    sd = dd.DataFrame(df.to_dict(), divisions=1,
                      meta=pd.DataFrame(columns=df.columns, index=df.index))

    TypeError: __init__() missing 1 required positional argument: 'name'

Edit: Suppose I create a pandas dataframe like pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}). Similarly, how to create a dask dataframe as it […]
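The dd.DataFrame constructor is a low-level internal; the usual way to build a dask dataframe from an in-memory pandas one is dd.from_pandas, which fills in name and divisions by itself. A minimal sketch:

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

    # npartitions controls how the rows are split; from_pandas derives
    # the divisions from the index automatically.
    sd = dd.from_pandas(df, npartitions=2)
    print(sd.compute())  # round-trips back to the original pandas frame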

Compiling Executable with dask or joblib multiprocessing with cython results in errors

Submitted by 删除回忆录丶 on 2020-02-22 15:33:33
Question: I'm converting some serially processed Python jobs to multiprocessing with dask or joblib. Sadly, I need to work on Windows. When running from within IPython, or when invoking the .py file with python from the command line, everything runs fine. When compiling an executable with cython, it no longer runs fine: step by step, more and more processes (unlimited, and more than the number of requested processes) get started and block my system. It somehow feels like a multiprocessing bomb, but of […]
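On Windows, and in frozen or compiled executables in particular, child processes re-import the main module, so any code that spawns workers has to sit behind a __main__ guard, and frozen builds additionally need freeze_support(). A minimal sketch, assuming a joblib-style loop (work() is a placeholder):

    from multiprocessing import freeze_support
    from joblib import Parallel, delayed

    def work(i):
        return i * i  # stand-in for the real per-item job

    def main():
        results = Parallel(n_jobs=4)(delayed(work)(i) for i in range(100))
        print(results)

    if __name__ == '__main__':
        freeze_support()  # no-op normally; required in frozen executables
        main()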

Dask Array from DataFrame

Submitted by 不想你离开。 on 2020-02-13 09:20:42
Question: Is there a way to easily convert a DataFrame of numeric values into an Array, similar to .values with a pandas DataFrame? I can't seem to find any way to do this with the provided API, but I'd assume it's a common operation.

Answer 1: Edit: yes, now this is trivial. You can use the .values property:

    x = df.values

Older, now-incorrect answer: At the moment there is no trivial way to do this. This is because dask.array needs to know the length of all of its chunks and dask.dataframe doesn't know this […]
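One usage note: .values gives back a dask array whose chunk lengths are unknown, which blocks some downstream operations (slicing, for example); to_dask_array(lengths=True) walks the data once to record them. A small sketch:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0]}),
                         npartitions=2)

    x = ddf.values                       # chunk sizes show up as nan
    y = ddf.to_dask_array(lengths=True)  # computes real chunk lengths

    print(x)
    print(y.compute())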

DASK: TypeError: Column assignment doesn't support type numpy.ndarray whereas Pandas works fine

Submitted by £可爱£侵袭症+ on 2020-02-07 02:33:13
Question: I'm using Dask to read in a 10M+ row CSV and perform some calculations. So far it's proving to be 10x faster than Pandas. I have a piece of code, below, that works fine with pandas but throws a TypeError with dask. I am unsure how to overcome the TypeError. It seems like an array is being handed back to the dataframe/column by the select function when using dask, but not when using pandas? But I don't want to switch the whole thing back to pandas and lose the 10x performance […]
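numpy.select returns a plain ndarray, which a dask dataframe cannot accept in a column assignment; one common workaround is to run the selection inside map_partitions, so each pandas partition does the assignment itself. A sketch under assumed column names ('score' and 'label' are placeholders):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({'score': [5, -3, 12, 0]}),
                         npartitions=2)

    def add_label(pdf):
        # Inside map_partitions, pdf is ordinary pandas, which is happy
        # to take the ndarray that np.select returns.
        return pdf.assign(label=np.select(
            [pdf['score'] > 0, pdf['score'] < 0],
            ['pos', 'neg'], default='zero'))

    ddf = ddf.map_partitions(add_label)
    print(ddf.compute())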

How to read a single large parquet file into multiple partitions using dask/dask-cudf?

Submitted by ∥☆過路亽.° on 2020-02-05 13:09:23
Question: I am trying to read a single large parquet file (size > gpu_size) using dask_cudf / dask, but it is currently reading it into a single partition, which I am guessing is the expected behavior, inferring from the docstring:

    dask.dataframe.read_parquet(path, columns=None, filters=None,
        categories=None, index=None, storage_options=None, engine='auto',
        gather_statistics=None, **kwargs)

    Read a Parquet file into a Dask DataFrame

    This reads a directory of Parquet data into a Dask.dataframe, one file […]
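If the file contains more than one row group, recent dask versions expose a split_row_groups argument on read_parquet that maps partitions to row groups instead of whole files. A minimal sketch (the path is a placeholder):

    import dask.dataframe as dd

    # One partition per parquet row group rather than per file; this only
    # helps if the writer produced multiple row groups in the file.
    ddf = dd.read_parquet('data/large_file.parquet', split_row_groups=True)
    print(ddf.npartitions)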

An attempt has been made to start a new process before the current process has finished its bootstrapping phase

Submitted by 风格不统一 on 2020-02-02 11:24:44
Question: I am new to dask, and I found it really nice to have a module that makes parallelization easy. I am working on a project where I was able to parallelize a loop on a single machine, as you can see here. However, I would like to move over to dask.distributed. I applied the following changes to the class above:

    diff --git a/mlchem/fingerprints/gaussian.py b/mlchem/fingerprints/gaussian.py
    index ce6a72b..89f8638 100644
    --- a/mlchem/fingerprints/gaussian.py
    +++ b/mlchem/fingerprints/gaussian.py
    […]
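The error in the title is raised when worker processes are spawned at import time; with dask.distributed the usual fix is to create the Client under a __main__ guard, so that workers started with the spawn method do not re-run the cluster setup when they import the module. A minimal sketch:

    from dask.distributed import Client

    def main():
        # Creating the client here, not at module top level, keeps worker
        # processes from re-executing this setup on import.
        client = Client(n_workers=4)
        futures = client.map(lambda x: x ** 2, range(10))
        print(client.gather(futures))

    if __name__ == '__main__':
        main()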

Dask read_csv fails where pandas doesn't

Submitted by 白昼怎懂夜的黑 on 2020-02-02 00:58:07
Question: Trying to use dask's read_csv on a file that pandas's read_csv handles fine, like this:

    dd.read_csv('data/ecommerce-new.csv')

fails with the following error:

    pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2

The file is a CSV of data scraped using scrapy, with two columns: one with the URL and the other with the HTML (which is stored as multiline text, using " as the quote character). Since pandas actually parses it, the file should be well-formatted.

    html,url
    https:/ […]
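dask splits a CSV into blocks at byte offsets and assumes newlines are row boundaries, which breaks when quoted fields contain embedded newlines; a known workaround is blocksize=None, which reads the file as a single partition (giving up parallel reading of that one file). A sketch:

    import dask.dataframe as dd

    # blocksize=None disables block splitting, so line breaks inside the
    # quoted HTML column no longer confuse the tokenizer; the trade-off
    # is a single partition for this file.
    ddf = dd.read_csv('data/ecommerce-new.csv', blocksize=None)
    print(ddf.head())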