dask

iterate over GroupBy object in dask

Submitted by 妖精的绣舞 on 2019-11-28 03:58:51
Question: Is it possible to iterate over a dask GroupBy object to get access to the underlying dataframes? I tried:

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['1', '1', 'a', 'a', 'a']})
    ddf = dd.from_pandas(pdf, npartitions=3)

    groups = ddf.groupby('B')
    for name, df in groups:
        print(name)

However, this results in an error: KeyError: 'Column not found: 0'. More generally speaking, what kinds of interaction does the dask GroupBy object allow, except from the …
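A minimal sketch of a common workaround, assuming the grouping keys fit comfortably in memory: materialize the distinct keys first, then pull each group out with get_group, which dask's GroupBy does provide.

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['1', '1', 'a', 'a', 'a']})
    ddf = dd.from_pandas(pdf, npartitions=3)

    # Materializing the distinct keys triggers a (small) computation.
    keys = ddf['B'].unique().compute()

    groups = ddf.groupby('B')
    for key in keys:
        # get_group returns a lazy dask DataFrame holding just that group.
        print(key)
        print(groups.get_group(key).compute())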

How do you parallelize apply() on Pandas Dataframes making use of all cores on one machine?

Submitted by 烈酒焚心 on 2019-11-28 03:11:33
As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its compute time when you run df.apply(myfunc, axis=1). How can you use all your cores to run apply on a dataframe in parallel? The simplest way is to use Dask's map_partitions. You need these imports (you will need to pip install dask):

    import pandas as pd
    import dask.dataframe as dd
    from dask.multiprocessing import get

and the syntax is:

    data = <your_pandas_dataframe>
    ddata = dd.from_pandas(data, npartitions=30)

    def myfunc(x, y …
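The snippet above is cut off; here is a minimal runnable sketch of the same map_partitions idea. The two-column frame and the row function are illustrative assumptions rather than the original post's code, and it uses the modern scheduler= keyword in place of the older get= interface imported above.

    import pandas as pd
    import dask.dataframe as dd

    data = pd.DataFrame({'x': range(1000), 'y': range(1000)})  # stand-in dataframe
    ddata = dd.from_pandas(data, npartitions=30)

    def myfunc(x, y):
        return x + y  # illustrative per-row computation

    # Run the pandas row-wise apply inside each partition, in parallel.
    result = ddata.map_partitions(
        lambda df: df.apply(lambda row: myfunc(row.x, row.y), axis=1),
        meta=('result', 'int64'),  # declare the output type so dask can plan the graph
    ).compute(scheduler='processes')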

Convert raster time series of multiple GeoTIFF images to NetCDF

Submitted by ∥☆過路亽.° on 2019-11-27 16:56:26
Question: I have a raster time series stored in multiple GeoTIFF files (*.tif) that I'd like to convert to a single NetCDF file. The data is uint16. I could probably use gdal_translate to convert each image to netcdf using:

    gdal_translate -of netcdf -co FORMAT=NC4 20150520_0164.tif foo.nc

and then some scripting with NCO to extract dates from filenames and then concatenate, but I was wondering whether I might do this more effectively in Python using xarray and its new rasterio backend. I can read a …
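A sketch of the xarray route the question is reaching for, under two stated assumptions: the date is encoded at the start of each filename (as in 20150520_0164.tif), and the 'time' dimension and 'data' variable names are invented here. xr.open_rasterio was xarray's rasterio backend at the time; it has since been superseded by rioxarray.

    import glob
    import pandas as pd
    import xarray as xr

    filenames = sorted(glob.glob('*.tif'))

    # Parse the acquisition date out of names like 20150520_0164.tif.
    times = pd.to_datetime([f.split('_')[0] for f in filenames], format='%Y%m%d')

    # Read each GeoTIFF and stack the arrays along a new time axis.
    da = xr.concat([xr.open_rasterio(f) for f in filenames],
                   dim=pd.Index(times, name='time'))

    da.to_dataset(name='data').to_netcdf('stack.nc')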

how to parallelize many (fuzzy) string comparisons using apply in Pandas?

Submitted by 你说的曾经没有我的故事 on 2019-11-27 11:36:34
I have the following problem. I have a dataframe master that contains sentences, such as:

    master
    Out[8]:
                      original
    0  this is a nice sentence
    1      this is another one
    2    stackoverflow is nice

For every row in master, I look up the best match in another dataframe slave using fuzzywuzzy. I use fuzzywuzzy because the matched sentences between the two dataframes could differ a bit (extra characters, etc.). For instance, slave could be:

    slave
    Out[10]:
       my_value                     name
    0         2              hello world
    1         1          congratulations
    2         2  this is a nice sentence
    3         3      this is another one
    4         1    stackoverflow is nice

Here is a fully …
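A minimal sketch of how this lookup can be parallelized, using fuzzywuzzy's process.extractOne per master row and dask's map_partitions to spread the rows over cores (the best_match helper and the npartitions value are illustrative assumptions):

    import pandas as pd
    import dask.dataframe as dd
    from fuzzywuzzy import process

    master = pd.DataFrame({'original': ['this is a nice sentence',
                                        'this is another one',
                                        'stackoverflow is nice']})
    slave = pd.DataFrame({'my_value': [2, 1, 2, 3, 1],
                          'name': ['hello world', 'congratulations',
                                   'this is a nice sentence',
                                   'this is another one',
                                   'stackoverflow is nice']})

    def best_match(sentence):
        # extractOne returns (best_matching_choice, score).
        match, score = process.extractOne(sentence, slave['name'].tolist())
        return match

    ddf = dd.from_pandas(master, npartitions=4)
    master['best_match'] = ddf['original'].map_partitions(
        lambda s: s.apply(best_match), meta=('best_match', 'object')
    ).compute(scheduler='processes')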

python dask DataFrame, support for (trivially parallelizable) row apply?

Submitted by 给你一囗甜甜゛ on 2019-11-27 11:20:55
I recently found the dask module, which aims to be an easy-to-use Python parallel processing module. A big selling point for me is that it works with pandas. After reading a bit of its manual page, I can't find a way to do this trivially parallelizable task:

    ts.apply(func)          # for a pandas Series
    df.apply(func, axis=1)  # for a pandas DataFrame row apply

At the moment, to achieve this in dask, AFAIK,

    ddf.assign(A=lambda df: df.apply(func, axis=1)).compute()  # dask DataFrame

which is ugly syntax and is actually slower than the straightforward

    df.apply(func, axis=1)  # for a pandas DataFrame row apply

Any suggestion? Edit: Thanks …
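For reference, dask's DataFrame does expose row apply directly; a minimal sketch with the meta hint dask wants (the example frame and func here are assumptions):

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'a': range(10), 'b': range(10)})
    ddf = dd.from_pandas(df, npartitions=2)

    def func(row):
        return row['a'] + row['b']

    # dask supports apply along axis=1; meta declares the output name/dtype
    # so dask can build the task graph without sampling the data.
    result = ddf.apply(func, axis=1, meta=('result', 'int64')).compute()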

Why does Dask perform so much slower while multiprocessing performs so much faster?

Submitted by 感情迁移 on 2019-11-27 07:20:07
Question: To get a better understanding of parallelism, I am comparing a set of different pieces of code. Here is the basic one (code_piece_1), a plain for loop:

    import time

    # setup
    problem_size = 1e7
    items = range(9)

    # serial
    def counter(num=0):
        junk = 0
        for i in range(int(problem_size)):
            junk += 1
            junk -= 1
        return num

    def sum_list(args):
        print("sum_list fn:", args)
        return sum(args)

    start = time.time()
    summed = sum_list([counter(i) for i in items])
    print(summed)
    print('for loop {}s'.format(time.time() - start))
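For comparison, a sketch of handing the same work to dask with a process-based scheduler. counter is pure-Python CPU work, so dask's default thread pool would be throttled by the GIL, which is a common reason a naive dask run looks slower than multiprocessing; the dask code from the original post is truncated, so this version is an assumption, not the poster's.

    import time
    import dask

    problem_size = 1e7
    items = range(9)

    def counter(num=0):
        junk = 0
        for i in range(int(problem_size)):
            junk += 1
            junk -= 1
        return num

    start = time.time()
    # One delayed task per item, executed on a process pool so the GIL
    # does not serialize the pure-Python loops.
    tasks = [dask.delayed(counter)(i) for i in items]
    summed = sum(dask.compute(*tasks, scheduler='processes'))
    print(summed)
    print('dask delayed {}s'.format(time.time() - start))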
