dask

filtering with dask read_parquet method gives unwanted results

三世轮回 submitted on 2019-12-06 12:57:24
I am trying to read Parquet files using the Dask `read_parquet` method and the `filters` kwarg; however, it sometimes doesn't filter according to the given condition. Example: creating and saving a data frame with a dates column:

```python
import pandas as pd
import numpy as np
import dask.dataframe as dd

nums = range(1, 6)
dates = pd.date_range('2018-07-01', periods=5, freq='1d')
df = pd.DataFrame({'dates': dates, 'nums': nums})
ddf = dd.from_pandas(df, npartitions=3).to_parquet('test_par', engine='fastparquet')
```

When I read back from the 'test_par' folder and filter on the dates column, it doesn't seem to work.
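Below is a minimal sketch of how the `filters` kwarg is typically combined with an explicit row-level mask (the cutoff date is illustrative). `filters` generally only prunes whole partitions/row-groups using their min/max statistics, so a second in-memory filter is usually still needed:

```python
import pandas as pd
import dask.dataframe as dd

# Assumes the 'test_par' dataset written above exists on disk.
cutoff = pd.Timestamp('2018-07-03')

# Partition/row-group level pruning only -- rows inside a surviving
# partition are NOT removed by this kwarg.
ddf = dd.read_parquet('test_par', engine='fastparquet',
                      filters=[('dates', '>', cutoff)])

# Explicit row-level filter on top of the pruning
ddf = ddf[ddf.dates > cutoff]
print(ddf.compute())
```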

Slicing out a few rows from a `dask.DataFrame`

て烟熏妆下的殇ゞ submitted on 2019-12-06 10:51:14
Often, when working with a large `dask.DataFrame`, it would be useful to grab only a few rows on which to test all subsequent operations. Currently, according to Slicing a Dask Dataframe, this is unsupported. I was hoping to use `head` to achieve the same (since that command is supported), but it returns a regular pandas DataFrame. I also tried `df[:1000]`, which executes, but generates output different from what you'd expect from pandas. Is there any way to grab the first 1000 rows of a `dask.DataFrame`?

If your dataframe has a sensibly partitioned index then I recommend using `.loc`
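A minimal sketch of both routes, with illustrative data (the `npartitions=-1` flag is what lets `head` look past the first partition):

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': range(10_000)})
ddf = dd.from_pandas(df, npartitions=8)   # sorted RangeIndex -> known divisions

# .loc slicing keeps the result lazy (still a dask.DataFrame)
first_1000_lazy = ddf.loc[:999]

# head() returns a pandas DataFrame; by default it only looks at the first
# partition, but npartitions=-1 lets it pull from as many as it needs
first_1000_pd = ddf.head(1000, npartitions=-1)
```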

Dask.dataframe or Alternative: Scalable way of dropping rows of low frequency items

醉酒当歌 submitted on 2019-12-06 08:53:50
I am looking for a way to remove rows from a dataframe that contain low-frequency items. I adapted the following snippet from this post:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)), columns=['A', 'B'])

threshold = 10  # Anything that occurs less than this will be removed.
value_counts = df.stack().value_counts()  # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
```

The problem is that this code does not seem to scale. The line `to_remove = value_counts[value_counts <=`
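For scale, one possibility (my sketch, mirroring the snippet above with `dask.dataframe`, not the post's answer) is to compute the value counts lazily and avoid materialising the full `stack()`:

```python
import pandas as pd
import numpy as np
import dask.dataframe as dd

df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)), columns=['A', 'B'])
ddf = dd.from_pandas(df, npartitions=4)

threshold = 10
# Count occurrences across both columns without building df.stack() in memory
counts = dd.concat([ddf['A'], ddf['B']]).value_counts().compute()
to_remove = counts[counts <= threshold].index.tolist()

# Elementwise replace runs partition by partition
ddf = ddf.replace(to_remove, np.nan)
```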

Getting year and week from a datetime series in a dask dataframe?

随声附和 submitted on 2019-12-06 06:06:10
If I have a pandas dataframe with a datetime column, I can get the year as follows: `df['year'] = df['date'].dt.year`. With a dask dataframe, that does not work. If I compute first, like this: `df['year'] = df['date'].compute().dt.year`, I get `ValueError: Not all divisions are known, can't align partitions. Please use set_index or set_partition to set the index.` But `df['date'].head().dt.year` works fine! So how do I get the year (or week) of a datetime series in a dask dataframe?

The `.dt` datetime namespace is present on Dask series objects. Here is a self-contained example of its use:
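(A minimal sketch; the column name and sample data are illustrative.)

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'date': pd.date_range('2018-01-01', periods=10, freq='D')})
ddf = dd.from_pandas(df, npartitions=2)

# .dt works directly on the lazy dask Series -- no .compute() needed first
ddf['year'] = ddf['date'].dt.year
ddf['week'] = ddf['date'].dt.week  # newer pandas deprecates this in favor of .dt.isocalendar().week

print(ddf.compute())
```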

combining tqdm with delayed execution with dask in python

橙三吉。 submitted on 2019-12-06 02:40:12
Question: tqdm and dask are both amazing packages for iteration in Python. tqdm implements the needed progress bar, while dask implements the multi-thread platform, and both can make the iteration process less frustrating. Yet I'm having trouble combining the two. For example, the following code implements a delayed execution in dask with a `tqdm.trange` progress bar. The thing is that since the delayed calls return quickly, the progress bar ends immediately, while the real computation
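The question is cut off here, but a common pattern (my sketch, not the asker's code) is to track the actual `compute()` call rather than the loop that merely builds the delayed objects:

```python
import time
import dask
from dask.diagnostics import ProgressBar

@dask.delayed
def work(i):
    time.sleep(0.1)
    return i * i

# Building the delayed objects is instantaneous, so a tqdm bar around this
# loop would finish immediately -- the real work happens inside compute().
tasks = [work(i) for i in range(50)]

# dask's own ProgressBar tracks the computation itself (local schedulers);
# newer tqdm releases also ship a tqdm.dask.TqdmCallback serving the same role.
with ProgressBar():
    results = dask.compute(*tasks)
```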

Python PANDAS: Converting from pandas/numpy to dask dataframe/array

家住魔仙堡 submitted on 2019-12-06 01:44:54
Question: I am working to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting: Python PANDAS: Stack by Enumerated Date to Create Records Vectorized.

```python
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO

test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''

df_test = pd.read_csv(StringIO(test_data))  # call truncated in the original; this completion is implied by the StringIO import
```
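The question body is cut off, but for the conversion step itself a minimal sketch (assuming the `df_test` frame built above; the linked question's actual transformation is not reproduced here) looks like this:

```python
import dask.dataframe as dd

# Wrap the in-memory pandas frame as a dask DataFrame
ddf_test = dd.from_pandas(df_test, npartitions=2)

# A dask.array view of the numeric columns, if the array API is needed
arr = ddf_test[['units', 'measures']].to_dask_array(lengths=True)

# pandas-style operations now run lazily, partition by partition
print(ddf_test.groupby('id')['units'].sum().compute())
```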

How to create Dask DataFrame from a list of urls?

柔情痞子 submitted on 2019-12-05 20:50:50
I have a list of URLs, and I'd love to read them into a dask data frame at once, but it looks like `read_csv` can't use an asterisk for http. Is there any way to achieve that? Here is an example:

```python
link = 'http://web.mta.info/developers/'
data = [
    'data/nyct/turnstile/turnstile_170128.txt',
    'data/nyct/turnstile/turnstile_170121.txt',
    'data/nyct/turnstile/turnstile_170114.txt',
    'data/nyct/turnstile/turnstile_170107.txt',
]
```

and what I want is `df = dd.read_csv('XXXX*X')`.

Try using `dask.delayed` to turn each of your URLs into a lazy pandas dataframe, and then use `dask.dataframe.from_delayed` to turn those into a single dask dataframe.
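A minimal sketch of that suggestion, using the `link` and `data` lists above:

```python
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def load(url):
    return pd.read_csv(url)

parts = [load(link + path) for path in data]

# Without meta=, dask computes the first partition just to infer column types;
# pass meta= (an empty pandas frame with the right dtypes) to avoid that.
df = dd.from_delayed(parts)

# Newer dask/fsspec versions can also take an explicit list of URLs directly:
# df = dd.read_csv([link + path for path in data])
```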

How do I run a dask.distributed cluster in a single thread?

橙三吉。 submitted on 2019-12-05 16:09:29
Question: How can I run a complete Dask.distributed cluster in a single thread? I want to use this for debugging or profiling. Note: this is a frequently asked question. I'm adding the question and answer here to Stack Overflow just for future reuse.

Answer 1:

Local Scheduler: If you can get by with the single-machine scheduler's API (just `compute`), then you can use the single-threaded scheduler: `x.compute(scheduler='single-threaded')`

Distributed Scheduler - Single Machine: If you want to run a dask.distributed
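The answer text is cut off above; the sketch below shows the single-threaded scheduler from the quoted part, plus one way (my assumption, not the original answer) to keep a dask.distributed Client inside a single process for debugging:

```python
import dask
from dask.distributed import Client

x = dask.delayed(sum)([1, 2, 3])

# Single-machine scheduler, fully synchronous -- easiest for pdb/profilers
x.compute(scheduler='single-threaded')   # 'sync' is an equivalent alias

# Assumption: an in-process distributed client with one single-threaded worker,
# so the whole cluster lives in the current process
client = Client(processes=False, n_workers=1, threads_per_worker=1)
print(x.compute())
client.close()
```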

Sorting in Dask

≯℡__Kan透↙ submitted on 2019-12-05 13:38:46
I want to find an alternative to the `pandas.DataFrame.sort_values` function in dask. I came across `set_index`, but it sorts on a single column. How can I sort multiple columns of a Dask data frame?

So far Dask does not seem to support sorting by multiple columns. However, making a new column that concatenates the values of the sort columns may be a usable work-around:

```python
d['new_column'] = d.apply(lambda r: str([r.col1, r.col2]), axis=1)
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())
```

Edit: The above works if you want to sort by two strings. I recommend creating integer
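A self-contained, runnable version of that work-around, with illustrative column names and a `meta=` hint added so `apply` does not warn about dtype inference (newer dask releases have since added a `DataFrame.sort_values` method, which may remove the need for this trick):

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'col1': ['b', 'a', 'b', 'a'], 'col2': ['2', '1', '1', '2']})
d = dd.from_pandas(df, npartitions=2)

# Build a combined sort key, set it as the (globally sorted) index,
# then sort rows within each partition
d['new_column'] = d.apply(lambda r: str([r.col1, r.col2]), axis=1,
                          meta=('new_column', 'object'))
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())

print(d.compute())
```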

Managing worker memory on a dask localcluster

限于喜欢 submitted on 2019-12-05 12:17:21
I am trying to load a dataset with dask, but when it is time to compute my dataset I keep getting problems like this: `WARNING - Worker exceeded 95% memory budget. Restarting.` I am just working on my local machine, initiating dask as follows:

```python
if __name__ == '__main__':
    libmarket.config.client = Client()  # use dask.distributed by default
```

Now in my error messages I keep seeing a reference to a `memory_limit=` keyword parameter. However, I've searched the dask documentation thoroughly and I can't figure out how to increase the bloody worker memory-limit in a single-machine configuration. I have
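The question is cut off, but the `memory_limit=` keyword it mentions can be passed straight to `Client` (which forwards it to the implicit `LocalCluster`) on a single machine; the numbers below are illustrative:

```python
from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    # memory_limit is the budget *per worker*, not for the whole cluster
    client = Client(n_workers=2, threads_per_worker=2, memory_limit='4GB')

    # equivalent, with the cluster built explicitly:
    # cluster = LocalCluster(n_workers=2, memory_limit='4GB')
    # client = Client(cluster)
```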