dask

filtering with dask read_parquet method gives unwanted results

三世轮回 submitted on 2019-12-06 12:57:24
I am trying to read Parquet files using the Dask `read_parquet` method and the `filters` kwarg; however, it sometimes doesn't filter according to the given condition. Example: creating and saving a data frame with a dates column:

```python
import pandas as pd
import numpy as np
import dask.dataframe as dd

nums = range(1, 6)
dates = pd.date_range('2018-07-01', periods=5, freq='1d')
df = pd.DataFrame({'dates': dates, 'nums': nums})
ddf = dd.from_pandas(df, npartitions=3).to_parquet('test_par', engine='fastparquet')
```

When I read back from the 'test_par' folder and filter on the dates column, it doesn't seem to work.
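Below is a minimal sketch of how the `filters` kwarg is typically combined with an explicit row-level mask (the cutoff date is illustrative). `filters` generally only prunes whole partitions/row-groups using their min/max statistics, so a second in-memory filter is usually still needed:

```python
import pandas as pd
import dask.dataframe as dd

# Assumes the 'test_par' dataset written above exists on disk.
cutoff = pd.Timestamp('2018-07-03')

# Partition/row-group level pruning only -- rows inside a surviving
# partition are NOT removed by this kwarg.
ddf = dd.read_parquet('test_par', engine='fastparquet',
                      filters=[('dates', '>', cutoff)])

# Explicit row-level filter on top of the pruning
ddf = ddf[ddf.dates > cutoff]
print(ddf.compute())
```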

Slicing out a few rows from a `dask.DataFrame`

て烟熏妆下的殇ゞ submitted on 2019-12-06 10:51:14
Often, when working with a large `dask.DataFrame`, it would be useful to grab only a few rows on which to test all subsequent operations. Currently, according to Slicing a Dask Dataframe, this is unsupported. I was hoping to use `head` to achieve the same (since that command is supported), but it returns a regular pandas DataFrame. I also tried `df[:1000]`, which executes, but generates output different from what you'd expect from pandas. Is there any way to grab the first 1000 rows of a `dask.DataFrame`?

If your dataframe has a sensibly partitioned index then I recommend using `.loc`
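A minimal sketch of both routes, with illustrative data (the `npartitions=-1` flag is what lets `head` look past the first partition):

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': range(10_000)})
ddf = dd.from_pandas(df, npartitions=8)   # sorted RangeIndex -> known divisions

# .loc slicing keeps the result lazy (still a dask.DataFrame)
first_1000_lazy = ddf.loc[:999]

# head() returns a pandas DataFrame; by default it only looks at the first
# partition, but npartitions=-1 lets it pull from as many as it needs
first_1000_pd = ddf.head(1000, npartitions=-1)
```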

Dask.dataframe or Alternative: Scalable way of dropping rows of low frequency items

醉酒当歌 submitted on 2019-12-06 08:53:50
I am looking for a way to remove rows from a dataframe that contain low-frequency items. I adapted the following snippet from this post:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)), columns=['A', 'B'])

threshold = 10  # Anything that occurs less than this will be removed.
value_counts = df.stack().value_counts()  # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
```

The problem is that this code does not seem to scale. The line `to_remove = value_counts[value_counts <=`
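For scale, one possibility (my sketch, mirroring the snippet above with `dask.dataframe`, not the post's answer) is to compute the value counts lazily and avoid materialising the full `stack()`:

```python
import pandas as pd
import numpy as np
import dask.dataframe as dd

df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)), columns=['A', 'B'])
ddf = dd.from_pandas(df, npartitions=4)

threshold = 10
# Count occurrences across both columns without building df.stack() in memory
counts = dd.concat([ddf['A'], ddf['B']]).value_counts().compute()
to_remove = counts[counts <= threshold].index.tolist()

# Elementwise replace runs partition by partition
ddf = ddf.replace(to_remove, np.nan)
```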

Getting year and week from a datetime series in a dask dataframe?

随声附和 submitted on 2019-12-06 06:06:10
If I have a pandas dataframe with a datetime column, I can get the year as follows: `df['year'] = df['date'].dt.year`. With a dask dataframe, that does not work. If I compute first, like this: `df['year'] = df['date'].compute().dt.year`, I get `ValueError: Not all divisions are known, can't align partitions. Please use set_index or set_partition to set the index.` But `df['date'].head().dt.year` works fine! So how do I get the year (or week) of a datetime series in a dask dataframe?

The `.dt` datetime namespace is present on Dask series objects. Here is a self-contained example of its use:
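(A minimal sketch; the column name and sample data are illustrative.)

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'date': pd.date_range('2018-01-01', periods=10, freq='D')})
ddf = dd.from_pandas(df, npartitions=2)

# .dt works directly on the lazy dask Series -- no .compute() needed first
ddf['year'] = ddf['date'].dt.year
ddf['week'] = ddf['date'].dt.week  # newer pandas deprecates this in favor of .dt.isocalendar().week

print(ddf.compute())
```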

combining tqdm with delayed execution with dask in python

橙三吉。 submitted on 2019-12-06 02:40:12
Question: tqdm and dask are both amazing packages for iteration in Python. tqdm implements the needed progress bar, while dask implements the multi-thread platform, and both can make the iteration process less frustrating. Yet I'm having trouble combining the two. For example, the following code implements a delayed execution in dask with a `tqdm.trange` progress bar. The thing is that since the delayed calls return quickly, the progress bar ends immediately, while the real computation
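The question is cut off here, but a common pattern (my sketch, not the asker's code) is to track the actual `compute()` call rather than the loop that merely builds the delayed objects:

```python
import time
import dask
from dask.diagnostics import ProgressBar

@dask.delayed
def work(i):
    time.sleep(0.1)
    return i * i

# Building the delayed objects is instantaneous, so a tqdm bar around this
# loop would finish immediately -- the real work happens inside compute().
tasks = [work(i) for i in range(50)]

# dask's own ProgressBar tracks the computation itself (local schedulers);
# newer tqdm releases also ship a tqdm.dask.TqdmCallback serving the same role.
with ProgressBar():
    results = dask.compute(*tasks)
```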

Python PANDAS: Converting from pandas/numpy to dask dataframe/array

家住魔仙堡 submitted on 2019-12-06 01:44:54
Question: I am working to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting: Python PANDAS: Stack by Enumerated Date to Create Records Vectorized.

```python
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO

test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''

df_test = pd.read_csv(StringIO(test_data))  # call truncated in the original; this completion is implied by the StringIO import
```
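The question body is cut off, but for the conversion step itself a minimal sketch (assuming the `df_test` frame built above; the linked question's actual transformation is not reproduced here) looks like this:

```python
import dask.dataframe as dd

# Wrap the in-memory pandas frame as a dask DataFrame
ddf_test = dd.from_pandas(df_test, npartitions=2)

# A dask.array view of the numeric columns, if the array API is needed
arr = ddf_test[['units', 'measures']].to_dask_array(lengths=True)

# pandas-style operations now run lazily, partition by partition
print(ddf_test.groupby('id')['units'].sum().compute())
```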

How to create Dask DataFrame from a list of urls?

柔情痞子 submitted on 2019-12-05 20:50:50
I have a list of URLs, and I'd love to read them into a dask data frame at once, but it looks like `read_csv` can't use an asterisk for http. Is there any way to achieve that? Here is an example:

```python
link = 'http://web.mta.info/developers/'
data = [
    'data/nyct/turnstile/turnstile_170128.txt',
    'data/nyct/turnstile/turnstile_170121.txt',
    'data/nyct/turnstile/turnstile_170114.txt',
    'data/nyct/turnstile/turnstile_170107.txt',
]
```

and what I want is `df = dd.read_csv('XXXX*X')`.

Try using `dask.delayed` to turn each of your URLs into a lazy pandas dataframe, and then use `dask.dataframe.from_delayed` to turn those into a single dask dataframe.
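A minimal sketch of that suggestion, using the `link` and `data` lists above:

```python
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def load(url):
    return pd.read_csv(url)

parts = [load(link + path) for path in data]

# Without meta=, dask computes the first partition just to infer column types;
# pass meta= (an empty pandas frame with the right dtypes) to avoid that.
df = dd.from_delayed(parts)

# Newer dask/fsspec versions can also take an explicit list of URLs directly:
# df = dd.read_csv([link + path for path in data])
```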

How do I run a dask.distributed cluster in a single thread?

橙三吉。 submitted on 2019-12-05 16:09:29
Question: How can I run a complete Dask.distributed cluster in a single thread? I want to use this for debugging or profiling. Note: this is a frequently asked question. I'm adding the question and answer here to Stack Overflow just for future reuse.

Answer 1:

Local Scheduler: If you can get by with the single-machine scheduler's API (just `compute`), then you can use the single-threaded scheduler: `x.compute(scheduler='single-threaded')`

Distributed Scheduler - Single Machine: If you want to run a dask.distributed
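The answer text is cut off above; the sketch below shows the single-threaded scheduler from the quoted part, plus one way (my assumption, not the original answer) to keep a dask.distributed Client inside a single process for debugging:

```python
import dask
from dask.distributed import Client

x = dask.delayed(sum)([1, 2, 3])

# Single-machine scheduler, fully synchronous -- easiest for pdb/profilers
x.compute(scheduler='single-threaded')   # 'sync' is an equivalent alias

# Assumption: an in-process distributed client with one single-threaded worker,
# so the whole cluster lives in the current process
client = Client(processes=False, n_workers=1, threads_per_worker=1)
print(x.compute())
client.close()
```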

Sorting in Dask

≯℡__Kan透↙ submitted on 2019-12-05 13:38:46
I want to find an alternative to the `pandas.DataFrame.sort_values` function in dask. I came across `set_index`, but it sorts on a single column. How can I sort multiple columns of a Dask data frame?

So far Dask does not seem to support sorting by multiple columns. However, making a new column that concatenates the values of the sort columns may be a usable work-around:

```python
d['new_column'] = d.apply(lambda r: str([r.col1, r.col2]), axis=1)
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())
```

Edit: The above works if you want to sort by two strings. I recommend creating integer
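A self-contained, runnable version of that work-around, with illustrative column names and a `meta=` hint added so `apply` does not warn about dtype inference (newer dask releases have since added a `DataFrame.sort_values` method, which may remove the need for this trick):

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'col1': ['b', 'a', 'b', 'a'], 'col2': ['2', '1', '1', '2']})
d = dd.from_pandas(df, npartitions=2)

# Build a combined sort key, set it as the (globally sorted) index,
# then sort rows within each partition
d['new_column'] = d.apply(lambda r: str([r.col1, r.col2]), axis=1,
                          meta=('new_column', 'object'))
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())

print(d.compute())
```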

Managing worker memory on a dask localcluster

限于喜欢 submitted on 2019-12-05 12:17:21
I am trying to load a dataset with dask, but when it is time to compute my dataset I keep getting problems like this: `WARNING - Worker exceeded 95% memory budget. Restarting.` I am just working on my local machine, initiating dask as follows:

```python
if __name__ == '__main__':
    libmarket.config.client = Client()  # use dask.distributed by default
```

Now in my error messages I keep seeing a reference to a `memory_limit=` keyword parameter. However, I've searched the dask documentation thoroughly and I can't figure out how to increase the bloody worker memory-limit in a single-machine configuration. I have
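The question is cut off, but the `memory_limit=` keyword it mentions can be passed straight to `Client` (which forwards it to the implicit `LocalCluster`) on a single machine; the numbers below are illustrative:

```python
from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    # memory_limit is the budget *per worker*, not for the whole cluster
    client = Client(n_workers=2, threads_per_worker=2, memory_limit='4GB')

    # equivalent, with the cluster built explicitly:
    # cluster = LocalCluster(n_workers=2, memory_limit='4GB')
    # client = Client(cluster)
```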