dask

Does Dask communicate with HDFS to optimize for data locality?

◇◆丶佛笑我妖孽 submitted on 2019-12-08 05:06:49
Question: The Dask distributed documentation contains the following information: "For example, Dask developers use this ability to build in data locality when we communicate to data-local storage systems like the Hadoop File System. When users use high-level functions like dask.dataframe.read_csv('hdfs:///path/to/files.*.csv'), Dask talks to the HDFS name node, finds the locations of all of the blocks of data, and sends that information to the scheduler so that it can make smarter decisions and improve …"
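
A minimal sketch of what such a locality-aware read looks like from the user's side; the scheduler address, HDFS path, and column name below are placeholders, not taken from the original question:

    import dask.dataframe as dd
    from dask.distributed import Client

    # Connect to the distributed scheduler (address is illustrative).
    client = Client("scheduler-address:8786")

    # Behind the scenes Dask asks the HDFS namenode where each block lives,
    # so the scheduler can prefer workers that already hold a partition's data.
    df = dd.read_csv("hdfs:///path/to/files.*.csv")
    result = df.groupby("some_column").size().compute()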

File Not Found Error in Dask program run on cluster

与世无争的帅哥 submitted on 2019-12-08 04:51:12
Question: I have 4 machines: M1, M2, M3, and M4. The scheduler, client, and a worker run on M1, and I've put a CSV file on M1. The rest of the machines are workers. When I run the program and call read_csv in Dask, it gives me a file-not-found error.

Answer 1: When one of your workers tries to load the CSV, it will not be able to find it, because it is not present on that local disc. This should not be a surprise. You can get around this in a number of ways: copy the file to every worker; this is obviously wasteful in …
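
A hedged sketch of two of those workarounds; the paths, scheduler address, and partition count are placeholders:

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client("M1:8786")  # scheduler address is illustrative

    # Option A: put the file on storage every worker can reach
    # (e.g. an NFS mount) and read it from that shared path.
    ddf = dd.read_csv("/shared/data/file.csv")

    # Option B: load the file only on the client machine (M1) and hand the
    # partitions to the cluster, so no worker needs local access to the file.
    local_df = pd.read_csv("/home/user/file.csv")
    ddf = dd.from_pandas(local_df, npartitions=4)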

calculating cross-correlation function in xarray

谁都会走 submitted on 2019-12-08 04:50:00
Question: I have a dataset res_1 with

    Dimensions:    (space: 726, time: 579)
    Coordinates:
      * space        (space) MultiIndex
        - latitude   (space) float64 -90.0 -82.5 -82.5 -82.5 -82.5 -82.5 -82.5 ...
        - longitude  (space) float64 0.0 0.0 60.0 120.0 180.0 240.0 300.0 0.0 30.0 ...
      * time         (time) datetime64[ns] 1980-06-01 1980-06-02 1980-06-03 ...
    Data variables:
        mx2t         (time, space) float64 -1.768 -0.6035 -1.286 -1.291 1.144 ...
        dayofyear    (time) int64 153 154 155 156 157 158 159 160 161 162 163 164 ...

The space variable …
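
One hedged way to approach this, assuming a reasonably recent xarray (which provides xr.corr) and using the variable and dimension names from the printout above; the reference series and lag are arbitrary examples:

    import xarray as xr

    da = res_1["mx2t"]

    # Zero-lag correlation of every spatial point against one reference series.
    reference = da.isel(space=0)
    corr0 = xr.corr(da, reference, dim="time")

    # A simple lagged cross-correlation: shift one series along time
    # (lag is measured in time steps) and correlate again.
    lag = 3
    corr_lag = xr.corr(da.shift(time=lag), reference, dim="time")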

Convert a column in a dask dataframe to a TaggedDocument for Doc2Vec

那年仲夏 submitted on 2019-12-08 03:47:29
Question: Intro: Currently I am trying to use Dask in concert with gensim to do NLP document computation, and I'm running into an issue when converting my corpus into a TaggedDocument. Because I've tried so many different ways to wrangle this problem, I'll list my attempts; each attempt at dealing with this problem is met with slightly different woes. First, some initial givens. The data:

    df.info()
    <class 'dask.dataframe.core.DataFrame'>
    Columns: 5 entries, claim_no to litigation
    dtypes: object(2), int64 …
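
A sketch of one way to build a tagged corpus from a Dask dataframe; the "text" column name is a guess (only claim_no and litigation are visible in the excerpt), and Doc2Vec itself would still train over the gathered list on a single machine:

    from gensim.models.doc2vec import TaggedDocument

    def partition_to_tagged_docs(pdf):
        # Build one TaggedDocument per row of a pandas partition.
        return [
            TaggedDocument(words=str(text).split(), tags=[claim_no])
            for claim_no, text in zip(pdf["claim_no"], pdf["text"])
        ]

    # Materialise partitions one at a time and collect the tagged documents.
    tagged_docs = [
        doc
        for part in df.to_delayed()
        for doc in partition_to_tagged_docs(part.compute())
    ]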

filtering with dask read_parquet method gives unwanted results

独自空忆成欢 submitted on 2019-12-08 01:55:02
Question: I am trying to read Parquet files using the Dask read_parquet method and the filters kwarg. However, it sometimes doesn't filter according to the given condition. Example: creating and saving a data frame with a dates column:

    import pandas as pd
    import numpy as np
    import dask.dataframe as dd

    nums = range(1, 6)
    dates = pd.date_range('2018-07-01', periods=5, freq='1d')
    df = pd.DataFrame({'dates': dates, 'nums': nums})
    ddf = dd.from_pandas(df, npartitions=3).to_parquet('test_par', engine = …
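
A sketch of how the filters kwarg is typically used and why results can look unfiltered: the predicate prunes whole row groups or partitions rather than individual rows, so re-applying the condition on the loaded frame is a safe follow-up (the cut-off date below is illustrative):

    import pandas as pd
    import dask.dataframe as dd

    cutoff = pd.Timestamp("2018-07-03")

    # Predicate push-down: this only skips row groups whose statistics show
    # they contain no matching rows; rows inside partially matching groups
    # are still loaded.
    ddf = dd.read_parquet("test_par", filters=[("dates", ">", cutoff)])

    # Re-apply the condition row by row to keep exactly the wanted rows.
    ddf = ddf[ddf.dates > cutoff]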

Dask Dataframe split column of list into multiple columns

匆匆过客 submitted on 2019-12-07 21:15:30
Question: The same task in pandas can be easily done with

    import pandas as pd
    df = pd.DataFrame({"lists": [[i, i+1] for i in range(10)]})
    df[['left', 'right']] = pd.DataFrame([x for x in df.lists])

but I can't figure out how to do something similar with a dask.dataframe.

Update: So far I found this workaround:

    ddf = dd.from_pandas(df, npartitions=2)
    ddf["left"] = ddf.apply(lambda x: x["lists"][0], axis=1, meta=pd.Series())
    ddf["right"] = ddf.apply(lambda x: x["lists"][1], axis=1, meta=pd.Series())

I'm …
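
A sketch of a map_partitions-based alternative that expands the list column with plain pandas inside each partition; the meta frame spells out the resulting schema so Dask does not have to guess:

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({"lists": [[i, i + 1] for i in range(10)]})
    ddf = dd.from_pandas(df, npartitions=2)

    def split_lists(pdf):
        # Expand the list column into two ordinary columns within a partition.
        expanded = pd.DataFrame(pdf["lists"].tolist(),
                                columns=["left", "right"],
                                index=pdf.index)
        return pdf.join(expanded)

    meta = pd.DataFrame({"lists": pd.Series([], dtype=object),
                         "left": pd.Series([], dtype="int64"),
                         "right": pd.Series([], dtype="int64")})
    ddf2 = ddf.map_partitions(split_lists, meta=meta)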

Locking in dask.multiprocessing.get and adding metadata to HDF

帅比萌擦擦* submitted on 2019-12-07 17:40:24
Question: Performing an ETL task in pure Python, I would like to collect error metrics as well as metadata for each of the raw input files considered (error metrics are computed from error codes provided in the data section of the files, while metadata is stored in headers). Here's pseudo-code for the whole procedure:

    import pandas as pd
    import dask
    from dask import delayed
    from dask import dataframe as dd

    META_DATA = {}  # shared resource
    ERRORS = {}     # shared resource

    def read_file(file_name):
        global …
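
The pseudo-code mutates module-level dicts from worker processes, which does not propagate back to the parent when using the multiprocessing scheduler. A hedged sketch of the usual alternative is to return the metadata and error metrics from each delayed task and assemble them on the client; the file names and the return shape here are placeholders:

    import dask
    from dask import delayed

    @delayed
    def read_file(file_name):
        # Parse the header and data section here; this stub only shows the
        # shape of the return value (metadata, error metrics, payload).
        meta = {"file": file_name}
        errors = {"file": file_name, "error_count": 0}
        data = None
        return meta, errors, data

    results = dask.compute(*[read_file(f) for f in ["a.raw", "b.raw"]])
    META_DATA = {r[0]["file"]: r[0] for r in results}
    ERRORS = {r[1]["file"]: r[1] for r in results}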

Shifting all rows in dask dataframe

余生长醉 submitted on 2019-12-07 13:33:39
Question: In pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get similar behaviour working with Dask. I realise things like row shifts may be difficult to manage with Dask's chunked system, but I don't know of a better way to compare each row with the subsequent one. What I'd like to be able to do is this:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    …
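
A hedged sketch of one way to get a cross-partition shift, using map_overlap so each partition also sees the last row of its predecessor; the column name and shift size are arbitrary:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=3)

    # map_overlap copies `before` rows from the previous partition into each
    # partition, so a shift of one row stays correct across partition edges.
    shifted = ddf.map_overlap(lambda d: d.shift(1), before=1, after=0)

    # Example: compare every row with the one before it.
    diff = (ddf["x"] - shifted["x"]).compute()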

Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?

倖福魔咒の submitted on 2019-12-07 10:42:27
Today I began using the Dask and Paramiko packages, partly as a learning exercise and partly because I'm beginning a project that will require dealing with large datasets (tens of GB) that must be accessed from a remote VM only (i.e. they cannot be stored locally). I have login credentials and sudo rights on this VM. I have minimal data-analytics experience, and no experience working with datasets more than a few thousand rows in size. The following piece of code belongs to a short helper program that will make a Dask dataframe of a large CSV file hosted on the VM. I want to later pass its output …
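
Two hedged sketches of how this is commonly wired up; the host name, credentials, remote path, and partition count are placeholders, and option A assumes fsspec's sftp backend (which itself uses Paramiko) is installed:

    import dask.dataframe as dd

    # Option A: let Dask/fsspec open the remote file over SFTP directly.
    ddf = dd.read_csv(
        "sftp://remote-vm/home/user/data/big_file.csv",
        storage_options={"username": "user", "password": "secret"},
    )

    # Option B: open the file with Paramiko yourself and read it with pandas,
    # then convert; simpler, but the whole file flows through one machine.
    import paramiko
    import pandas as pd

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect("remote-vm", username="user", password="secret")
    with ssh.open_sftp().open("/home/user/data/big_file.csv") as remote_file:
        pdf = pd.read_csv(remote_file)
    ddf = dd.from_pandas(pdf, npartitions=8)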

Sorting in Dask

帅比萌擦擦* submitted on 2019-12-07 10:05:41
Question: I want to find an alternative to the pandas DataFrame.sort_values function in Dask. I came across set_index, but it sorts on a single column. How can I sort on multiple columns of a Dask data frame?

Answer 1: So far Dask does not seem to support sorting by multiple columns. However, making a new column that concatenates the values of the sort columns may be a usable workaround:

    d['new_column'] = d.apply(lambda r: str([r.col1, r.col2]), axis=1)
    d = d.set_index('new_column')
    d = d.map_partitions …
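
A sketch of the complete workaround under those assumptions (col1 and col2 come from the snippet above); note that string concatenation gives lexicographic order, so numeric keys may need zero-padding or formatting if exact numeric order matters:

    # Combine the sort keys into one sortable string column.
    d['new_column'] = d.apply(lambda r: str([r.col1, r.col2]), axis=1,
                              meta=('new_column', 'object'))
    # Setting the index sorts the frame by that combined key.
    d = d.set_index('new_column')
    # Optionally drop the helper key from the index again afterwards.
    d = d.map_partitions(lambda pdf: pdf.reset_index(drop=True))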