dask

Only a column name can be used for the key in a dtype mappings argument

Submitted by 混江龙づ霸主 on 2020-01-14 05:38:26

Question: I've successfully brought in one table using Dask's read_sql_table from an Oracle database. However, when I try to bring in another table I get this error: KeyError: 'Only a column name can be used for the key in a dtype mappings argument.' I've checked my connection string and schema, and all of that is fine. I know the table exists, and the column I'm trying to use as an index is a primary key on the table in the Oracle database. Can someone please explain why this error occurs when the ...
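A minimal sketch of the call in question, assuming an SQLAlchemy-style Oracle URI and a primary-key column named ID (both hypothetical). In practice this KeyError often shows up when the index_col or dtype keys do not match the column names exactly as the database reports them (Oracle frequently returns upper-case names), so inspecting the reflected columns first can narrow it down; the read_sql_table signature shown matches Dask releases of that era.

    import dask.dataframe as dd
    import sqlalchemy as sa

    # Hypothetical connection string, table, and column names, for illustration only.
    uri = "oracle+cx_oracle://user:password@host:1521/?service_name=XEPDB1"

    # Check the column names the database actually reports; a case mismatch
    # (e.g. 'id' vs 'ID') between these and index_col/dtype keys is a common
    # trigger for the "Only a column name can be used for the key ..." error.
    engine = sa.create_engine(uri)
    print(sa.inspect(engine).get_columns("MY_TABLE"))

    # Read the table, partitioning on the primary-key column.
    ddf = dd.read_sql_table("MY_TABLE", uri, index_col="ID", npartitions=8)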

How to create Dask DataFrame from a list of urls?

Submitted by 人盡茶涼 on 2020-01-13 13:12:38

Question: I have a list of URLs, and I'd like to read them into a Dask dataframe all at once, but it looks like read_csv can't use an asterisk for http. Is there any way to achieve that? Here is an example:

    link = 'http://web.mta.info/developers/'
    data = [
        'data/nyct/turnstile/turnstile_170128.txt',
        'data/nyct/turnstile/turnstile_170121.txt',
        'data/nyct/turnstile/turnstile_170114.txt',
        'data/nyct/turnstile/turnstile_170107.txt',
    ]

and what I want is df = dd.read_csv('XXXX*X').

Answer 1: Try using dask ...
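The answer is cut off here, but one approach that works with plain Dask is to skip the glob entirely and hand read_csv an explicit list of full URLs, since dd.read_csv accepts a list of paths; a sketch reusing the link and data names from the question:

    import dask.dataframe as dd

    link = 'http://web.mta.info/developers/'
    data = [
        'data/nyct/turnstile/turnstile_170128.txt',
        'data/nyct/turnstile/turnstile_170121.txt',
        'data/nyct/turnstile/turnstile_170114.txt',
        'data/nyct/turnstile/turnstile_170107.txt',
    ]

    # Glob patterns are not expanded over http, but read_csv accepts an
    # explicit list of paths/URLs, so build the full URLs up front.
    df = dd.read_csv([link + d for d in data])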

After using Dask pivot_table I lose the index column

Submitted by 萝らか妹 on 2020-01-13 11:26:48

Question: I am losing the index column after I use pivot_table on a Dask dataframe and save the data to a Parquet file.

    import dask.dataframe as dd
    import pandas as pd

    df = pd.DataFrame()
    df["Index"] = [1, 2, 3, 1, 2, 3]
    df["Field"] = ["A", "A", "A", "B", "B", "B"]
    df["Value"] = [10, 20, 30, 100, 120, 130]
    df

My dataframe:

       Index Field  Value
    0      1     A     10
    1      2     A     20
    2      3     A     30
    3      1     B    100
    4      2     B    120
    5      3     B    130

Dask code:

    ddf = dd.from_pandas(df, 2)
    ddf = ddf.categorize("Field")
    ddf = ddf.pivot_table(values="Value", index="Index", columns="Field")
    dd.to ...
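pivot_table moves Index into the dataframe's index rather than keeping it as a column, so whether it survives depends on how the Parquet file is written and read back. A sketch of one way to keep it, using the names from the question (the output path is hypothetical): call reset_index() so Index becomes a regular column again before writing.

    import dask.dataframe as dd
    import pandas as pd

    df = pd.DataFrame({
        "Index": [1, 2, 3, 1, 2, 3],
        "Field": ["A", "A", "A", "B", "B", "B"],
        "Value": [10, 20, 30, 100, 120, 130],
    })

    ddf = dd.from_pandas(df, npartitions=2)
    ddf = ddf.categorize("Field")
    pivoted = ddf.pivot_table(values="Value", index="Index", columns="Field")

    # After pivot_table, "Index" is the dataframe index, not a column; turn it
    # back into a column so it is stored explicitly in the Parquet output.
    pivoted = pivoted.reset_index()
    pivoted.to_parquet("pivoted.parquet")  # hypothetical output path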

Dask item assignment. Cannot use loc for item assignment

Submitted by 可紊 on 2020-01-13 11:23:20

Question: I have a folder of Parquet files that I can't fit in memory, so I am using Dask to perform the data-cleansing operations. I have a function where I want to perform item assignment, but I can't seem to find any solutions online that apply to this particular function. Below is the function that works in pandas. How do I get the same results with a Dask dataframe? I thought delayed might help, but all of the solutions I've tried to write haven't been working.

    def item_assignment(df):
        ...
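Because the body of item_assignment is cut off above, the sketch below uses a made-up .loc assignment (the condition, column names, and input path are hypothetical) and shows the two patterns that usually replace .loc item assignment in Dask: rewriting it with mask(), or running the unchanged pandas function on each partition via map_partitions.

    import dask.dataframe as dd

    def item_assignment(df):
        # Hypothetical pandas-style assignment standing in for the truncated one.
        df.loc[df["amount"] < 0, "amount"] = 0
        return df

    ddf = dd.read_parquet("cleansing/*.parquet")  # hypothetical input path

    # Option 1: express the assignment without .loc, using mask().
    ddf["amount"] = ddf["amount"].mask(ddf["amount"] < 0, 0)

    # Option 2: apply the unmodified pandas function partition by partition.
    ddf = ddf.map_partitions(item_assignment)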

move from pandas to dask to utilize all local cpu cores

Submitted by 做~自己de王妃 on 2020-01-12 08:50:39

Question: Recently I stumbled upon http://dask.pydata.org/en/latest/. As I have some pandas code which only runs on a single core, I wonder how to make use of my other CPU cores. Would Dask work well to use all (local) CPU cores? If yes, how compatible is it with pandas? Could I use multiple CPUs with pandas? So far I have read about releasing the GIL, but that all seems rather complicated.

Answer 1: Would dask work well to use all (local) CPU cores? Yes. How compatible is it with pandas? Pretty compatible. Not 100%. ...
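A minimal sketch of the move from single-core pandas to multi-core Dask, assuming a simple groupby as the stand-in workload; the column names, partition count, and choice of the processes scheduler are illustrative rather than prescriptive.

    import dask.dataframe as dd
    import numpy as np
    import pandas as pd

    # Stand-in for existing single-core pandas data.
    pdf = pd.DataFrame({
        "key": np.random.randint(0, 100, size=1_000_000),
        "value": np.random.random(1_000_000),
    })

    # Split into several partitions and keep using the same pandas-like API;
    # Dask spreads partitions across local cores when compute() runs.
    ddf = dd.from_pandas(pdf, npartitions=8)
    result = ddf.groupby("key")["value"].mean().compute(scheduler="processes")
    print(result.head())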

What threads do Dask Workers have active?

Submitted by 你离开我真会死。 on 2020-01-11 10:08:10

Question: When running a Dask worker I notice that there are a few extra threads beyond what I was expecting. How many threads should I expect to see running in a Dask worker, and what are they doing?

Answer 1: Dask workers have the following threads:

- A pool of threads in which to run tasks. This is typically somewhere between 1 and the number of logical cores on the computer.
- One administrative thread to manage the event loop, communication over (non-blocking) sockets, responding to fast queries, the ...
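A small sketch of where the size of that task-thread pool comes from when running the distributed scheduler locally; the worker and thread counts are arbitrary examples.

    from dask.distributed import Client, LocalCluster

    # Each worker gets threads_per_worker threads for running tasks, plus
    # Dask's own administrative threads (event loop, monitoring, and so on),
    # which is why a worker process shows more threads than task slots.
    cluster = LocalCluster(n_workers=2, threads_per_worker=4)
    client = Client(cluster)
    print(client)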

Dask For Loop In Parallel

Submitted by 主宰稳场 on 2020-01-11 08:47:07

Question: I am trying to find the correct syntax for using a for loop with dask.delayed. I have found several tutorials and other questions, but none fit my condition, which is extremely basic. First, is this the correct way to run a for loop in parallel?

    %%time
    from dask import delayed

    list_names = ['a', 'b', 'c', 'd']
    keep_return = []

    @delayed
    def loop_dummy(target):
        for i in range(1000000000):
            pass
        print('passed value is:' + target)
        return 1

    for i in list_names:
        c = loop_dummy(i)
        keep_return.append(c)

    total = delayed(sum)(keep_return)
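For reference, a sketch of the pattern this loop is usually written as, reusing the question's names: build the delayed calls in a list, combine them with delayed(sum), and trigger execution once with compute(). One caveat: a pure-Python busy loop like loop_dummy holds the GIL, so the default threaded scheduler shows little speedup for it; the processes scheduler (or work that releases the GIL, such as NumPy/pandas operations) does.

    from dask import delayed

    list_names = ['a', 'b', 'c', 'd']

    @delayed
    def loop_dummy(target):
        # Stand-in for real per-item work.
        total = 0
        for _ in range(10_000_000):
            total += 1
        return 1

    keep_return = [loop_dummy(name) for name in list_names]
    total = delayed(sum)(keep_return)

    # Nothing has run yet; compute() executes the whole graph in parallel.
    result = total.compute(scheduler="processes")
    print(result)  # 4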

Does dask dataframe apply preserve rows order?

Submitted by 泪湿孤枕 on 2020-01-06 06:33:23

Question: I am considering using a closure holding the current state to compute the rolling window (which in my case is of width 2), in order to answer my own question, which I recently posed. Something along the lines of:

    def test(init_value):
        def my_fcn(x, y):
            nonlocal init_value
            actual_value = (x + y) * init_value
            init_value = actual_value
            return init_value
        return my_fcn

where my_fcn is a dummy function used for testing. The function might therefore be initialised through actual_fcn = test(0), where we assume ...
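Carrying state in a closure like this assumes apply walks the rows sequentially in a single process; in Dask, partitions are processed independently (possibly on different workers), so the closure's state will not carry across partition boundaries. A sketch of an order-safe alternative for a width-2 window uses map_overlap, so each partition also sees the last row of its predecessor; the column name and example data are hypothetical.

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
    ddf = dd.from_pandas(pdf, npartitions=3)

    def pairwise(part):
        # Width-2 rolling computation on an ordinary pandas partition; the row
        # shared across partition boundaries is supplied by map_overlap below.
        return part["x"] + part["x"].shift(1)

    # before=1 hands each partition the final row of the previous one, so no
    # state needs to be threaded through a closure.
    result = ddf.map_overlap(pairwise, before=1, after=0).compute()
    print(result)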