How to run dask in multiple machines? [closed]

Submitted by 柔情痞子 on 2020-05-28 03:28:18

Question


I found Dask recently. I have some very basic questions about the Dask DataFrame and other data structures.

  1. Is the Dask DataFrame an immutable data type?
  2. Are Dask Array and Dask DataFrame lazy data structures?

I don't know whether to use Dask, Spark, or pandas for my situation. I have 200 GB of data to compute over. A plain Python program took 9 hours to run the computation, but it could be processed in parallel in less time by utilizing a 16-core processor. If I split the DataFrame myself in pandas, I need to worry about the commutative and associative properties of my calculations. On the other hand, I could use a standalone Spark cluster to split up the data and run it in parallel.

Do I need to set up a cluster for Dask, as with Spark?
How do I run Dask DataFrames on my own compute nodes?
Does Dask need a master-worker setup?

I am a fan of pandas, so I am looking for solutions similar to pandas.


Answer 1:


There appear to be a few questions here:

Q: Are Dask.dataframes immutable?

Not strictly. They support column assignment. Generally, though, you're correct that most of the in-place mutation operations of pandas are not supported.

Q: Are Dask.dataframe and Dask.array lazy?

Yes
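Laziness means operations build a task graph and nothing runs until you ask for a result. A small Dask Array sketch:

```python
import dask.array as da

# Lazy: this builds a task graph of chunked sums; no arithmetic happens yet.
x = da.ones((1000, 1000), chunks=(500, 500))
total = x.sum()

# compute() triggers the actual (parallel) computation.
result = total.compute()
print(result)  # 1000000.0
```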

Q: Do I need to set up a cluster?

No, you can choose to run Dask on a cluster or on a single machine.
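On a single machine no setup is needed at all: Dask ships with local schedulers that use your cores out of the box. A sketch using `dask.delayed` (the `inc` function is just a stand-in for your own computation):

```python
import dask

@dask.delayed
def inc(i):
    # Placeholder for a real unit of work.
    return i + 1

# Build 16 lazy tasks, then run them; by default dask.compute
# executes them in parallel on the local machine.
results = dask.compute(*[inc(i) for i in range(16)])
print(results)
```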

Q: If I want to use a cluster how do I do it?

See the documentation for Dask.distributed, and the setup docs in particular.
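A rough sketch of the distributed setup (hostnames and the port are placeholders; see the setup docs for the authoritative commands and options):

```shell
# On the scheduler machine:
dask-scheduler                      # listens on tcp://<scheduler-ip>:8786 by default

# On each worker machine:
dask-worker tcp://<scheduler-ip>:8786

# Then, in your Python session, connect a client:
#   from dask.distributed import Client
#   client = Client("tcp://<scheduler-ip>:8786")
# After this, Dask DataFrame/Array operations run on the cluster.
```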

Q: Should I use Dask, Spark, or Pandas?

This question is overly broad and depends on your situation.



Source: https://stackoverflow.com/questions/39439408/how-to-run-dask-in-multiple-machines
