How to best share static data between ipyparallel client and remote engines?

I used this logic in some code a couple of years ago, and this is what I ended up with. My code was something like:

import ipyparallel as ipp

shared_dict = {
    # big dict with ~10k keys, each with a list of dicts
}

# the Client talks to the running engines; engines[:] is a DirectView over all of them
engines = ipp.Client()
balancer = engines.load_balanced_view()

with engines[:].sync_imports():  # imports executed on every engine (your 'view' variable)
    import pandas as pd
    import ujson as json

# push the big dict into every engine's namespace under a single name
engines[:].push({'shared_dict': shared_dict})

# 'ids' and 'my_func' stand for your own task IDs and worker function;
# my_func must also be available on the engines (see the note below)
results = balancer.map(lambda i: (i, my_func(i)), ids)
results_data = results.get()
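To be explicit about how my_func sees the shared data: once the dict has been pushed into each engine's namespace, the worker function can simply refer to it by name, as long as the function itself is also available on the engines (importable there, or pushed just like the data). A hypothetical my_func, purely for illustration:

def my_func(i):
    # 'shared_dict' and 'pd' resolve in the engine's namespace,
    # where they were pushed / imported
    return pd.DataFrame(shared_dict[i]).shape[0]

# make the function itself available on every engine as well
engines[:].push({'my_func': my_func})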

You wrote:

"If simulation counts are small (~50), it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks get assigned to the same engine, and I don't see a response until all of the tasks assigned to that engine have completed. I would expect to see a response from enumerate(ar) every time a single simulation task completes."

In my case, my_func() was a complex method that wrote lots of logging messages to a file, so those log files served as my progress indicators instead of print statements.

As for the task assignment: since I used load_balanced_view(), I left it to the library to distribute the work, and it did a great job.
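If you want to see a response as soon as each single task completes, one option (my own addition, not part of my original code) is to keep the map non-blocking and iterate over the AsyncMapResult as results arrive; ordered=False and chunksize=1 here are assumptions about how you want the tasks dispatched:

# non-blocking, load-balanced map; iterating the result yields values
# as they arrive instead of waiting for the whole batch
ar = balancer.map(lambda i: (i, my_func(i)), ids,
                  block=False, ordered=False, chunksize=1)
for n, (i, res) in enumerate(ar):
    print("task {0} done (id {1})".format(n, i))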

You also wrote:

"If simulation counts are large (~1000), it takes a long time to get started, I see the CPUs throttle up on all engines, but no progress print statements appear for a long time (~40 mins), and when I do see progress, it appears a large block (>100) of tasks went to the same engine and awaited completion from that one engine before providing any progress. When that one engine did complete, I saw the ar object provide new responses every 4 secs; this may have been the time delay to write the output pickle files."

As for the long startup time, I haven't experienced that myself, so I can't really say anything about it.

I hope this casts some light on your problem.


PS: as I said in the comment, you could also try multiprocessing.Pool. I haven't tried sharing big, read-only data as a global variable with it, but I would give it a try, because it seems to work.
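For reference, here is a minimal sketch of that multiprocessing.Pool idea (my own, untested against your workload), passing the read-only dict to each worker through an initializer; the _init/_worker names and the toy data are just placeholders:

import multiprocessing as mp

_shared = None  # module-level global, filled in once per worker by the initializer

def _init(shared):
    global _shared
    _shared = shared

def _worker(key):
    # read-only lookup against this worker's copy of the shared dict
    return key, len(_shared.get(key, []))

if __name__ == "__main__":
    shared_dict = {i: [{"value": i}] for i in range(10000)}
    ids = list(shared_dict)
    with mp.Pool(initializer=_init, initargs=(shared_dict,)) as pool:
        results = pool.map(_worker, ids)

With the fork start method (the default on Linux) the workers inherit the dict from the parent process; with spawn it is pickled once per worker, not once per task.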

Sometimes you need to scatter your data grouped by a category, so that you can be sure each subgroup ends up entirely on a single engine.

This is how I usually do it:

# Connect to the cluster
import ipyparallel as ipp
client = ipp.Client()
lview = client.load_balanced_view()
lview.block = True
CORES = len(client[:])

# Define the scatter_by function
def scatter_by(df, grouper, name='df'):
    # group labels sorted by group size; the strided slice below spreads
    # big and small groups evenly across the engines
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into engine {2}...".format(len(ids), grouper, core))
        client[core].push({name: df[df[grouper].isin(ids)]})

# Scatter the dataframe df grouping by `year`
scatter_by(df, 'year')

Notice that the scatter function I'm suggesting makes sure each engine hosts a similar number of observations, which is usually a good idea.
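Once the data is scattered, each engine can work on its own chunk. A small sketch of what that could look like (my addition; it assumes you kept the default name='df' above, and counts is just an illustrative variable name):

dview = client[:]

# run code inside each engine's namespace, where its slice of df was pushed
dview.execute("counts = df.groupby('year').size()", block=True)

# pull the per-engine results back to the client as a list
per_engine_counts = dview.pull('counts', block=True)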
