How to use boto3 client with Python multiprocessing?

滥情空心 2021-01-12 05:11

Code looks something like this:

import multiprocessing as mp
from functools import partial

import boto3
import numpy as np


s3 = boto3.client('s3')

def ...
2 Answers
  •  长情又很酷
    2021-01-12 06:01

    Well, I solved it in a pretty straightforward way: by using a simpler, less complex object than the boto3 client. I used the Bucket class from boto3.resource.

    However, you should take the following post into consideration: Can't pickle when using multiprocessing Pool.map(). I put every boto3-related object outside any class or function. Other posts suggest instead creating the s3 objects and functions inside the function you're trying to parallelize, so that nothing boto3-related has to be pickled; I haven't tried that yet, but a minimal sketch of the idea follows below.
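    For illustration only, here is a minimal sketch of that alternative using plain multiprocessing (the names process_key and my-bucket are placeholders, not from the original post): each worker builds its own client, so no boto3 object ever has to cross a process boundary.

    import multiprocessing as mp

    import boto3

    def process_key(key):
        # create the client inside the worker: boto3 clients are not
        # picklable, but here each process constructs its own
        s3 = boto3.client('s3')
        body = s3.get_object(Bucket='my-bucket', Key=key)['Body'].read()
        return key, len(body)

    if __name__ == '__main__':
        keys = ['data/a.msg', 'data/b.msg']  # placeholder object keys
        with mp.Pool(processes=4) as pool:
            print(pool.map(process_key, keys))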

    As promised, here is my code example, which saves information to S3 in the msgpack file format (the boto3 objects are defined at module level, outside any class or function). Hope it helps.

    import pandas as pd
    import boto3
    from botocore.exceptions import ClientError
    from pathos.pools import ProcessPool

    module = 'msgpack_s3'  # tag prepended to every log message

    s3 = boto3.resource('s3')
    s3_bucket_name = 'bucket-name'
    s3_bucket = s3.Bucket(s3_bucket_name)

    def msgpack_dump_s3(df, filename):
        try:
            # note: DataFrame.to_msgpack() was removed in pandas 1.0,
            # so this code requires pandas < 1.0
            s3_bucket.put_object(Body=df.to_msgpack(), Key=filename)
            print(module, filename + " successfully saved into s3 bucket '" + s3_bucket.name + "'")
        except Exception as e:
            # log everything else as a warning and continue
            print(module, "Failed saving " + filename + ". Continuing. {}".format(e))

    def msgpack_load_s3(filename):
        try:
            return s3_bucket.Object(filename).get()['Body'].read()
        except ClientError as ex:
            if ex.response['Error']['Code'] == 'NoSuchKey':
                print(module, 'No object found - returning None')
                return None
            else:
                print(module, "Failed loading " + filename + ". Continuing. {}".format(ex))
                raise ex
        except Exception as e:
            # log everything else as a warning and continue
            print(module, "Failed loading " + filename + ". Continuing. {}".format(e))
        return None

    def upper_function(files, ncpus):

        def function_to_parallelize(filename):
            file = msgpack_load_s3(filename)
            if file is None:
                return
            df = pd.read_msgpack(file)
            # do something with df here

            print('\t\t\tSaving updated info...')
            msgpack_dump_s3(df, filename)

        # pathos serializes with dill, so it can ship this nested function
        # to the worker processes (the stdlib pickle used by plain
        # multiprocessing cannot)
        pool = ProcessPool(nodes=ncpus)
        # map lazily over the files, then collect the results
        results = pool.imap(function_to_parallelize, files)
        print("...")
        print(list(results))
        # with pool.amap(...) you could poll for completion instead:
        #     while not results.ready():
        #         time.sleep(5)
        #         print(".", end=' ')
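    A hypothetical invocation of upper_function, with placeholder keys and worker count:

    if __name__ == '__main__':
        files = ['data/part-0.msg', 'data/part-1.msg']  # placeholder S3 keys
        upper_function(files, ncpus=4)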
