Last modified Time file list in aws s3 using python

Submitted by 不羁岁月 on 2020-03-03 02:55:48

Question


I have multiple keys under my aws s3 bucket. The structure is :

bucket/tableName1/Archive/archive1.json -to- bucket/tableName1/Archive/archiveN.json
bucket/tableName2/Archive/archive2.json -to- bucket/tableName2/Archive/archiveN.json
bucket/tableName1/Audit/audit1.json -to- bucket/tableName1/Audit/auditN.json
bucket/tableName2/Audit/audit2.json -to- bucket/tableName2/Audit/auditN.json

I want to get keys from the Audit folder only, if one is present under a table's prefix, and from that Audit folder pick only the latest file, i.e. the one with the most recent last-modified time.

The result I am trying to get is a list of dictionaries:

[{'tableName1' : 'auditN.json'}, {'tableName2' : 'auditN.json'}]

Assuming auditN.json is the newest file.

I have tried different methods but I am not getting the desired result. I am running the solution in a Databricks notebook. Is there a way I can achieve this?


Answer 1:


Well, I've read and searched through a lot of threads about what you're asking, but no luck. So I wrote my own Lambda function.

The following code snippet iterates over all top-level folders, then over their subfolders; if a subfolder is named Audit, it sorts that folder's objects by last-modified time and prints the newest one.

Be aware that this code fits your structure only, since the list_folders function returns only first-level subfolders.

If your structure changes to something like this:

bucket/tableName1/Audit/Audit1/audit.json

the Lambda won't work.
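A depth-independent alternative (a sketch, not part of the original answer) is to list full keys and match the Audit segment anywhere in the path. The selection logic itself needs no S3 calls, so it can be shown over plain (key, last_modified) pairs; `newest_audit_files` is a hypothetical helper name:

```python
from datetime import datetime

def newest_audit_files(objects):
    """Given (key, last_modified) pairs, return {table: newest Audit key},
    matching an 'Audit' folder segment at any depth."""
    newest = {}
    for key, modified in objects:
        parts = key.split('/')
        # look at folder segments only, never the file name itself
        if 'Audit' not in parts[1:-1]:
            continue
        table = parts[0]
        if table not in newest or modified > newest[table][1]:
            newest[table] = (key, modified)
    return {table: key for table, (key, modified) in newest.items()}

objs = [
    ('tableName1/Audit/audit1.json', datetime(2019, 11, 1)),
    ('tableName1/Audit/Audit1/audit2.json', datetime(2019, 11, 5)),  # nested one level deeper
    ('tableName2/Archive/archive1.json', datetime(2019, 11, 9)),     # not under Audit, ignored
]
print(newest_audit_files(objs))
# {'tableName1': 'tableName1/Audit/Audit1/audit2.json'}
```

With real S3 data, the pairs would come from `(obj.key, obj.last_modified)` over `bucket.objects.all()`.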

Code snippet:

import boto3

# bucket name (replace with your bucket)
bucket_name = 'Bucket Name'

# bucket resource
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)

# bucket client
s3_client = boto3.client('s3')

# sort key: objects' last-modified datetimes compare directly,
# no need for the platform-dependent strftime('%s')
get_last_modified = lambda obj: obj.last_modified

# get subfolders - 1 LEVEL ONLY!
def list_folders(s3_client, bucket_name, prefix):
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter='/')
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

def lambda_handler(event, context):
    # get all top-level folders
    folder_list = list_folders(s3_client, bucket_name, '')
    for folder in folder_list:
        # get all subfolders of this folder
        subfolders = list_folders(s3_client, bucket_name, folder)
        for subfolder in subfolders:
            # check whether the subfolder is named Audit
            if subfolder.split('/')[1] == 'Audit':
                # get all objects under the subfolder
                objs = list(bucket.objects.filter(Prefix=subfolder))
                if not objs:
                    continue
                # pick the object with the most recent last-modified time
                last_modified_file = max(objs, key=get_last_modified)
                # print results
                print('Last modified file Name: %s ---- Date: %s'
                      % (last_modified_file.key, last_modified_file.last_modified))
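To get the exact list-of-dicts shape the question asks for, the result can be collected instead of printed. The shaping step itself is plain Python (a sketch, assuming the table name is the first key segment and the file name is the last; `to_result_list` is a hypothetical helper name):

```python
def to_result_list(newest_by_table):
    """Turn {'tableName1': 'tableName1/Audit/auditN.json', ...} into
    [{'tableName1': 'auditN.json'}, ...] as requested in the question."""
    return [{table: key.split('/')[-1]}
            for table, key in sorted(newest_by_table.items())]

print(to_result_list({
    'tableName1': 'tableName1/Audit/audit3.json',
    'tableName2': 'tableName2/Audit/audit7.json',
}))
# [{'tableName1': 'audit3.json'}, {'tableName2': 'audit7.json'}]
```

In the handler above, each `last_modified_file.key` would be added to such a dict instead of being printed.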

Tested against files where Table2's subfolder was named Archive.

Output: (the file listing and output screenshots from the original answer are not preserved here)

Hope you will find it helpful.



Source: https://stackoverflow.com/questions/58719497/last-modified-time-file-list-in-aws-s3-using-python
