Question
I am using Python in an AWS Lambda function to list the keys in an S3 bucket that contain a specific ID:
for object in mybucket.objects.all():
    file_name = os.path.basename(object.key)
    match_id = file_name.split('_', 1)[0]
The problem is that if the S3 bucket holds several thousand files, the iteration is very inefficient and the Lambda function sometimes times out.
Here is an example file name:
https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg
I want to iterate only over objects whose key name contains "012345". Is there a good way to accomplish that?
Answer 1:
Here is how you need to solve it.
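S3 stores everything as objects; there are no real folders or file names, only keys. The folder view exists purely for user convenience, so you can list objects by key prefix. For example,
aws s3 ls s3://bucket/folder1/folder2/filenamepart --recursive
lists every S3 object whose key starts with that prefix. The same filter is available in boto3: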
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')
# Only objects whose key starts with the prefix are returned
for obj in my_bucket.objects.filter(Prefix='012345'):
    print(obj)
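The same prefix filter can also be written against the lower-level client with a paginator, which makes the page-by-page listing explicit. This is only a minimal sketch, not the answer's own code: the function name iter_keys_with_id, the folder parameter, and the trailing underscore in the prefix are assumptions based on the example file name 012345_abc_happy.jpg. Note that Prefix matches only the start of a key, so an ID that sits in the middle of a key would still have to be filtered on the client side.

```python
def iter_keys_with_id(client, bucket, match_id, folder=""):
    """Yield keys in `bucket` that start with `folder + match_id + "_"`.

    `client` is expected to behave like boto3.client("s3").
    The trailing underscore is an assumption taken from the example
    file name; it keeps "012345" from also matching "0123456_...".
    """
    paginator = client.get_paginator("list_objects_v2")
    prefix = f"{folder}{match_id}_"
    # S3 applies the prefix server-side, so only matching keys come back.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for item in page.get("Contents", []):
            yield item["Key"]
```

With boto3 installed and credentials configured, this could be called as list(iter_keys_with_id(boto3.client("s3"), "bucketname", "012345")).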
To speed up the listing, you can run multiple scripts in parallel.
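One way to sketch the parallel idea inside a single process is to use threads rather than separate scripts, since listing is dominated by network waits. The helper name list_prefixes_in_parallel and the list_keys callable below are hypothetical; list_keys stands in for whatever function lists the keys under one prefix. boto3's documentation notes that resource objects are not thread safe, so a real list_keys would probably create its own resource or use a client.

```python
from concurrent.futures import ThreadPoolExecutor


def list_prefixes_in_parallel(list_keys, prefixes, max_workers=8):
    """Call `list_keys(prefix)` for every prefix concurrently.

    `list_keys` is a hypothetical callable that returns the keys
    under one prefix, for example a wrapper around
    my_bucket.objects.filter(Prefix=prefix).
    Returns a dict mapping each prefix to its list of keys.
    """
    prefixes = list(prefixes)  # allow generators; we iterate twice
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `prefixes`.
        results = pool.map(lambda p: list(list_keys(p)), prefixes)
        return dict(zip(prefixes, results))
```

This only helps when there are several distinct prefixes to list; a single prefix is still listed page by page.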
Hope it helps.
Answer 2:
You can improve speed by 30-40% by dropping os and using string methods.
Depending on the assumptions you can make about the file path string, you can get additional speedups:
Using os.path.basename():
%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
os.path.basename(fname).split("_")[0] == match
# 1.03 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Without os, splitting first on / and then on _:
%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
fname.split("/")[-1].split("_")[0] == match
# 657 ns ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
If you know that the only underscores occur in the actual file name, you can use just one split() (the [-6:] slice also assumes the ID is always six characters long):
%%timeit
match = "012345"
fname = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"
fname.split("_")[0][-6:] == match
# 388 ns ± 5.65 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
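The %%timeit cells above only run in IPython or Jupyter. As a rough sketch, the same comparison can be run in plain Python with the standard timeit module. The function names below are made up for illustration, and absolute timings will differ by machine and include a small overhead from the lambda wrapper.

```python
import os
import timeit

MATCH = "012345"
FNAME = "https://s3.console.aws.amazon.com/s3/object/bucket-name/012345_abc_happy.jpg"


def with_basename(fname):
    return os.path.basename(fname).split("_")[0] == MATCH


def with_two_splits(fname):
    return fname.split("/")[-1].split("_")[0] == MATCH


def with_one_split(fname):
    # Assumes no underscore before the file name and a six-character ID.
    return fname.split("_")[0][-6:] == MATCH


if __name__ == "__main__":
    number = 100_000
    for func in (with_basename, with_two_splits, with_one_split):
        seconds = timeit.timeit(lambda: func(FNAME), number=number)
        # Convert total seconds into nanoseconds per call.
        print(f"{func.__name__}: {seconds / number * 1e9:.0f} ns per call")
```

All three functions return the same result for the example name, so the choice between them comes down to which assumptions about the key format hold.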
Source: https://stackoverflow.com/questions/47878893/aws-s3-list-keys-containing-a-string