I have a bucket with thousands of files in it. How can I search the bucket? Is there a tool you can recommend?
You can search by prefix directly in the AWS Console bucket view.
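The Console filter only matches on key prefix; the equivalent from the command line is aws s3 ls with a prefix path (the bucket and prefix below are placeholders):

aws s3 ls s3://your-bucket/some-prefix/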
When you have thousands or millions of files, another way to get at the files you want is to copy them to another location using a distributed copy, run as a Hadoop job on EMR. The cool thing is that AWS provides s3-dist-cp, their S3-optimized version of DistCp, which lets you group the files you want with a regular expression via the --groupBy option. You can use it, for example, in a custom step on EMR:
[
    {
        "ActionOnFailure": "CONTINUE",
        "Args": [
            "s3-dist-cp",
            "--s3Endpoint=s3.amazonaws.com",
            "--src=s3://mybucket/",
            "--dest=s3://mytarget-bucket/",
            "--groupBy=MY_PATTERN",
            "--targetSize=1000"
        ],
        "Jar": "command-runner.jar",
        "Name": "S3DistCp Step Aggregate Results",
        "Type": "CUSTOM_JAR"
    }
]
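If you save that JSON to a file, you can submit it to a running cluster with aws emr add-steps (the cluster ID and file name here are placeholders):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps file://./s3distcp-step.json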
Try this command:
aws s3api list-objects --bucket your-bucket --prefix sub-dir-path --output text --query 'Contents[].{Key: Key}'
Then you can pipe the output into grep to pull out specific file types and do whatever you want with them.
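For instance, to keep only the .csv keys under that prefix (the extension is just an illustration):

aws s3api list-objects --bucket your-bucket --prefix sub-dir-path --output text --query 'Contents[].{Key: Key}' | grep '\.csv$'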
Fast forward to 2020: using aws-okta for 2FA, the following command worked fine, though it was slow as hell iterating through all of the objects and folders in this particular bucket (270,000+).
aws-okta exec dev -- aws s3 ls my-cool-bucket --recursive | grep needle-in-haystax.txt
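If you know part of the key's path, you can cut the runtime considerably by scoping the listing to a prefix instead of the whole bucket (the prefix here is a placeholder):

aws-okta exec dev -- aws s3 ls s3://my-cool-bucket/some/prefix/ --recursive | grep needle-in-haystax.txt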