I have a bucket with thousands of files in it. How can I search the bucket? Is there a tool you can recommend?
I tried the following:
aws s3 ls s3://Bucket1/folder1/2019/ --recursive | grep filename.csv
This outputs the actual path where the file exists
2019-04-05 01:18:35 111111 folder1/2019/03/20/filename.csv
There are multiple options, none of which is a simple "one shot" full-text solution:
Key name pattern search: searching for keys starting with some string. If you design your key names carefully, this can give you a rather quick solution; a short sketch of this and the next option follows the list.
Search metadata attached to keys: when putting a file to AWS S3, you may process the content, extract some meta information, and attach that meta information in the form of custom headers to the key. This allows you to fetch key names and headers without fetching the complete content. The search has to be done sequentially; there is no "SQL like" search option for this. With large files this can save a lot of network traffic and time.
Store metadata in SimpleDB: as the previous point, but with the metadata stored in SimpleDB, where you have SQL-like select statements. With large data sets you may hit SimpleDB limits, which can be overcome (partition metadata across multiple SimpleDB domains), but if you go really far you may need another type of metadata database. A SimpleDB sketch also follows the list.
Sequential full-text search of the content: processing all the keys one by one. Very slow if you have too many keys to process.
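To make the first two options concrete, here is a minimal sketch with the AWS SDK for Java v1. The bucket name, key, local file, and the user-metadata key "department" are made-up placeholders for illustration, not anything prescribed by S3:

import java.io.File
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{ObjectMetadata, PutObjectRequest}
import scala.collection.JavaConverters._

object S3SearchOptions extends App {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  val bucket = "Bucket1" // placeholder bucket name

  // Option 2: attach custom (user) metadata at upload time so it can be read later without the body
  val meta = new ObjectMetadata()
  meta.addUserMetadata("department", "sales") // assumed example header
  s3.putObject(new PutObjectRequest(bucket, "folder1/2019/03/20/filename.csv", new File("filename.csv"))
    .withMetadata(meta))

  // Option 1: key name pattern search - the prefix filter is applied on the server side
  val listing = s3.listObjectsV2(bucket, "folder1/2019/") // first page only; pagination omitted for brevity
  for (summary <- listing.getObjectSummaries.asScala) {
    // Read back only the headers/metadata of each candidate key, without transferring the content
    val head = s3.getObjectMetadata(bucket, summary.getKey)
    println(s"${summary.getKey} -> ${head.getUserMetadata.asScala.getOrElse("department", "n/a")}")
  }
}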
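For the third option, a rough sketch of the SimpleDB query side; the domain name "s3-index" and the attribute name "filename" are made up, and filling the domain at upload time is assumed to happen elsewhere:

import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder
import com.amazonaws.services.simpledb.model.SelectRequest
import scala.collection.JavaConverters._

object SimpleDbIndexQuery extends App {
  val sdb = AmazonSimpleDBClientBuilder.defaultClient()

  // The item name is assumed to be the S3 key; "filename" is an assumed attribute on each item
  val query = "select * from `s3-index` where filename = 'filename.csv'"
  val result = sdb.select(new SelectRequest(query))

  for (item <- result.getItems.asScala) {
    val attrs = item.getAttributes.asScala.map(a => s"${a.getName}=${a.getValue}").mkString(", ")
    println(s"${item.getName} -> $attrs")
  }
}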
We are storing 1440 versions of a file a day (one per minute) for a couple of years; using a versioned bucket, this is easily possible. But getting some older version takes time, as one has to go through the versions sequentially. Sometimes I use a simple CSV index of records showing publication time plus version id; with this, I can jump to an older version rather quickly.
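A rough sketch of how such a version index could be built and then used, again with the AWS SDK for Java v1 (the bucket and key are placeholders):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{GetObjectRequest, ListVersionsRequest}
import scala.collection.JavaConverters._

object VersionIndexSketch extends App {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  val bucket = "Bucket1"                      // placeholder bucket name
  val key = "folder1/2019/03/20/filename.csv" // placeholder key

  // Build a "publication time -> version id" index (e.g. to write out as a CSV) in one pass
  var listing = s3.listVersions(new ListVersionsRequest().withBucketName(bucket).withPrefix(key))
  var index = List.empty[(java.util.Date, String)]
  var more = true
  while (more) {
    index = index ++ listing.getVersionSummaries.asScala.map(v => (v.getLastModified, v.getVersionId))
    if (listing.isTruncated) listing = s3.listNextBatchOfVersions(listing) else more = false
  }

  // With the index at hand, fetch one specific version directly instead of paging through all of them
  index.headOption.foreach { case (time, versionId) =>
    val obj = s3.getObject(new GetObjectRequest(bucket, key, versionId))
    println(s"Version $versionId published $time, ${obj.getObjectMetadata.getContentLength} bytes")
    obj.close()
  }
}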
As you see, AWS S3 is not on its own designed for full-text searches; it is a simple storage service.
Here's a short and ugly way to search file names using the AWS CLI:
aws s3 ls s3://your-bucket --recursive | grep your-search | cut -c 32-
I faced the same problem. Searching in S3 should be much easier than the current situation. That's why I implemented this open-source tool for searching in S3.
SSEARCH is a fully open-source S3 search tool. It was implemented always keeping in mind that performance is the critical factor, and according to the benchmarks it searches a bucket containing ~1000 files within seconds.
Installation is simple. You only download the docker-compose file and run it with
docker-compose up
SSEARCH will be started and you can search anything in any bucket you have.
I did something like the following to find patterns in my bucket:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}
import scala.collection.JavaConverters._

def getListOfPrefixesFromS3(dataPath: String, prefix: String, delimiter: String, batchSize: Integer): List[String] = {
  val s3Client = new AmazonS3Client()
  val listObjectsRequest = new ListObjectsRequest().withBucketName(dataPath).withMaxKeys(batchSize).withPrefix(prefix).withDelimiter(delimiter)
  var objectListing: ObjectListing = null
  var res: List[String] = List()
  do {
    // Each page returns the common prefixes ("directories") directly under the given prefix/delimiter
    objectListing = s3Client.listObjects(listObjectsRequest)
    res = res ++ objectListing.getCommonPrefixes.asScala
    // Continue the next page from where the previous one stopped
    listObjectsRequest.setMarker(objectListing.getNextMarker)
  } while (objectListing.isTruncated)
  res
}
For larger buckets this consumes too much time, since all the object summaries are returned by AWS and not only the ones that match the prefix and the delimiter. I am looking for ways to improve the performance, and so far I've only found that I should name the keys and organise them into buckets properly.
S3 doesn't have a native "search this bucket" feature, since the actual content is unknown. Also, since S3 is key/value based, there is no native way to access many nodes at once, à la more traditional datastores that offer SELECT * FROM ... WHERE ... (in a SQL model).
What you will need to do is perform ListBucket to get a listing of the objects in the bucket, and then iterate over every item, performing a custom operation that you implement - which is your searching.
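A minimal sketch of that list-then-filter approach with the AWS SDK for Java v1; the bucket name and the key regex standing in for "your searching" are placeholders:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{ListObjectsV2Request, ListObjectsV2Result}
import scala.collection.JavaConverters._

object ListAndFilter extends App {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  val bucket = "Bucket1"            // placeholder bucket name
  val pattern = "filename\\.csv$".r // placeholder "search", implemented here as a key regex

  val request = new ListObjectsV2Request().withBucketName(bucket)
  var result: ListObjectsV2Result = null
  do {
    // ListBucket (ListObjectsV2) returns keys page by page; the matching happens client side
    result = s3.listObjectsV2(request)
    result.getObjectSummaries.asScala
      .filter(summary => pattern.findFirstIn(summary.getKey).isDefined)
      .foreach(summary => println(summary.getKey))
    request.setContinuationToken(result.getNextContinuationToken)
  } while (result.isTruncated)
}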