I have a bucket with thousands of files in it. How can I search the bucket? Is there a tool you can recommend?
I tried the following:
aws s3 ls s3://Bucket1/folder1/2019/ --recursive | grep filename.csv
This outputs the actual path where the file exists
2019-04-05 01:18:35 111111 folder1/2019/03/20/filename.csv
There are multiple options, none of which is a simple "one shot" full-text solution:
Key name pattern search: searching for keys starting with some string. If you design your key names carefully, this can give you a rather quick solution; a short sketch of this and the next option follows the list.
Search metadata attached to keys: when putting a file to AWS S3, you may process the content, extract some meta information, and attach that meta information in the form of custom headers to the key. This allows you to fetch key names and headers without fetching the complete content. The search has to be done sequentially; there is no "SQL like" search option for this. With large files this can save a lot of network traffic and time.
Store metadata in SimpleDB: as the previous point, but with the metadata stored in SimpleDB, where you have SQL-like select statements. With large data sets you may hit SimpleDB limits, which can be overcome (partition metadata across multiple SimpleDB domains), but if you go really far you may need another type of metadata database. A SimpleDB sketch also follows the list.
Sequential full-text search of the content: processing all the keys one by one. Very slow if you have too many keys to process.
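To make the first two options concrete, here is a minimal sketch with the AWS SDK for Java v1. The bucket name, key, local file, and the user-metadata key "department" are made-up placeholders for illustration, not anything prescribed by S3:

import java.io.File
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{ObjectMetadata, PutObjectRequest}
import scala.collection.JavaConverters._

object S3SearchOptions extends App {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  val bucket = "Bucket1" // placeholder bucket name

  // Option 2: attach custom (user) metadata at upload time so it can be read later without the body
  val meta = new ObjectMetadata()
  meta.addUserMetadata("department", "sales") // assumed example header
  s3.putObject(new PutObjectRequest(bucket, "folder1/2019/03/20/filename.csv", new File("filename.csv"))
    .withMetadata(meta))

  // Option 1: key name pattern search - the prefix filter is applied on the server side
  val listing = s3.listObjectsV2(bucket, "folder1/2019/") // first page only; pagination omitted for brevity
  for (summary <- listing.getObjectSummaries.asScala) {
    // Read back only the headers/metadata of each candidate key, without transferring the content
    val head = s3.getObjectMetadata(bucket, summary.getKey)
    println(s"${summary.getKey} -> ${head.getUserMetadata.asScala.getOrElse("department", "n/a")}")
  }
}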
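For the third option, a rough sketch of the SimpleDB query side; the domain name "s3-index" and the attribute name "filename" are made up, and filling the domain at upload time is assumed to happen elsewhere:

import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder
import com.amazonaws.services.simpledb.model.SelectRequest
import scala.collection.JavaConverters._

object SimpleDbIndexQuery extends App {
  val sdb = AmazonSimpleDBClientBuilder.defaultClient()

  // The item name is assumed to be the S3 key; "filename" is an assumed attribute on each item
  val query = "select * from `s3-index` where filename = 'filename.csv'"
  val result = sdb.select(new SelectRequest(query))

  for (item <- result.getItems.asScala) {
    val attrs = item.getAttributes.asScala.map(a => s"${a.getName}=${a.getValue}").mkString(", ")
    println(s"${item.getName} -> $attrs")
  }
}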
We are storing 1440 versions of a file a day (one per minute) for a couple of years; using a versioned bucket, this is easily possible. But getting some older version takes time, as one has to go through the versions sequentially. Sometimes I use a simple CSV index of records showing publication time plus version id; with this, I can jump to an older version rather quickly.
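A rough sketch of how such a version index could be built and then used, again with the AWS SDK for Java v1 (the bucket and key are placeholders):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{GetObjectRequest, ListVersionsRequest}
import scala.collection.JavaConverters._

object VersionIndexSketch extends App {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  val bucket = "Bucket1"                      // placeholder bucket name
  val key = "folder1/2019/03/20/filename.csv" // placeholder key

  // Build a "publication time -> version id" index (e.g. to write out as a CSV) in one pass
  var listing = s3.listVersions(new ListVersionsRequest().withBucketName(bucket).withPrefix(key))
  var index = List.empty[(java.util.Date, String)]
  var more = true
  while (more) {
    index = index ++ listing.getVersionSummaries.asScala.map(v => (v.getLastModified, v.getVersionId))
    if (listing.isTruncated) listing = s3.listNextBatchOfVersions(listing) else more = false
  }

  // With the index at hand, fetch one specific version directly instead of paging through all of them
  index.headOption.foreach { case (time, versionId) =>
    val obj = s3.getObject(new GetObjectRequest(bucket, key, versionId))
    println(s"Version $versionId published $time, ${obj.getObjectMetadata.getContentLength} bytes")
    obj.close()
  }
}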
As you see, AWS S3 is not on its own designed for full-text searches; it is a simple storage service.
Here's a short and ugly way to search file names using the AWS CLI:
aws s3 ls s3://your-bucket --recursive | grep your-search | cut -c 32-
I faced the same problem. Searching in S3 should be much easier than the current situation. That's why I implemented this open-source tool for searching in S3.
SSEARCH is a fully open-source S3 search tool. It was implemented always keeping in mind that performance is the critical factor, and according to the benchmarks it searches a bucket containing ~1000 files within seconds.
Installation is simple. You only download the docker-compose file and run it with
docker-compose up
SSEARCH will be started and you can search anything in any bucket you have.
I did something like the following to find patterns in my bucket:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}
import scala.collection.JavaConverters._

def getListOfPrefixesFromS3(dataPath: String, prefix: String, delimiter: String, batchSize: Integer): List[String] = {
  val s3Client = new AmazonS3Client()
  val listObjectsRequest = new ListObjectsRequest().withBucketName(dataPath).withMaxKeys(batchSize).withPrefix(prefix).withDelimiter(delimiter)
  var objectListing: ObjectListing = null
  var res: List[String] = List()
  do {
    // Each page returns the common prefixes ("directories") directly under the given prefix/delimiter
    objectListing = s3Client.listObjects(listObjectsRequest)
    res = res ++ objectListing.getCommonPrefixes.asScala
    // Continue the next page from where the previous one stopped
    listObjectsRequest.setMarker(objectListing.getNextMarker)
  } while (objectListing.isTruncated)
  res
}
For larger buckets this consumes too much time, since all the object summaries are returned by AWS and not only the ones that match the prefix and the delimiter. I am looking for ways to improve the performance, and so far I've only found that I should name the keys and organise them into buckets properly.
S3 doesn't have a native "search this bucket" feature, since the actual content is unknown. Also, since S3 is key/value based, there is no native way to access many nodes at once, à la more traditional datastores that offer SELECT * FROM ... WHERE ... (in a SQL model).
What you will need to do is perform ListBucket to get a listing of the objects in the bucket, and then iterate over every item, performing a custom operation that you implement - which is your searching.
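A minimal sketch of that list-then-filter approach with the AWS SDK for Java v1; the bucket name and the key regex standing in for "your searching" are placeholders:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{ListObjectsV2Request, ListObjectsV2Result}
import scala.collection.JavaConverters._

object ListAndFilter extends App {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  val bucket = "Bucket1"            // placeholder bucket name
  val pattern = "filename\\.csv$".r // placeholder "search", implemented here as a key regex

  val request = new ListObjectsV2Request().withBucketName(bucket)
  var result: ListObjectsV2Result = null
  do {
    // ListBucket (ListObjectsV2) returns keys page by page; the matching happens client side
    result = s3.listObjectsV2(request)
    result.getObjectSummaries.asScala
      .filter(summary => pattern.findFirstIn(summary.getKey).isDefined)
      .foreach(summary => println(summary.getKey))
    request.setContinuationToken(result.getNextContinuationToken)
  } while (result.isTruncated)
}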