How do you search an Amazon S3 bucket?

渐次进展 2020-11-30 18:00

I have a bucket with thousands of files in it. How can I search the bucket? Is there a tool you can recommend?

21 Answers
  • 2020-11-30 18:19

    I tried it the following way:

    aws s3 ls s3://Bucket1/folder1/2019/ --recursive |grep filename.csv
    

    This outputs the actual path where the file exists:

    2019-04-05 01:18:35     111111 folder1/2019/03/20/filename.csv
    
  • 2020-11-30 18:21

    There are multiple options, none of them a simple "one shot" full-text solution:

    1. Key name pattern search: search for keys that start with a given string. If you design key names carefully, this can be a rather quick solution.

    2. Search metadata attached to keys: when you upload a file to AWS S3, you can process the content, extract some meta information, and attach it to the key in the form of custom headers. This lets you fetch key names and headers without fetching the complete content. The search has to be done sequentially; there is no "SQL-like" search option for this. With large files this can save a lot of network traffic and time.

    3. Store metadata in SimpleDB: like the previous point, but with the metadata stored in SimpleDB, where you have SQL-like select statements. With large data sets you may hit SimpleDB limits, which can be overcome (by partitioning metadata across multiple SimpleDB domains), but if you go really far, you may need another type of metadata database.

    4. Sequential full-text search of the content: process all the keys one by one. Very slow if you have too many keys to process.
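
    For option 1, here is a minimal sketch in Python, assuming boto3 is installed and credentials are configured. The function name find_keys and the bucket and prefix names in the comment are hypothetical. It lets S3 narrow the listing by key prefix and then applies a glob pattern to the returned key names locally:

```python
import fnmatch


def find_keys(s3_client, bucket, prefix="", pattern="*"):
    """Yield keys under `prefix` whose full key name matches the glob `pattern`.

    `s3_client` is assumed to behave like a boto3 S3 client; only
    get_paginator("list_objects_v2") is used.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    # S3 filters by prefix on the server side; the glob is applied locally.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page without matching objects has no "Contents" entry.
        for obj in page.get("Contents", []):
            if fnmatch.fnmatchcase(obj["Key"], pattern):
                yield obj["Key"]


# Example call (hypothetical bucket and prefix; needs boto3 and AWS credentials):
#   import boto3
#   for key in find_keys(boto3.client("s3"), "your-bucket", "folder1/2019/", "*filename.csv"):
#       print(key)
```

    The narrower the prefix, the fewer keys have to be listed, which is why careful key naming pays off here.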

    We have been storing 1440 versions of a file a day (one per minute) for a couple of years; with a versioned bucket this is easily possible. But getting an older version takes time, as one has to go through the versions sequentially. Sometimes I use a simple CSV index with records showing publication time plus version id; with this, I can jump to an older version rather quickly.
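
    The CSV index idea could look roughly like this in Python. This is only a sketch: it assumes two columns (publication time, version id) and timestamps written in one sortable format such as ISO 8601, and the names load_index and version_at as well as the sample rows are made up for illustration:

```python
import bisect
import csv
import io


def load_index(csv_file):
    """Read (publication_time, version_id) rows and sort them by time."""
    rows = [(row[0], row[1]) for row in csv.reader(csv_file) if row]
    rows.sort()
    return rows


def version_at(index, timestamp):
    """Return the version id published at or just before `timestamp`, or None."""
    times = [t for t, _ in index]
    # Binary search instead of walking the versions one by one.
    pos = bisect.bisect_right(times, timestamp)
    return index[pos - 1][1] if pos else None


# Tiny in-memory example; the timestamps and version ids are invented.
sample = io.StringIO("2020-11-30T18:00:00Z,v1\n2020-11-30T18:01:00Z,v2\n")
index = load_index(sample)
print(version_at(index, "2020-11-30T18:00:30Z"))  # prints v1
```

    The version id found this way can then be passed as the VersionId of a single get-object request, instead of paging through the version listing.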

    As you can see, AWS S3 is not on its own designed for full-text searches; it is a simple storage service.

  • 2020-11-30 18:25

    Here's a short and ugly way to search file names using the AWS CLI:

    aws s3 ls s3://your-bucket --recursive | grep your-search | cut -c 32-
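
    If cutting at a fixed column feels too fragile, the same filtering can be sketched in Python. This assumes the listing layout shown in this thread (date, time, size, key); the function names are hypothetical. Unlike the grep above, it matches against the key only, not the whole line:

```python
def key_from_ls_line(line):
    """Return the key from one line of `aws s3 ls --recursive` output, or None.

    Assumes four whitespace-separated fields: date, time, size, key.
    """
    # Split at most three times so that spaces inside the key are preserved.
    parts = line.split(None, 3)
    if len(parts) < 4:
        return None  # blank lines or anything that is not an object row
    return parts[3].rstrip("\r\n")


def search_ls_output(lines, needle):
    """Return the keys that contain `needle`."""
    keys = (key_from_ls_line(line) for line in lines)
    return [key for key in keys if key and needle in key]
```

    The listing could be piped in on standard input and passed to search_ls_output(sys.stdin, "your-search").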
    
  • 2020-11-30 18:27

    I faced the same problem. Searching in S3 should be much easier than it currently is. That's why I implemented this open source tool for searching in S3.

    SSEARCH is a fully open source S3 search tool. It was implemented with performance in mind as the critical factor, and according to the benchmarks it searches a bucket containing ~1000 files within seconds.

    Installation is simple. You only need to download the docker-compose file and run it with

    docker-compose up
    

    SSEARCH will start, and you can search for anything in any bucket you have.

  • 2020-11-30 18:31

    I did something like the following to find patterns in my bucket:

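    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.model.{ListObjectsRequest, ObjectListing}
    import scala.collection.JavaConverters._

    // Returns the common prefixes ("folders") under `prefix` in the bucket named by `dataPath`.
    def getListOfPrefixesFromS3(dataPath: String, prefix: String, delimiter: String, batchSize: Integer): List[String] = {
      val s3Client = AmazonS3ClientBuilder.defaultClient()
      val listObjectsRequest = new ListObjectsRequest().withBucketName(dataPath).withMaxKeys(batchSize).withPrefix(prefix).withDelimiter(delimiter)
      var objectListing: ObjectListing = null
      var res: List[String] = List()

      do {
        // Fetch one page of at most `batchSize` entries.
        objectListing = s3Client.listObjects(listObjectsRequest)
        // getCommonPrefixes returns a java.util.List, so convert it before appending.
        res = res ++ objectListing.getCommonPrefixes.asScala
        // Continue from where this page ended.
        listObjectsRequest.setMarker(objectListing.getNextMarker)
      } while (objectListing.isTruncated)
      res
    }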
    

    For larger buckets this takes too much time, since all the object summaries are returned by AWS and not only the ones that match the prefix and the delimiter. I am looking for ways to improve the performance, and so far I've only found that I should name the keys and organise them in buckets properly.

  • 2020-11-30 18:33

    S3 doesn't have a native "search this bucket" feature, since the actual content is unknown. Also, since S3 is key/value based, there is no native way to access many nodes at once, à la more traditional datastores that offer a (SELECT * FROM ... WHERE ...) query (in a SQL model).

    What you will need to do is perform ListBucket to get a listing of objects in the bucket, and then iterate over every item, performing a custom operation that you implement, which is your search.
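
    A rough sketch of that list-then-iterate loop in Python, assuming a boto3-style S3 client. The name search_bucket and the matches callback are hypothetical, not part of any library; the callback stands in for whatever custom search you implement:

```python
def search_bucket(s3_client, bucket, matches, prefix=""):
    """Yield keys of objects for which matches(key, body_bytes) is true.

    Every object under `prefix` is downloaded in full, so this is slow
    and costs one GET request per object.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Fetch the whole object body into memory as bytes.
            body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
            if matches(key, body):
                yield key


# Example call (hypothetical bucket name; needs boto3 and AWS credentials):
#   import boto3
#   hits = search_bucket(boto3.client("s3"), "your-bucket",
#                        lambda key, body: b"your-search" in body)
#   for key in hits:
#       print(key)
```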
