How do I translate an AWS S3 url into a bucket name for boto?

Anonymous (unverified), submitted 2019-12-03 00:59:01

Question:

I'm trying to access the http://s3.amazonaws.com/commoncrawl/parse-output/segment/ bucket with boto. I can't figure out how to translate this into a name for boto.s3.bucket.Bucket().

This is the gist of what I'm going for:

    s3 = boto.connect_s3()
    cc = boto.s3.bucket.Bucket(connection=s3, name='commoncrawl/parse-output/segment')
    requester = {'x-amz-request-payer':'requester'}
    contents = cc.list(headers=requester)
    for i,item in enumerate(contents):
        print item.__repr__()

I get "boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request ... The specified bucket is not valid..."

Answer 1:

The bucket name would be commoncrawl. Everything that appears after that is really just part of the name of the keys that appear in the bucket.
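
To make that split concrete, here is a minimal sketch that separates a path-style URL into the bucket name and the key prefix using only the standard library. The helper name `split_path_style_url` is hypothetical, the code is written for Python 3, and it assumes the bucket is the first path segment, as in the URL from the question.

```python
from urllib.parse import urlparse


def split_path_style_url(url):
    """Split a path-style S3 URL into (bucket, key_prefix).

    Hypothetical helper for illustration; assumes the bucket is the
    first path segment, as in http://s3.amazonaws.com/bucket/key.
    """
    path = urlparse(url).path.lstrip('/')
    # Everything before the first "/" is the bucket; the rest is key text.
    bucket, _, key_prefix = path.partition('/')
    return bucket, key_prefix


bucket, prefix = split_path_style_url(
    'http://s3.amazonaws.com/commoncrawl/parse-output/segment/')
print(bucket)  # commoncrawl
print(prefix)  # parse-output/segment/
```

With boto, the bucket name would then be used for the bucket and the remainder passed as the `prefix` argument when listing, for example `cc.list(prefix=prefix, headers=requester)`. That call is not exercised in the sketch.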



Answer 2:

The AWS documentation lists four possible URL formats for S3. Here's something I just threw together to extract the bucket and region for all of the different URL formats.

    import re

    def bucket_name_from_url(url):
        """Get the bucket name and region from a URL, matching any of the
        different formats for S3 URLs:

        * http://bucket.s3.amazonaws.com
        * http://bucket.s3-aws-region.amazonaws.com
        * http://s3.amazonaws.com/bucket
        * http://s3-aws-region.amazonaws.com/bucket

        Returns (bucket name, region).
        """
        # Dots are escaped so they match only a literal ".", and the
        # trailing "/" is optional so the bare-host forms above also match.
        match = re.search(r'^https?://([^.]+)\.s3\.amazonaws\.com(?:/|$)', url)
        if match:
            return match.group(1), None

        match = re.search(r'^https?://([^.]+)\.s3-([^.]+)\.amazonaws\.com(?:/|$)', url)
        if match:
            return match.group(1), match.group(2)

        match = re.search(r'^https?://s3\.amazonaws\.com/([^/]+)', url)
        if match:
            return match.group(1), None

        match = re.search(r'^https?://s3-([^.]+)\.amazonaws\.com/([^/]+)', url)
        if match:
            return match.group(2), match.group(1)

        return None, None

Something like this should really go into boto ... Amazon, I hope you're listening
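
One limitation of the `[^.]+` patterns is that bucket names may themselves contain dots. As a hedged alternative, the sketch below inspects the hostname with the standard library instead of a regular expression. The function name `bucket_and_region` and the constant `SUFFIX` are hypothetical, the code is written for Python 3, and it only covers the four URL formats listed in the docstring above.

```python
from urllib.parse import urlparse

SUFFIX = '.amazonaws.com'


def bucket_and_region(url):
    """Return (bucket, region) for the four S3 URL formats, else (None, None).

    Hypothetical helper; region is None when the URL does not name one.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ''
    if not host.endswith(SUFFIX):
        return None, None
    prefix = host[:-len(SUFFIX)]  # e.g. "bucket.s3-us-west-2" or "s3"
    # The last dot-separated label is the S3 endpoint: "s3" or "s3-<region>".
    bucket, _, endpoint = prefix.rpartition('.')
    if endpoint == 's3':
        region = None
    elif endpoint.startswith('s3-'):
        region = endpoint[3:]
    else:
        return None, None
    if not bucket:
        # Path-style URL: the bucket is the first path segment.
        bucket = parsed.path.lstrip('/').split('/', 1)[0]
    return (bucket or None), region


print(bucket_and_region('http://s3.amazonaws.com/commoncrawl/parse-output/segment/'))
# ('commoncrawl', None)
```

Splitting on the last label of the hostname is a design choice that keeps dotted bucket names intact; it does not validate that the bucket or region actually exists.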



Answer 3:

I extended Mark's answer to also return keys:

    #!/usr/bin/env python

    import re

    def parse_s3_url(url):
        # returns bucket_name, region, key

        bucket_name = None
        region = None
        key = None

        # http://bucket.s3.amazonaws.com/key1/key2
        match = re.search(r'^https?://([^.]+)\.s3\.amazonaws\.com(.*?)$', url)
        if match:
            bucket_name, key = match.group(1), match.group(2)

        # http://bucket.s3-aws-region.amazonaws.com/key1/key2
        match = re.search(r'^https?://([^.]+)\.s3-([^.]+)\.amazonaws\.com(.*?)$', url)
        if match:
            bucket_name, region, key = match.group(1), match.group(2), match.group(3)

        # http://s3.amazonaws.com/bucket/key1/key2
        match = re.search(r'^https?://s3\.amazonaws\.com/([^/]+)(.*?)$', url)
        if match:
            bucket_name, key = match.group(1), match.group(2)

        # http://s3-aws-region.amazonaws.com/bucket/key1/key2
        match = re.search(r'^https?://s3-([^.]+)\.amazonaws\.com/([^/]+)(.*?)$', url)
        if match:
            bucket_name, region, key = match.group(2), match.group(1), match.group(3)

        # Strip surrounding slashes; empty or missing parts become None
        return [x.strip('/') if x else None for x in (bucket_name, region, key)]


Answer 4:

Based on Mark's answer, I've made a small pyparsing script that is clearer to me (it includes possible key matches):

    #!/usr/bin/env python

    from pyparsing import Word, alphanums, Or, Optional, Combine

    schema = Or(['http://', 'https://']).setResultsName('schema')
    word = Word(alphanums + '-', min=1)
    bucket_name = word.setResultsName('bucket')
    region = word.setResultsName('region')

    key = Optional('/' + word.setResultsName('key'))

    # bucket.s3.amazonaws.com
    opt1 = Combine(schema + bucket_name + '.s3.amazonaws.com' + key)

    # bucket.s3-aws-region.amazonaws.com
    opt2 = Combine(schema + bucket_name + '.' + region + '.amazonaws.com' + key)

    # s3.amazonaws.com/bucket
    opt3 = Combine(schema + 's3.amazonaws.com/' + bucket_name + key)

    # s3-aws-region.amazonaws.com/bucket
    opt4 = Combine(schema + region + ".amazonaws.com/" + bucket_name + key)

    tests = [
        "http://bucket-name.s3.amazonaws.com",
        "https://bucket-name.s3-aws-region-name.amazonaws.com",
        "http://s3.amazonaws.com/bucket-name",
        "https://s3-aws-region-name.amazonaws.com/bucket-name",
        "http://bucket-name.s3.amazonaws.com/key-name",
        "https://bucket-name.s3-aws-region-name.amazonaws.com/key-name",
        "http://s3.amazonaws.com/bucket-name/key-name",
        "https://s3-aws-region-name.amazonaws.com/bucket-name/key-name",
    ]

    s3_url = Or([opt1, opt2, opt3, opt4]).setResultsName('url')

    for test in tests:
        result = s3_url.parseString(test)
        print("found url: " + str(result.url))
        print("schema: " + str(result.schema))
        print("bucket name: " + str(result.bucket))
        print("key name: " + str(result.key))

Originally I made Mark's script also retrieve the key (object):

    import re

    def parse_s3_url(url):
        """Get the bucket name, region and key from a URL, matching any of
        the different formats for S3 URLs:

        * http://bucket.s3.amazonaws.com
        * http://bucket.s3-aws-region.amazonaws.com
        * http://s3.amazonaws.com/bucket
        * http://s3-aws-region.amazonaws.com/bucket

        Returns (bucket name, region, key); key is None when the URL has none.
        """
        # Each pattern ends with an optional "/key" group so that the
        # group numbers used below actually exist.
        match = re.search(r'^https?://([^.]+)\.s3\.amazonaws\.com(?:/(.*))?$', url)
        if match:
            return match.group(1), None, match.group(2)

        match = re.search(r'^https?://([^.]+)\.s3-([^.]+)\.amazonaws\.com(?:/(.*))?$', url)
        if match:
            return match.group(1), match.group(2), match.group(3)

        match = re.search(r'^https?://s3\.amazonaws\.com/([^/]+)(?:/(.*))?$', url)
        if match:
            return match.group(1), None, match.group(2)

        match = re.search(r'^https?://s3-([^.]+)\.amazonaws\.com/([^/]+)(?:/(.*))?$', url)
        if match:
            return match.group(2), match.group(1), match.group(3)

        return None, None, None


Answer 5:

Here is my JS version:

    function parseS3Url(url) {
      // Process all AWS S3 URL cases

      url = decodeURIComponent(url);
      let match;

      // http://s3.amazonaws.com/bucket/key1/key2
      match = url.match(/^https?:\/\/s3\.amazonaws\.com\/([^\/]+)\/?(.*?)$/);
      if (match) {
        return {
          bucket: match[1],
          key: match[2],
          region: ""
        };
      }

      // http://s3-aws-region.amazonaws.com/bucket/key1/key2
      match = url.match(/^https?:\/\/s3-([^.]+)\.amazonaws\.com\/([^\/]+)\/?(.*?)$/);
      if (match) {
        return {
          bucket: match[2],
          key: match[3],
          region: match[1]
        };
      }

      // http://bucket.s3.amazonaws.com/key1/key2
      match = url.match(/^https?:\/\/([^.]+)\.s3\.amazonaws\.com\/?(.*?)$/);
      if (match) {
        return {
          bucket: match[1],
          key: match[2],
          region: ""
        };
      }

      // http://bucket.s3-aws-region.amazonaws.com/key1/key2
      match = url.match(/^https?:\/\/([^.]+)\.s3-([^.]+)\.amazonaws\.com\/?(.*?)$/);
      if (match) {
        return {
          bucket: match[1],
          key: match[3],
          region: match[2]
        };
      }

      return {
        bucket: "",
        key: "",
        region: ""
      };
    }

