I'm using the AWS CLI to copy files from an S3 bucket to my R machine using a command like below:
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '*trans*' --region us-east-1"
)
This works as expected, i.e. it copies all files in my_bucket_location that have "trans" in the filename.
The problem that I am facing is that I have other files with similar naming conventions that I don't want to import in this step. As an example, in the list below I only want to copy the first two files, not the last two:
File list
trans_120215.csv
trans_130215.csv
sum_trans_120215.csv
sum_trans_130215.csv
If I were using regex I could make the pattern more specific, e.g. "^trans_\\d+", to bring in just the first two files, but this doesn't seem to be possible using the AWS CLI. So my question is: is there a way to do more complex pattern matching with the AWS CLI, like below?
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '^trans_\\d+' --region us-east-1"
)
Please note that I can only use information about the file that I want, i.e. that it matches the pattern "^trans_\\d+". I can't rely on the unwanted files starting with sum_, because this is only an example; there could be other files with similar names, such as "check_trans_120215.csv".
I have considered the alternatives below, but I'm hoping there is a way to adjust the copy command to avoid going down either of these routes:
- Listing all items in the bucket > using regex in R to specify the files that I want > only importing those files (sketched in R after this list)
- Keeping the copy command as it is > deleting the unwanted files on the R machine after the copy
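A minimal sketch of the first alternative, assuming the AWS CLI is on the PATH and that the keys contain no spaces; the bucket, pattern, and local path are the ones from the question:

keys <- system(
  "aws s3 ls s3://my_bucket_location/ --region us-east-1",
  intern = TRUE
)
# `aws s3 ls` prints date, time, size, and key; keep the 4th field (the key)
parts <- strsplit(trimws(keys), "\\s+")
files <- vapply(parts, function(p) if (length(p) >= 4) p[4] else NA_character_, character(1))
files <- files[!is.na(files)]
# apply the real regex in R
wanted <- files[grepl("^trans_\\d+", files)]
for (f in wanted) {
  system(sprintf(
    "aws s3 cp 's3://my_bucket_location/%s' ~/my_r_location/ --region us-east-1",
    f
  ))
}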
The alternatives that you have listed are the best options, because the S3 CLI doesn't support regex.
Use of Exclude and Include Filters:
Currently, there is no support for the use of UNIX style wildcards in a command's path arguments. However, most commands have --exclude "<value>" and --include "<value>" parameters that can achieve the desired result. These parameters perform pattern matching to either exclude or include a particular file or object. The following pattern symbols are supported.
*: Matches everything
?: Matches any single character
[sequence]: Matches any character in sequence
[!sequence]: Matches any character not in sequence
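Because these filters are glob patterns rather than true regex, and (assuming the filter is matched against the whole key name relative to the source prefix, with no implicit leading *) the [sequence] symbol can get close to "^trans_\\d+" for the example files above. A hedged sketch; note that [0-9] only constrains the single character after the underscore and cannot enforce "one or more digits":

system(
  "aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include 'trans_[0-9]*' --region us-east-1"
)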
Putting this here for others to find, since I just had to figure this out. Here's what I came up with:
s3cmd del $(s3cmd ls s3://[BUCKET]/ | grep '.*s3://[BUCKET]/[FILENAME]' | cut -c 41-)
You can put the regex in the grep search string. For instance, I was searching for specific files to delete (hence the s3cmd del); my regex looked like '2016-11-04.*s3.*[DN][RS].*'. You may have to adjust the cut offsets for your use. This should also work with s3cmd get.
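A hedged get variant of the same idea (untested), wrapped in system() for use from R as in the question. It uses awk to pull the 4th whitespace-separated field (the s3:// path) instead of a fixed-column cut, so there is no offset to adjust; [BUCKET] is a placeholder as above, and grep '/trans_[0-9]' anchors the match right after the slash so sum_trans files are not picked up:

system(
  "s3cmd ls s3://[BUCKET]/ | grep '/trans_[0-9]' | awk '{print $4}' | xargs -I{} s3cmd get {} ~/my_r_location/"
)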
Here is the same solution for deletion; you can replace rm with cp. You can do it using the AWS CLI (https://aws.amazon.com/cli/) and some Unix commands.
This AWS CLI command should work:
aws s3 rm s3://<your_bucket_name> --exclude "*" --include "<your_regex>"
If you want to include sub-folders, add the flag --recursive.
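A hedged sketch of the recursive form, run from R as in the question; note that the --include filter uses the glob symbols quoted earlier rather than full regex, and --dryrun previews the matches without deleting anything:

system(
  "aws s3 rm s3://<your_bucket_name> --recursive --exclude '*' --include 'trans_[0-9]*' --dryrun"
)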
Or with Unix commands:
aws s3 ls s3://<your_bucket_name>/ | awk '{print $4}' | xargs -I% <your_os_shell> -c 'aws s3 rm s3://<your_bucket_name>/%'
Explanation:
- list all files in the bucket --pipe-->
- get the 4th column (it's the file name) --pipe--> // you can insert a Linux command here to match your pattern
- run the delete command with the AWS CLI
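Adapted for the copy case in the original question (a sketch, assuming keys without spaces and a grep that supports -E for the extended regex), the same pipeline becomes:

system(
  "aws s3 ls s3://my_bucket_location/ | awk '{print $4}' | grep -E '^trans_[0-9]+' | xargs -I% aws s3 cp s3://my_bucket_location/% ~/my_r_location/ --region us-east-1"
)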
Source: https://stackoverflow.com/questions/36215713/how-to-use-aws-cli-to-only-copy-files-in-s3-bucket-that-match-a-given-string-pat