Question
I'm using the AWS CLI to copy files from an S3 bucket to my R machine using a command like below:
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '*trans*' --region us-east-1"
)
This works as expected, i.e. it copies all files in my_bucket_location whose filenames contain "trans".
The problem that I am facing is that I have other files with similar naming conventions that I don't want to import in this step. As an example, in the list below I only want to copy the first two files, not the last two:
File list
trans_120215.csv
trans_130215.csv
sum_trans_120215.csv
sum_trans_130215.csv
If I were using regex I could make it more specific, like "^trans_\\d+", to bring in just the first two files, but this doesn't seem to be possible with the AWS CLI. So my question is: is there a way to do more complex pattern matching with the AWS CLI, like below?
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '^trans_\\d+' --region us-east-1"
)
Please note that I can only use information about the file I want, i.e. that it matches the pattern "^trans_\\d+". I can't rely on the fact that the unwanted files start with sum_, because this is only an example; there could be other files with similar names, like "check_trans_120215.csv".
I have considered the alternatives below, but I am hoping there is a way to adjust the copy command and avoid going down either of these routes:
- Listing all items in the bucket > using regex in R to specify the files that I want > only importing those files (sketched in R after this list)
- Keeping the copy command as it is > delete unwanted files on the R machine after the copy
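For completeness, here is a minimal sketch of that first alternative, assuming the aws CLI is on the PATH and reusing the bucket and paths from above:

# `aws s3 ls` prints date, time, size and name, so the 4th field is the key
keys <- system(
  "aws s3 ls s3://my_bucket_location/ --region us-east-1 | awk '{print $4}'",
  intern = TRUE
)
# apply the regex in R
wanted <- grep("^trans_\\d+", keys, value = TRUE)
# copy each matching file down individually
for (k in wanted) {
  system(sprintf(
    "aws s3 cp 's3://my_bucket_location/%s' ~/my_r_location/ --region us-east-1",
    k
  ))
}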
Answer 1:
The alternatives that you have listed are the best options, because the S3 CLI doesn't support regex. From the AWS documentation:
Use of Exclude and Include Filters:
Currently, there is no support for the use of UNIX style wildcards in a command's path arguments. However, most commands have --exclude "" and --include "" parameters that can achieve the desired result. These parameters perform pattern matching to either exclude or include a particular file or object. The following pattern symbols are supported.
*: Matches everything
?: Matches any single character
[sequence]: Matches any character in sequence
[!sequence]: Matches any character not in sequence
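That said, because these filters are matched against the whole path relative to the source prefix, a pattern with no leading * is effectively anchored at the start of the file name. So a sketch like the one below, reusing the paths from the question, approximates "^trans_\\d+" (with the caveats that [0-9] matches exactly one character, so something like trans_1x.csv would also match, and that it assumes the files sit directly under the prefix):

system(
  "aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include 'trans_[0-9]*' --region us-east-1"
)

This would copy trans_120215.csv and trans_130215.csv but skip sum_trans_120215.csv, which does not start with trans_.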
Answer 2:
Putting this here for others to find, since I just had to figure this out. Here's what I came up with:
s3cmd del $(s3cmd ls s3://[BUCKET]/ | grep '.*s3://[BUCKET]/[FILENAME]' | cut -c 41-)
You can put the regex in the grep search string. For instance, I was searching for specific files to delete (hence the s3cmd del); my regex looked like '2016-11-04.*s3.*[DN][RS].*'. You may have to adjust the cut offset for your use. This should also work with s3cmd get.
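Assembled into one line with a hypothetical bucket name (and wrapped in the question's system() style), that looks roughly like this; the cut offset depends on your s3cmd ls output, as noted above:

system(
  "s3cmd del $(s3cmd ls s3://my_bucket_location/ | grep '2016-11-04.*s3.*[DN][RS].*' | cut -c 41-)"
)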
Answer 3:
Here is the same solution for deletion; you can replace rm with cp. You can do it with the AWS CLI (https://aws.amazon.com/cli/) and some Unix commands.
This AWS CLI command should work:
aws s3 rm s3://<your_bucket_name> --exclude "*" --include "<your_regex>"
If you want to include sub-folders, add the --recursive flag.
Or with Unix commands:
aws s3 ls s3://<your_bucket_name>/ | awk '{print $4}' | xargs -I% <your_os_shell> -c 'aws s3 rm s3://<your_bucket_name>/%'
Explanation:
- list all files in the bucket --pipe-->
- take the 4th field (the file name) --pipe--> // you can replace this step with a Linux command that matches your pattern (see the sketch below)
- run the delete command with the AWS CLI
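Applied to the question's pattern, with grep doing the regex matching between the ls and the copy (bucket and destination reused from the question), that pipeline might look like:

system(
  "aws s3 ls s3://my_bucket_location/ | awk '{print $4}' | grep -E '^trans_[0-9]+' | xargs -I% aws s3 cp 's3://my_bucket_location/%' ~/my_r_location/ --region us-east-1"
)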
Source: https://stackoverflow.com/questions/36215713/how-to-use-aws-cli-to-only-copy-files-in-s3-bucket-that-match-a-given-string-pat