If I have existing files on Amazon's S3, what's the easiest way to get their md5sum without having to download the files?
Thanks
This is a very old question, but I had a hard time finding the information below, and this is one of the first places I could find, so I wanted to detail it in case anyone needs it.
The ETag is an MD5. But for multipart-uploaded files, the ETag is computed from the concatenation of the MD5s of each uploaded part, so you don't need to compute the MD5 on the server side. Just get the ETag and that's all.
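If you just need that ETag without downloading the object, a HEAD request is enough. A minimal sketch using boto3 (the bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")

# HEAD request: returns only the object's metadata, not its body.
response = s3.head_object(Bucket="my-bucket", Key="path/to/file")

# The ETag header comes back wrapped in double quotes.
etag = response["ETag"].strip('"')
print(etag)

For objects uploaded in a single PUT, this value is the plain MD5 of the content; for multipart uploads it is the composite value described below.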
As @EmersonFarrugia said in this answer:
Say you uploaded a 14MB file and your part size is 5MB. Calculate 3 MD5 checksums corresponding to each part, i.e. the checksum of the first 5MB, the second 5MB, and the last 4MB. Then take the checksum of their concatenation. Since MD5 checksums are hex representations of binary data, just make sure you take the MD5 of the decoded binary concatenation, not of the ASCII or UTF-8 encoded concatenation. When that's done, add a hyphen and the number of parts to get the ETag.
So the only other things you need are the ETag and the upload part size. The ETag has a -NumberOfParts suffix, so you can divide the object size by the number of parts to estimate the part size. 5 MB is the minimum part size and a common default. In practice the part size is a whole number of megabytes, so you won't see values like 7.25 MB per part, which makes the part size easy to work out.
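If you want to verify a local file against such an ETag, you can reproduce the calculation described above. A rough Python sketch, assuming whole-megabyte part sizes (multipart_etag and guess_part_size_mb are just illustrative helper names):

import hashlib
import math

def multipart_etag(path, part_size_mb):
    # MD5 each part, then MD5 the concatenated binary digests.
    part_size = part_size_mb * 1024 * 1024
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            digests.append(hashlib.md5(chunk).digest())
    if len(digests) == 1:
        # Single part: the ETag is just the file's MD5.
        return digests[0].hex()
    return hashlib.md5(b"".join(digests)).hexdigest() + "-" + str(len(digests))

def guess_part_size_mb(object_size_bytes, etag):
    # Rough estimate of the part size from the "-NumberOfParts" suffix.
    # This can be ambiguous for some sizes, so verify by recomputing the
    # ETag and trying neighbouring whole-MB values if it does not match.
    parts = int(etag.split("-")[1])
    return math.ceil(object_size_bytes / parts / (1024 * 1024))

Compare multipart_etag(local_path, part_size_mb) with the ETag from the HEAD request; if they match, the local file and the S3 object are almost certainly identical.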
Here is a script that does this on OS X, with a Linux version in the comments: https://gist.github.com/emersonf/7413337
I'll leave both scripts here in case the page above becomes inaccessible in the future:
Linux version:
#!/bin/bash
set -euo pipefail

if [ $# -ne 2 ]; then
    echo "Usage: $0 file partSizeInMb"
    exit 0
fi

file=$1
if [ ! -f "$file" ]; then
    echo "Error: $file not found."
    exit 1
fi

partSizeInMb=$2
fileSizeInMb=$(du -m "$file" | cut -f 1)

# Number of parts, rounded up.
parts=$((fileSizeInMb / partSizeInMb))
if [[ $((fileSizeInMb % partSizeInMb)) -gt 0 ]]; then
    parts=$((parts + 1))
fi

checksumFile=$(mktemp -t s3md5.XXXXXXXXXXXXX)

# MD5 each part and collect the hex digests.
for (( part=0; part<$parts; part++ ))
do
    skip=$((partSizeInMb * part))
    dd bs=1M count="$partSizeInMb" skip="$skip" if="$file" 2> /dev/null | md5sum >> "$checksumFile"
done

# The ETag is the MD5 of the concatenated binary digests, plus "-<parts>".
etag=$(echo $(xxd -r -p "$checksumFile" | md5sum)-$parts | sed 's/ --/-/')
echo -e "${1}\t${etag}"
rm "$checksumFile"
OS X version:
#!/bin/bash

if [ $# -ne 2 ]; then
    echo "Usage: $0 file partSizeInMb"
    exit 0
fi

file=$1
if [ ! -f "$file" ]; then
    echo "Error: $file not found."
    exit 1
fi

partSizeInMb=$2
fileSizeInMb=$(du -m "$file" | cut -f 1)

parts=$((fileSizeInMb / partSizeInMb))
if [[ $((fileSizeInMb % partSizeInMb)) -gt 0 ]]; then
    parts=$((parts + 1))
fi

checksumFile=$(mktemp -t s3md5)

for (( part=0; part<$parts; part++ ))
do
    skip=$((partSizeInMb * part))
    dd bs=1m count="$partSizeInMb" skip="$skip" if="$file" 2> /dev/null | md5 >> "$checksumFile"
done

echo $(xxd -r -p "$checksumFile" | md5)-$parts
rm "$checksumFile"
This works for me. In PHP, you can compare the checksum of a local file against the S3 object's ETag like this:
// get local file md5
$checksum_local_file = md5_file('/home/file');

// compare checksum between local file and s3 file
public function compareChecksumFile($file_s3, $checksum_local_file) {
    $Connection = new AmazonS3();
    $bucket = amazon_bucket;
    // get header
    $header = $Connection->get_object_headers($bucket, $file_s3);
    if (empty($header) || !is_object($header)) {
        throw new RuntimeException('checksum error');
    }
    $head = $header->header;
    if (empty($head) || !is_array($head)) {
        throw new RuntimeException('checksum error');
    }
    // get etag (md5 amazon)
    $etag = $head['etag'];
    if (empty($etag)) {
        throw new RuntimeException('checksum error');
    }
    // remove quotes
    $checksumS3 = str_replace('"', '', $etag);
    // compare md5
    if ($checksum_local_file === $checksumS3) {
        return TRUE;
    } else {
        return FALSE;
    }
}
The ETag does not seem to be the MD5 for multipart uploads (as per Gael Fraiteur's comment). In these cases it contains a suffix consisting of a minus sign and a number, and even the part before the minus does not seem to be the MD5 of the whole file, even though it is the same length as an MD5. As explained above, that part is the MD5 of the concatenated part MD5s, and the suffix is the number of parts uploaded.
The easiest way would be to set the checksum yourself as metadata before you upload these files to your bucket:
ObjectMetadata md = new ObjectMetadata();
md.setContentMD5("foobar"); // base64-encoded MD5 of the file content (placeholder here)
PutObjectRequest req = new PutObjectRequest(BUCKET, KEY, new File("/path/to/file")).withMetadata(md);
tm.upload(req).waitForUploadResult(); // tm is a TransferManager
Now you can access this metadata without downloading the file:
ObjectMetadata md2 = s3Client.getObjectMetadata(BUCKET, KEY);
System.out.println(md2.getContentMD5());
Source: https://github.com/aws/aws-sdk-java/issues/1711
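A boto3 sketch of the same idea, in case you are not on the Java SDK; the user metadata key md5 and the bucket/key/path names are my own placeholders, not anything S3 defines:

import hashlib
import boto3

s3 = boto3.client("s3")
bucket, key, path = "my-bucket", "my-key", "/path/to/file"  # placeholders

# Compute the local MD5 once, before uploading.
md5 = hashlib.md5()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        md5.update(chunk)

# Store it as user-defined metadata alongside the object.
with open(path, "rb") as f:
    s3.put_object(Bucket=bucket, Key=key, Body=f, Metadata={"md5": md5.hexdigest()})

# Later: a HEAD request returns the metadata without downloading the object.
head = s3.head_object(Bucket=bucket, Key=key)
print(head["Metadata"]["md5"])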
Here's the code to get the S3 ETag for an object in PowerShell, converted from C#.
function Get-ETag {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$true)]
        [string]$Path,
        [Parameter(Mandatory=$true)]
        [int]$ChunkSizeInMb
    )

    $returnMD5 = [string]::Empty
    [int]$chunkSize = $ChunkSizeInMb * [Math]::Pow(2, 20)

    $crypto = New-Object System.Security.Cryptography.MD5CryptoServiceProvider
    [int]$hashLength = $crypto.HashSize / 8

    $stream = [System.IO.File]::OpenRead($Path)

    if($stream.Length -gt $chunkSize) {
        # Multipart case: hash each chunk, then hash the concatenated digests.
        $chunkCount = [int][Math]::Ceiling([double]$stream.Length / [double]$chunkSize)
        [byte[]]$hash = New-Object byte[]($chunkCount * $hashLength)
        $hashStream = New-Object System.IO.MemoryStream(,$hash)
        [long]$numBytesLeftToRead = $stream.Length
        while($numBytesLeftToRead -gt 0) {
            $numBytesCurrentRead = [int][Math]::Min($numBytesLeftToRead, $chunkSize)
            $buffer = New-Object byte[] $numBytesCurrentRead
            $numBytesLeftToRead -= $stream.Read($buffer, 0, $numBytesCurrentRead)
            $tmpHash = $crypto.ComputeHash($buffer)
            $hashStream.Write($tmpHash, 0, $hashLength)
        }
        $returnMD5 = [System.BitConverter]::ToString($crypto.ComputeHash($hash)).Replace("-", "").ToLower() + "-" + $chunkCount
    }
    else {
        # Single-part case: the ETag is just the MD5 of the file.
        $returnMD5 = [System.BitConverter]::ToString($crypto.ComputeHash($stream)).Replace("-", "").ToLower()
    }

    $stream.Close()
    $returnMD5
}