If I have existing files on Amazon's S3, what's the easiest way to get their md5sum without having to download the files?
Thanks
I have cross-checked jets3t and the management console against the uploaded files' MD5 sums, and the ETag seems to be equal to the MD5 sum. You can just view the properties of the file in the AWS management console:
https://console.aws.amazon.com/s3/home
I found that s3cmd has a --list-md5 option that can be used with the ls command, e.g.
s3cmd ls --list-md5 s3://bucket_of_mine/
Hope this helps.
The code below works for me to compare a local file's checksum with the S3 ETag. I used Python:
import hashlib

import boto3

def md5_checksum(filename):
    """MD5 of the whole file as hex; matches the ETag of a single-part upload."""
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        for data in iter(lambda: f.read(1024 * 1024), b''):
            m.update(data)
    return m.hexdigest()

def etag_checksum(filename, chunk_size=8 * 1024 * 1024):
    """Multipart-style ETag; chunk_size must match the part size used for the upload."""
    md5s = []
    with open(filename, 'rb') as f:
        for data in iter(lambda: f.read(chunk_size), b''):
            md5s.append(hashlib.md5(data).digest())
    m = hashlib.md5(b"".join(md5s))
    return '{}-{}'.format(m.hexdigest(), len(md5s))

def etag_compare(filename, etag):
    et = etag.strip('"')  # strip quotes
    if '-' in et:
        return et == etag_checksum(filename)
    return et == md5_checksum(filename)

def main():
    # s3_accesskey, s3_secret, bucket_name, your_key and filename are placeholders to fill in.
    session = boto3.Session(
        aws_access_key_id=s3_accesskey,
        aws_secret_access_key=s3_secret
    )
    s3 = session.client('s3')
    # head_object returns the metadata (including the ETag) without fetching the body.
    obj_dict = s3.head_object(Bucket=bucket_name, Key=your_key)
    etag = obj_dict['ETag']
    print('etag', etag)
    validation = etag_compare(filename, etag)
    print(validation)
    return validation
AWS's documentation of ETag says:
The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:
- Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
- Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
- If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
Reference: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
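As a rough illustration of the rules quoted above, here is a small Python sketch that sorts an ETag by its shape. The function name classify_etag and the regular expression are my own assumptions, not part of any AWS SDK. Note that shape alone cannot prove a plain 32-character ETag is an MD5: per the quoted documentation, objects encrypted with SSE-C or SSE-KMS get ETags that are not MD5 digests.

```python
import re

# Assumed shape: 32 hex characters, optionally followed by "-<part count>".
_ETAG_RE = re.compile(r'^([0-9a-f]{32})(?:-(\d+))?$')

def classify_etag(etag):
    """Return ('plain', None), ('multipart', n) or ('unknown', None).

    'plain' only means the ETag looks like a single MD5 hex digest;
    whether it really is one depends on how the object was created and encrypted.
    """
    value = etag.strip().strip('"').lower()
    match = _ETAG_RE.match(value)
    if match is None:
        return ('unknown', None)
    if match.group(2) is None:
        return ('plain', None)
    return ('multipart', int(match.group(2)))
```

Only a 'plain' result is worth comparing against a straight md5sum of the local file; a 'multipart' result needs the chunked calculation.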
For anyone who has spent time searching for why the MD5 is not the same as the ETag in S3:
For multipart uploads, the ETag is calculated per chunk of data: take the MD5 of each chunk, concatenate those MD5 digests, take the MD5 of the concatenation, and append the number of chunks at the end.
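To make the steps concrete, here is a minimal Python sketch of that calculation on in-memory bytes. The helper name multipart_etag is hypothetical, and the tiny chunk size in the usage note is only for illustration; a real comparison has to use the part size the uploader actually used.

```python
import hashlib

def multipart_etag(data, chunk_size):
    """Compute a multipart-style ETag for in-memory bytes (illustrative helper)."""
    # MD5 of each chunk, kept as raw 16-byte digests (not hex).
    digests = [hashlib.md5(data[i:i + chunk_size]).digest()
               for i in range(0, len(data), chunk_size)]
    # MD5 of the concatenated digests, then "-<number of chunks>".
    combined = hashlib.md5(b''.join(digests)).hexdigest()
    return '{}-{}'.format(combined, len(digests))
```

For example, multipart_etag(b'abcdefghij', 4) hashes the chunks b'abcd', b'efgh' and b'ij', so the result ends in "-3".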
Here is a C# version to generate the hash:
string etag = HashOf("file.txt",8);
Source code:
using System;
using System.IO;
using System.Security.Cryptography;

private string HashOf(string filename, int chunkSizeInMb)
{
    string returnMD5 = string.Empty;
    int chunkSize = chunkSizeInMb * 1024 * 1024;
    using (var crypto = new MD5CryptoServiceProvider())
    {
        int hashLength = crypto.HashSize / 8;
        using (var stream = File.OpenRead(filename))
        {
            if (stream.Length > chunkSize)
            {
                // Multipart case: hash each chunk, then hash the concatenated chunk hashes.
                int chunkCount = (int)Math.Ceiling((double)stream.Length / (double)chunkSize);
                byte[] hash = new byte[chunkCount * hashLength];
                Stream hashStream = new MemoryStream(hash);
                long nByteLeftToRead = stream.Length;
                while (nByteLeftToRead > 0)
                {
                    int nByteCurrentRead = (int)Math.Min(nByteLeftToRead, chunkSize);
                    byte[] buffer = new byte[nByteCurrentRead];
                    // Stream.Read may return fewer bytes than requested, so loop until the chunk is full.
                    int offset = 0;
                    while (offset < nByteCurrentRead)
                    {
                        int n = stream.Read(buffer, offset, nByteCurrentRead - offset);
                        if (n == 0) throw new EndOfStreamException();
                        offset += n;
                    }
                    nByteLeftToRead -= nByteCurrentRead;
                    byte[] tmpHash = crypto.ComputeHash(buffer);
                    hashStream.Write(tmpHash, 0, hashLength);
                }
                returnMD5 = BitConverter.ToString(crypto.ComputeHash(hash)).Replace("-", string.Empty).ToLower() + "-" + chunkCount;
            }
            else
            {
                // Single-part case: the ETag is the plain MD5 of the file.
                returnMD5 = BitConverter.ToString(crypto.ComputeHash(stream)).Replace("-", string.Empty).ToLower();
            }
        }
    }
    return returnMD5;
}
Here is the code to get the MD5 hash, as of 2017:
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import org.apache.commons.codec.binary.Base64;

public class GenerateMD5 {
    public static void main(String[] args) throws Exception {
        String s = "<CORSConfiguration> <CORSRule> <AllowedOrigin>http://www.example.com</AllowedOrigin> <AllowedMethod>PUT</AllowedMethod> <AllowedMethod>POST</AllowedMethod> <AllowedMethod>DELETE</AllowedMethod> <AllowedHeader>*</AllowedHeader> <MaxAgeSeconds>3000</MaxAgeSeconds> </CORSRule> <CORSRule> <AllowedOrigin>*</AllowedOrigin> <AllowedMethod>GET</AllowedMethod> <AllowedHeader>*</AllowedHeader> <MaxAgeSeconds>3000</MaxAgeSeconds> </CORSRule> </CORSConfiguration>";
        MessageDigest md = MessageDigest.getInstance("MD5");
        // Use an explicit charset so the digest does not depend on the platform default.
        md.update(s.getBytes(StandardCharsets.UTF_8));
        byte[] digest = md.digest();
        /*StringBuffer sb = new StringBuffer();
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        System.out.println(sb.toString());*/
        // Base64-encode the raw digest bytes.
        byte[] bytes = Base64.encodeBase64(digest);
        String finalString = new String(bytes, StandardCharsets.US_ASCII);
        System.out.println(finalString);
    }
}
The commented-out code is where most people get it wrong: they convert the digest to hex instead of Base64-encoding it.
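To see the difference between the two encodings, here is a short Python sketch that prints the same MD5 digest both ways. The helper name md5_hex_and_base64 is made up for this example. As far as I know, the hex form is what an S3 ETag or md5sum shows, while the Base64 form of the raw 16 bytes is what a Content-MD5 header expects.

```python
import base64
import hashlib

def md5_hex_and_base64(payload):
    """Return the MD5 of payload as (hex string, Base64 string)."""
    digest = hashlib.md5(payload).digest()  # raw 16 bytes
    return digest.hex(), base64.b64encode(digest).decode('ascii')

hex_form, b64_form = md5_hex_and_base64(b'')
print(hex_form)  # d41d8cd98f00b204e9800998ecf8427e
print(b64_form)  # 1B2M2Y8AsgTpgAmY7PhCfg==
```

Both strings describe the same 16 bytes; Base64-encoding the hex string instead of the raw digest is the other common mistake.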