How is the checksum calculated in the blobs table for Rails ActiveStorage?

Posted by 时间秒杀一切 on 2019-12-02 06:57:56

Let's Break It Down

I know I'm a bit late to the party, but this is for those who come across this question while searching for answers. So here it is.

Background:

Rails introduced loads of new features in version 5.2, one of which was ActiveStorage. The official final release came out on April 9th, 2018.

Disclaimer:

To be perfectly clear, the following information pertains to out-of-the-box, vanilla Active Storage. It doesn't account for custom code written around a one-off scenario.

With that said, the checksum is calculated differently depending on your Active Storage setup. With the vanilla out-of-the-box Rails Active Storage, there are 2 "types" (for lack of a better term) of configuration.

  1. Proxy Uploads
  2. Direct Uploads

Proxy Uploads

File Upload Flow: [Client] → [RoR App] → [Storage Service]

Comm. Flow: Can vary but in most cases it should be similar to File upload flow.

Spark.Bao's answer describes a "Proxy Upload": you upload the file to your RoR application, which can process it before sending it on to your configured storage service (AWS, Azure, Google, Backblaze, etc.). The same logic applies even if you use the local Disk service, where your RoR application itself is the storage endpoint.

A "Proxy Upload" approach isn't ideal for RoR applications deployed on cloud platforms like Heroku. Heroku has a hard limit of 30 seconds to complete the request and send a response back to your client (end user). So if your file is fairly large, you need to account for both the time it takes to upload and the time it takes to calculate the checksum. If you're caught in a scenario where you can't respond within those 30 seconds, you will need to use the "Direct Upload" approach.

Proxy Uploads Answer:

The Ruby class Digest::MD5 is used in the method compute_checksum_in_chunks(io) as pointed out by Spark.Bao.


Direct Uploads

File Upload Flow: [Client] → [Storage Service]

Comm. Flow: [Client] → [RoR App] → [Client] → [Storage Service] → [Client] → [RoR App] → [Client]

Our fine friends who maintain and develop Rails have already done all the heavy lifting for us. I won't go into detail on how to set up a direct upload; see the Rails Edge Guide section on Direct Uploads.

Direct Uploads Answer:

Now with all that said, with a vanilla out-of-the-box "Direct Uploads" setup, a file checksum is calculated by leveraging SparkMD5 (JavaScript).

Below is a snippet from the Rails Active Storage source code (activestorage.js):

  var fileSlice = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice;
  var FileChecksum = function() {
    createClass(FileChecksum, null, [ {
      key: "create",
      value: function create(file, callback) {
        var instance = new FileChecksum(file);
        instance.create(callback);
      }
    } ]);
    function FileChecksum(file) {
      classCallCheck(this, FileChecksum);
      this.file = file;
      this.chunkSize = 2097152;
      this.chunkCount = Math.ceil(this.file.size / this.chunkSize);
      this.chunkIndex = 0;
    }
    createClass(FileChecksum, [ {
      key: "create",
      value: function create(callback) {
        var _this = this;
        this.callback = callback;
        this.md5Buffer = new sparkMd5.ArrayBuffer();
        this.fileReader = new FileReader();
        this.fileReader.addEventListener("load", function(event) {
          return _this.fileReaderDidLoad(event);
        });
        this.fileReader.addEventListener("error", function(event) {
          return _this.fileReaderDidError(event);
        });
        this.readNextChunk();
      }
    },
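
For comparison, here is a rough Ruby sketch of the same chunked approach the JavaScript excerpt above takes. It is not Rails code: chunked_base64_md5 is a hypothetical helper, and the sample payload is made up. It feeds the data into MD5 in 2 MiB chunks (the chunkSize from the excerpt) and in 5 MiB chunks, and suggests why a direct upload and a proxy upload should end up with the same checksum for the same bytes: the chunk size only controls memory use, not the resulting digest.

```ruby
require "digest"
require "stringio"

# Chunk size used by the JavaScript FileChecksum class quoted above (2 MiB).
JS_CHUNK_SIZE = 2_097_152

# Hypothetical helper (not part of Rails): feeds an IO into MD5 chunk by
# chunk and returns the Base64-encoded digest.
def chunked_base64_md5(io, chunk_size)
  md5 = Digest::MD5.new
  while (chunk = io.read(chunk_size)) # read returns nil at end of input
    md5 << chunk
  end
  io.rewind
  md5.base64digest
end

# Made-up sample payload, a little over 5 MiB.
data = "a" * (5 * 1024 * 1024 + 123)

# Same formula as chunkCount in the JavaScript excerpt.
chunk_count = (data.bytesize / JS_CHUNK_SIZE.to_f).ceil

js_style   = chunked_base64_md5(StringIO.new(data), JS_CHUNK_SIZE)
ruby_style = chunked_base64_md5(StringIO.new(data), 5 * 1024 * 1024)

puts "chunks read at 2 MiB each: #{chunk_count}"
puts "digests match: #{js_style == ruby_style}"
```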

Conclusion

If I missed anything, I apologize in advance. I tried to be as thorough as possible.

To sum things up, the following should suffice as an acceptable answer:

  • Proxy Upload Configuration: The ruby class Digest::MD5

  • Direct Upload Configuration: The JavaScript hash library SparkMD5.

It’s a base64-encoded MD5 digest of the blob’s data. I’m afraid Active Storage doesn’t support hexadecimal checksums like those emitted by md5(1). Sorry!
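
If you need to compare a stored checksum against hexadecimal output such as that of md5(1), you can convert between the two encodings yourself. The sketch below uses only Ruby's standard library; checksum_to_hex and hex_to_checksum are hypothetical helper names, not Active Storage methods, and the "hello" string merely stands in for a blob's data.

```ruby
require "digest"

# Hypothetical helper: Base64 checksum (as stored in the blobs table)
# to a hexadecimal digest string.
def checksum_to_hex(base64_checksum)
  raw = base64_checksum.unpack1("m") # "m" decodes Base64 into raw bytes
  raw.unpack1("H*")                  # "H*" renders the bytes as hex
end

# Hypothetical helper: hexadecimal digest string back to the Base64 form.
def hex_to_checksum(hex_digest)
  raw = [hex_digest].pack("H*")      # hex back to raw bytes
  [raw].pack("m0")                   # "m0" encodes Base64 without newlines
end

stored = Digest::MD5.base64digest("hello") # stand-in for blob.checksum
hex    = checksum_to_hex(stored)

puts hex
puts hex_to_checksum(hex) == stored
```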

Spark.Bao

The source code is here: https://github.com/rails/rails/blob/master/activestorage/app/models/active_storage/blob.rb#L234

def compute_checksum_in_chunks(io)
  Digest::MD5.new.tap do |checksum|
    while chunk = io.read(5.megabytes)
      checksum << chunk
    end

    io.rewind
  end.base64digest
end

In my project, I need to use this checksum value to determine whether a user has uploaded a duplicate file. I use the following code to get the same value as the method above:

md5 = Digest::MD5.file(params[:file].tempfile.path).base64digest
puts "========= md5: #{md5}"

The output:

========= md5: F/9Inmc4zdQqpeSS2ZZGug==

Database data:

pry(main)> ActiveStorage::Blob.find_by(checksum: 'F/9Inmc4zdQqpeSS2ZZGug==')
  ActiveStorage::Blob Load (2.7ms)  SELECT  "active_storage_blobs".* FROM "active_storage_blobs" WHERE "active_storage_blobs"."checksum" = $1 LIMIT $2  [["checksum", "F/9Inmc4zdQqpeSS2ZZGug=="], ["LIMIT", 1]]
=> #<ActiveStorage::Blob:0x00007f9a16729a90
id: 1,
key: "gpN2NSgfimVP8VwzHwQXs1cB",
filename: "15 Celebrate.mp3",
content_type: "audio/mpeg",
metadata: {"identified"=>true, "analyzed"=>true},
byte_size: 9204528,
checksum: "F/9Inmc4zdQqpeSS2ZZGug==",
created_at: Thu, 29 Nov 2018 01:38:15 UTC +00:00>
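
To check outside a Rails app that the two approaches agree, here is a small sketch. It copies the logic of compute_checksum_in_chunks but replaces ActiveSupport's 5.megabytes with plain arithmetic, and it uses a Tempfile with made-up contents as a stand-in for params[:file].tempfile.

```ruby
require "digest"
require "tempfile"

# Same logic as compute_checksum_in_chunks above, with 5.megabytes
# (an ActiveSupport extension) written out as plain arithmetic.
def compute_checksum_in_chunks(io)
  Digest::MD5.new.tap do |checksum|
    while (chunk = io.read(5 * 1024 * 1024))
      checksum << chunk
    end

    io.rewind
  end.base64digest
end

# Stand-in for the uploaded file; the contents are made up.
file = Tempfile.new("upload")
file.binmode
file.write("sample upload body " * 1000)
file.flush
file.rewind

from_chunks = compute_checksum_in_chunks(file)
from_path   = Digest::MD5.file(file.path).base64digest

puts "checksums match: #{from_chunks == from_path}"

file.close! # close and delete the temporary file
```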