TensorFlow Lite quantization fails to improve inference latency
The TensorFlow website claims that quantization provides up to 3x lower latency on mobile devices: https://www.tensorflow.org/lite/performance/post_training_quantization

I tried to verify this claim and found that quantized models are 45%–75% SLOWER than float models, despite being almost 4x smaller in size. Needless to say, this is very disappointing and conflicts with Google's claims.

My test uses Google's official MnasNet model: https://storage.googleapis.com/mnasnet/checkpoints/mnasnet-a1.tar.gz

Here is the average latency based on 100 inference operations on a freshly rebooted phone:
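In case it helps reproduce the issue, below is a minimal sketch of an equivalent host-side comparison using the TFLite Python interpreter. The SavedModel path is hypothetical (the checkpoint tarball would first need to be exported to a SavedModel), and my actual numbers above were measured on-device, so this is not the exact harness:

```python
# Sketch of the float-vs-quantized latency comparison (hypothetical paths;
# assumes a TF2 SavedModel export of MnasNet, not the raw checkpoint tarball).
import time

import numpy as np
import tensorflow as tf

SAVED_MODEL_DIR = "mnasnet_saved_model"  # hypothetical export location

# Plain float conversion.
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
float_model = converter.convert()

# Post-training quantization via Optimize.DEFAULT, as described on the
# linked TensorFlow page.
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quant_model = converter.convert()

def mean_latency_ms(tflite_model: bytes, runs: int = 100) -> float:
    """Average invoke() latency over `runs` inferences on random input."""
    interpreter = tf.lite.Interpreter(model_content=tflite_model)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    data = np.random.random_sample(inp["shape"]).astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], data)
    interpreter.invoke()  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], data)
        interpreter.invoke()
    return (time.perf_counter() - start) / runs * 1000.0

print(f"float: {mean_latency_ms(float_model):.2f} ms")
print(f"quant: {mean_latency_ms(quant_model):.2f} ms")
```

Note that `Optimize.DEFAULT` performs dynamic-range quantization (int8 weights, float inputs/outputs), which is consistent with the ~4x size reduction I observed.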