Quantization comes in many different forms. TensorFlow Lite provides optimized kernels for 8-bit unsigned integer (uint8) quantization. This specific form of evaluation is not directly supported in TensorFlow right now (though TensorFlow can train such a model). We will be releasing training scripts that show how to set up such models for evaluation.
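To make "8-bit uint quantization" concrete, here is a minimal sketch of the usual affine scheme, in which a real value is approximated as scale * (q - zero_point) with q in [0, 255]. The function names and the way the range is chosen are assumptions for illustration only; this is not the TensorFlow Lite kernel code.

```python
def choose_quant_params(rmin, rmax, qmin=0, qmax=255):
    """Pick a scale and zero point for the real range [rmin, rmax] (hypothetical helper)."""
    # Widen the range so that 0.0 is always exactly representable.
    rmin = min(rmin, 0.0)
    rmax = max(rmax, 0.0)
    if rmax == rmin:
        return 1.0, 0
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    # Keep the zero point inside the uint8 range.
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map real values to clamped uint8 codes."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map uint8 codes back to approximate real values."""
    return [scale * (q - zero_point) for q in qvalues]


if __name__ == "__main__":
    scale, zero_point = choose_quant_params(-1.0, 1.0)
    codes = quantize([-0.5, 0.0, 0.25, 0.9], scale, zero_point)
    print(codes, dequantize(codes, scale, zero_point))
```

Each value inside the chosen range comes back with an error of at most half a quantization step, and 0.0 round-trips exactly, which matters for things like zero padding.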
This should be possible, but we haven't tried it. We will likely add a simplified build target with minimal dependencies (for example, no Eigen) so that it can be built on simple platforms.
The quantization is done with a special quantization-aware training script. We will soon open source a quantized MobileNet training script to show how to do this.
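As a rough illustration of what "quantization aware" typically means: during training, the forward pass rounds values to the levels the 8-bit model will be able to represent, so the network learns weights that tolerate that rounding. The sketch below is a simplified quantize-then-dequantize step under that assumption; `fake_quantize` is a hypothetical name, and this is not the training script mentioned above.

```python
def fake_quantize(values, rmin, rmax, num_levels=256):
    """Snap each value to the nearest of num_levels evenly spaced points in [rmin, rmax].

    Simplified illustration: the output stays floating point, but it only takes
    values that an 8-bit representation of the range could hold.
    """
    # Make sure the range contains 0.0.
    rmin = min(rmin, 0.0)
    rmax = max(rmax, 0.0)
    if rmax == rmin:
        return list(values)
    scale = (rmax - rmin) / (num_levels - 1)
    out = []
    for v in values:
        clamped = max(rmin, min(rmax, v))        # saturate out-of-range values
        steps = round((clamped - rmin) / scale)  # nearest quantization level
        out.append(rmin + steps * scale)
    return out


if __name__ == "__main__":
    print(fake_quantize([0.0, 0.5, 2.0], 0.0, 1.0))
```

In a real training setup the backward pass usually treats this rounding as the identity inside the range, so gradients still flow; that part is omitted here.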
We developed TensorFlow Lite to be small enough to target really small devices that lack MMUs, such as the ARM Cortex-M MCU series, but we haven’t done the actual work to target those devices. That being said, we are excited when the ecosystem and community around machine learning expands.
The main TensorFlow interpreter provides a lot of functionality for larger machines like servers (e.g. desktop GPU support and distributed support). Of course, TensorFlow Lite does run on standard PCs and servers, so using it on non-mobile/small devices is possible. If you wanted to create a very small microservice, TensorFlow Lite would likely work, and we’d love to hear about your experiences if you try this.
Thanks for the answer. Currently I’m using AWS Lambda to deploy my TensorFlow models, but it’s pretty hard and hacky. I need to remove a considerable portion of the code base that is not needed for inference-only routines. I do that so the code loads faster and fits within the deployment package size limit.
If TensorFlow Lite is already a compact code base, then it may be much easier to deploy it to a serverless environment.
I’ll be trying it in my next deployments.
TensorFlow Lite is an interpreter, in contrast with XLA, which is a compiler. The advantage of TensorFlow Lite is that a single interpreter can handle several models rather than needing specialized code for each model and each target platform. TensorFlow Lite’s core kernels have also been hand-optimized for common machine learning patterns. The advantage of compiler approaches is that they fuse many operations to reduce memory bandwidth use (and thus improve speed). TensorFlow Lite fuses many common patterns in the TensorFlow Lite converter. We are of course excited about using JIT techniques and XLA technology within the TensorFlow Lite interpreter, or as part of the TensorFlow Lite converter, as a possible future direction.
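A toy sketch may help show the two ideas in this answer: one generic interpreter loop that can run any model expressed as a list of ops, and a converter-style pass that fuses a common pattern ahead of time so the interpreter makes fewer passes over memory. All names here (`KERNELS`, `run`, `fuse_add_relu`, the op names) are made up for illustration and do not reflect the actual TensorFlow Lite interpreter or converter.

```python
def op_add(x, bias):
    # Writes a full intermediate buffer.
    return [v + bias for v in x]


def op_relu(x):
    # Reads that buffer back and writes another one.
    return [max(0.0, v) for v in x]


def op_add_relu(x, bias):
    # Fused kernel: one pass over the data, no intermediate buffer.
    return [max(0.0, v + bias) for v in x]


# Kernel table shared by every model the interpreter runs.
KERNELS = {"ADD": op_add, "RELU": op_relu, "ADD_RELU": op_add_relu}


def run(model, x):
    """Interpret a model given as a list of (op_name, args) tuples."""
    for name, args in model:
        x = KERNELS[name](x, *args)
    return x


def fuse_add_relu(model):
    """Converter-style pass: rewrite each ADD followed by RELU as one ADD_RELU op."""
    fused = []
    i = 0
    while i < len(model):
        name, args = model[i]
        if name == "ADD" and i + 1 < len(model) and model[i + 1][0] == "RELU":
            fused.append(("ADD_RELU", args))
            i += 2
        else:
            fused.append((name, args))
            i += 1
    return fused


if __name__ == "__main__":
    model = [("ADD", (1.0,)), ("RELU", ()), ("ADD", (-3.0,)), ("RELU", ())]
    print(run(model, [-2.0, 0.5, 4.0]))
    print(run(fuse_add_relu(model), [-2.0, 0.5, 4.0]))
```

The fused model produces the same result with half as many ops and no intermediate lists. A compiler can find such fusions for arbitrary graphs, whereas this style of interpreter relies on a fixed set of patterns being recognized at conversion time.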