
Tradeoffs of building on-device ML models

Posted on October 14th, 2019 by Furkan

As smartphones get faster, more machine learning workloads can be moved to users' devices instead of living in the cloud. This post explores the tradeoffs of building such models, along with a case study of a real-world application that makes use of them.

tl;dr: ML on the edge has advantages with respect to privacy, cost, speed, offline use and availability. There are also many challenges such as model size limits, debuggability and harder new model validation.



Privacy

As we have seen from the many data leaks, securing user data is difficult. If the user's data never leaves their device, that's one less thing to worry about.


Cost

Running ML models in the cloud can end up costing quite a bit of money. If you want reasonable performance, you'll look into hardware acceleration such as GPUs. Currently, the cheapest GPU instance on AWS is a g4dn.xlarge, which costs $0.526/hour. Multiply that by 24 hours/day * 30 days/month and you end up with a monthly cost of $378.72 (a yearly cost of $4,544.64), and that is not accounting for bandwidth, maintenance, etc. Your cost also grows with your userbase: the more people use the product, the more servers you need.
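The arithmetic is simple enough to sketch (rates are the post's example figures, not live AWS pricing):

```javascript
// Back-of-the-envelope cloud GPU cost for one always-on instance.
const hourlyRate = 0.526;                 // USD/hour for a g4dn.xlarge
const monthlyCost = hourlyRate * 24 * 30; // 378.72 USD/month
const yearlyCost = monthlyCost * 12;      // 4544.64 USD/year

console.log(`$${monthlyCost.toFixed(2)}/month, $${yearlyCost.toFixed(2)}/year`);
```

And that is the floor: each additional instance your user growth requires adds the same amount again.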


Speed

Depending on your use case and model size, there can be significant speed improvements from putting the computation on the user's device. For example, if your model operates on images and lives in the cloud, the user has to upload their image every single time, and they get results at the mercy of their internet speed at that moment, which can be highly variable.

Offline use and availability

It goes without saying that if you don't have internet access, you can't use models living in the cloud. There are many scenarios where that can happen: you might be on an airplane, out in the wilderness, or in a country where mobile internet is slow and you rely on wifi to download an application for later use. Availability is another issue. Imagine that a certain request crashes your servers; you would then be unable to process requests from any other user, whereas if a model fails on certain devices, only users on those devices would be affected.


Model size limits

Many interesting model architectures can easily be GBs in size and require very powerful accelerators for inference. Obviously, these models won't shrink to something small enough to be shipped over the network and executed on a smartphone GPU. This can be a limiting factor in what can be achieved.

That said, there are many tricks that can be employed to get a reasonable model. The way I like to think about it: the model needs to be "good enough", not perfect; we are not chasing an x% improvement over state-of-the-art performance. Some of those tricks are quantization, distillation, and efficient building blocks like depthwise convolutions.
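To make the quantization trick concrete, here is a minimal sketch (not any particular library's implementation) of post-training 8-bit weight quantization: each float32 weight is mapped to a uint8 value with a per-tensor scale, cutting download size roughly 4x at the cost of a little precision.

```javascript
// Quantize float32 weights to 8-bit integers with a per-tensor
// scale and minimum (a simple affine quantization scheme).
function quantizeUint8(weights) {
  const min = Math.min(...weights);
  const max = Math.max(...weights);
  const scale = (max - min) / 255 || 1; // avoid division by zero
  const quantized = weights.map(w => Math.round((w - min) / scale));
  return { quantized, scale, min };
}

// Recover approximate float weights at load time on the device.
function dequantizeUint8({ quantized, scale, min }) {
  return quantized.map(q => q * scale + min);
}
```

The round trip loses at most one quantization step per weight, which many over-parameterized networks tolerate with little accuracy drop.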


Debuggability

It's difficult to debug a model when things go wrong with data you don't have access to. It's even more difficult to know when things go wrong, since you can't simply have someone manually look at the data and label it. You can develop proxy metrics to judge error rates: for example, if you're building a swipe-based keyboard, you could count the number of times a user had to delete a word and manually type it out. You could also explicitly ask a subset of your users if they are willing to share their data, especially if they are outliers with respect to your usual data distribution.
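The keyboard proxy metric could be sketched like this (the event names are assumptions for illustration, not a real keyboard API):

```javascript
// Proxy error rate for a swipe keyboard: the fraction of swiped
// words the user deleted and then typed out manually.
function swipeErrorRate(events) {
  let swiped = 0;
  let corrected = 0;
  for (let i = 0; i < events.length; i++) {
    if (events[i].type === 'swipe_word') {
      swiped++;
      // a delete immediately followed by manual typing counts as a miss
      if (events[i + 1]?.type === 'delete_word' &&
          events[i + 2]?.type === 'manual_type') {
        corrected++;
      }
    }
  }
  return swiped === 0 ? 0 : corrected / swiped;
}
```

Only the aggregate rate needs to leave the device, so the metric preserves the privacy advantage while still flagging regressions.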

Harder new model validation

In line with the previous point, it's more difficult to iterate on new models without real-world data to validate on. You can validate on your offline datasets with the understanding that they will miss real-world scenarios. Another possibility is to shadow launch your new model and watch how your result distribution shifts and whether any user complaints come in.
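One way to quantify "how your result distribution shifts" during a shadow launch is to compare the label distributions of the old and new models with a simple distance such as total variation (a sketch, assuming categorical outputs):

```javascript
// Turn a list of predicted labels into a probability distribution.
function labelDistribution(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  const dist = {};
  for (const l in counts) dist[l] = counts[l] / labels.length;
  return dist;
}

// Total variation distance between two label distributions:
// half the sum of absolute probability differences, in [0, 1].
function totalVariation(p, q) {
  const labels = new Set([...Object.keys(p), ...Object.keys(q)]);
  let tv = 0;
  for (const l of labels) tv += Math.abs((p[l] || 0) - (q[l] || 0));
  return tv / 2;
}
```

A distance near 0 suggests the new model behaves like the old one on real traffic; a large jump is a signal to investigate before the full launch.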

Real-world showcase: FaceShape

What is FaceShape?

It's a very simple app that tells users their face shape from a single front-facing picture. It runs directly within the browser.

Why is FaceShape useful?

It's useful for determining the ideal hairstyle, glasses, and makeup for you. Your face shape has a big effect on what will or won't work for you, and it's difficult to look at oneself objectively to determine it.

How does it work?

FaceShape combines six lightweight models to produce a final result. It uses the amazing face-api.js library by justadudewhohacks to:

  • get face bounding box using ssdMobilenetv1 model,
  • get facial landmarks using faceLandmark68Net inside the bounding box,
  • get face embeddings using faceRecognitionNet with weights coming from davisking's dlib.

Three lightweight classifiers are trained on top of the facial landmarks and face embeddings to determine jaw type (angular or not), face length (long or not), and finally the actual shape. The classifiers are built with Keras and served using TensorFlow.js. The models achieve an accuracy of 70% for their first choice and about 80-85% when considering their top 2 results. All in all, the models weigh a total of about 9 megabytes, which is roughly the size of a few high-definition pictures.
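The whole pipeline can be sketched as one composition. The three face-api.js stages are represented here as injected functions (the names and signatures below are illustrative assumptions, not the real face-api.js API), so only the data flow is shown:

```javascript
// Sketch of the FaceShape pipeline: detection -> landmarks ->
// embeddings -> three lightweight classifiers.
async function classifyFaceShape(image, { detectFace, getLandmarks, getEmbedding, classifiers }) {
  const box = await detectFace(image);              // ssdMobilenetv1
  if (!box) return null;                            // no face found
  const landmarks = await getLandmarks(image, box); // faceLandmark68Net
  const embedding = await getEmbedding(image, box); // faceRecognitionNet
  // the three classifiers run on landmarks + embeddings
  return {
    jaw: classifiers.jaw(landmarks, embedding),       // angular or not
    length: classifiers.length(landmarks, embedding), // long or not
    shape: classifiers.shape(landmarks, embedding),   // final face shape
  };
}
```

Because every stage runs in the browser, the picture never leaves the user's device, which is exactly the privacy advantage discussed earlier.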

In addition to on-device models, FaceShape is built on a Jamstack architecture, which makes maintenance very easy.

If there's a topic in this post you'd like to see expanded in more detail, please let me know, either on Twitter or by email <> at gmail.

Thank you for reading this post, I hope you find it useful!