Deep Learning on Mobile Devices

Written by Eric Han on March 17, 2018

1. Abstract

Deep learning is a class of machine learning algorithms that attempts

to model high-level abstractions in data. It is currently a hot machine-learning topic

due to its ability to function like the brain in many respects, especially with regard to the

processing of large amounts of multi-dimensional data. On mobile devices, deep learning

has been used for several features, most often involving image classification. Aside from

photo and face recognition, I believe that deep learning has huge potential to add

value to our mobile devices. In this report, I will discuss what exactly deep

learning is, compare deep learning to traditional algorithms, look at cases

where deep learning would perform better than traditional algorithms, review recent

improvements in deep convolutional neural networks, and examine potential

opportunities for deep learning in mobile technologies.


2. Table of Contents

   1. Abstract

   2. Table of Contents

   3. Introduction

   4. Comparison to Traditional Algorithms

   5. Challenges

   6. Deep Learning on Mobile Platforms

   7. Discussion

   8. Conclusion

   9. Bibliography


3. Introduction

Deep Learning has been defined as a class of algorithms that use “a cascade of

many layers of nonlinear processing units for feature extraction, and transformation. Each

successive layer uses the output from the previous layer as input. The algorithms may be

supervised or unsupervised and applications include pattern analysis and classification.

They are based on the unsupervised learning of multiple layers of features or

representations of data. Higher-level features are derived from lower level features to

form a hierarchical representation. They are part of the broader machine-learning field of

learning representations from data. They learn multiple levels of representations that

correspond to different levels of abstraction; the levels form a hierarchy of concepts.”


Deep Learning has mainly been used, in the form of a deep convolutional neural

network, to implement image classification on mobile devices. In image classification,

the convolutional neural network behaves like a multi-layer perceptron model. A deep

convolutional neural network takes a raw image in as input and extracts features from

that image. If the input image was a human face for the task of facial recognition, the

model might recognize edges in the face in the first layer of the model. Then, in the later

layers, these edges might be used to form the jawline, nose, lips, eyes, eyebrows, etc. This

step of using basic input features to extract higher-level features from the input image, or

to assign a label to input features, is the second step of deep convolutional neural

networks. A good feature vector representation is needed for good performance in this

aspect. In the final layers of the model, all these features are used to reconstruct the

human face. If the pattern matches the face we are searching for, the model returns a

positive value for classification; if it doesn’t match, it returns a negative value.

There are several ways to train a model in a deep convolutional network. For

image classification, stochastic gradient descent (SGD) is often used. A typical SGD update rule

includes weight decay, and the gradient direction is the average over each batch at each

neuron. When the validation error stops improving, the learning rate is

manually adjusted to a smaller value. For weight initialization, a zero-mean Gaussian

distribution is used. Depending on the layer, the bias can be set to either one or zero.
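
As a rough illustration, the training recipe described above could be sketched as follows. This is a minimal sketch in plain Python, not the implementation from any cited work: the function names, the standard deviation of 0.01, and the learning-rate reduction factor of 0.1 are all assumptions chosen for the example.

```python
import random


def init_weights(n, std=0.01, seed=0):
    """Draw n weights from a zero-mean Gaussian (std is an assumed value)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]


def sgd_step(weights, batch_grads, lr, weight_decay):
    """One SGD update: average per-example gradients, then add weight decay."""
    n = len(batch_grads)
    new_weights = []
    for i, w in enumerate(weights):
        mean_grad = sum(g[i] for g in batch_grads) / n
        new_weights.append(w - lr * (mean_grad + weight_decay * w))
    return new_weights


def adjust_lr(lr, val_errors, factor=0.1):
    """Shrink the learning rate if the latest validation error did not improve."""
    if len(val_errors) >= 2 and val_errors[-1] >= min(val_errors[:-1]):
        return lr * factor
    return lr


# Toy usage: two per-example gradients for a single weight.
weights = init_weights(1)
weights = sgd_step(weights, [[1.0], [3.0]], lr=0.1, weight_decay=0.0005)
lr = adjust_lr(0.01, [0.50, 0.40, 0.40])
```

In practice the learning rate was adjusted by hand, as described above; `adjust_lr` only automates the same rule of thumb.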

There are several known ways to improve the performance of a deep

convolutional neural network, which behaves in many ways like a human baby when

he/she first learns to see. The convolutional neural network behaves like the human visual

cortex in how it receives input, processes it, and finally recognizes or identifies it as

something it has seen before. Pre-training the network can often yield better results, like

how humans learn to recognize objects better as they see more instances of

the same type of object. In addition, adding more convolutional layers can also yield

better performance. We will discuss in later sections how researchers were able to

decrease the size and number of computations needed to run a deep convolutional neural

network, so that these neural networks could be run on mobile platforms.


4. Comparison to Traditional Algorithms

Traditional algorithms refer to any type of algorithm that doesn’t use any type of

deep learning technique, or deep convolutional neural networks. In most traditional

algorithms, a vector representation can be used to train the classification of features in the

raw image. Fisher vectors are often used in this classification step. Then, by applying a

weight to each vector, like in a perceptron model, a weighted score is calculated to use in

classification of the image.
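
The weighted-score step can be illustrated with a few lines of Python. This is only a sketch of a perceptron-style decision, assuming the hand-designed descriptor has already been reduced to a list of numbers; the function names and the zero decision threshold are assumptions made for the example.

```python
def weighted_score(features, weights, bias=0.0):
    """Weighted sum of a feature vector, as in a perceptron."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(features, weights, bias=0.0):
    """Return +1 if the weighted score is positive, otherwise -1."""
    return 1 if weighted_score(features, weights, bias) > 0 else -1


# Toy usage with a two-element feature vector.
label = classify([1.0, 2.0], [0.5, 0.25])
```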

There are several systems of feature descriptors, including SIFT, CSIFT, LBP,

and GIST. These feature descriptors essentially capture local features independent of

rotation and scale. Extracted features are usually corners or edges. These local features

are then combined together to recognize larger objects. Humans often design the

algorithms that detect these local features that are later used in machine learning for

classification. This is perhaps the biggest difference between deep convolutional neural

networks and traditional algorithms. In deep convolutional networks, machine learning is

used to discover the algorithm that captures features, whereas in traditional algorithms

humans define the feature-capturing algorithm. As a result, the deep learning algorithm

can update its feature representation more effectively than a human can by utilizing an

increasing set of image data.

Deep convolutional networks perform better

than traditional algorithms largely because deep learning performance keeps

improving as additional training data is used for the model. Traditional algorithms have

asymptotic performance: they plateau once a threshold amount of data is reached. In

addition, deep convolutional neural networks are not limited by feature descriptors, since

the feature descriptors defined by humans are often customized to recognize features in a

specific type of image, such as nature, buildings, or humans. Deep convolutional

networks can always learn features regardless of image type and then obtain useful

information from input data that can then be used for feature classification. As a result,

deep convolutional networks would be expected to perform better in nearly all image

classification cases in comparison to conventional algorithms.


5. Deep Learning on Mobile Platforms

Two constraints that must be dealt with when we are dealing with deep learning

on mobile platforms are limited computational power and energy usage. In the last few

years, companies such as Qualcomm and NVIDIA have started designing mobile

hardware that has built-in deep neural network support. Qualcomm’s mobile

Snapdragon processor has a Digital Signal Processor (DSP) that enables low-power small

neural networks to run. Additionally, researchers have investigated FPGAs and RNNs as ways to speed up

training. If technology continues to advance as

rapidly as it has in recent decades, we can expect to have mobile devices in the future that

are capable and powerful enough to train deep neural networks that currently are only

able to run on multiple desktop GPUs.

Currently, on mobile platforms, initial neural network training is usually done

beforehand on dedicated hardware for pre-training, and then it is uploaded to the mobile

device for model improvement. An example of this could be fingerprint recognition. The

neural network is initially trained with large amounts of fingerprint data. Then, when it is

finally uploaded onto the mobile device, the neural network learns to recognize the

owner’s fingerprint through additional weight adjustments. According to the article “Deep

Learning Neural Networks on Mobile Platforms,” four improvements can be made to a deep neural network:


1. Less important weights can be set to zero

2. Weights can be deduplicated

3. Lower precision arithmetic can be applied

4. Range-constrained topologies can be used
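
Two of the items in this list, zeroing less important weights and using lower-precision arithmetic, can be sketched in a few lines. This is a minimal sketch and not the cited article's method: it assumes that "less important" means small in magnitude and that lower precision means uniform quantization to signed integers with one shared scale. All function names are hypothetical.

```python
def prune_by_magnitude(weights, threshold):
    """Set weights whose magnitude is below the threshold to zero."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize_uniform(weights, bits):
    """Map weights to signed integer codes of the given bit width."""
    levels = 2 ** (bits - 1) - 1                    # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights) or 1.0   # avoid a zero scale
    scale = max_abs / levels
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point weights from integer codes."""
    return [c * scale for c in codes]


# Toy usage: prune, then store the survivors as 8-bit codes.
pruned = prune_by_magnitude([0.5, -0.01, 0.2, 0.03], threshold=0.05)
codes, scale = quantize_uniform(pruned, bits=8)
approx = dequantize(codes, scale)
```

The zeroed weights only save space if they are stored in a sparse format, and the integer codes only save energy if the hardware can compute on them directly.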

Cloud and local computing strategies can also be combined. Some of the heavy

initial computation workload can be offloaded to the cloud, and sent back to the mobile

device when it is done computing. It is important to note that this is only possible when

an Internet connection is available. Otherwise, only local training on the mobile device

would be feasible. Because battery power is such a significant concern on mobile

platforms, this strategy of utilizing the cloud to offload and pre-train is essential if we

want to implement deep learning locally on our mobile devices.

Improved algorithms are another area of hope for increased adoption of deep

learning on mobile devices. An improved algorithm can reduce the space, time, and

computational requirements needed to make the necessary calculations in training

convolutional neural networks. In fact, convolutional neural networks have recently been

combined with classification trees. On an analytical level, this allows for

fewer computations because this algorithm takes advantage of the tree

structure, where branching nodes can reduce the computations needed to move

forward. In addition, only the children of a branch node would need to be calculated, and

not the whole layer itself, which lends itself very well to mobile platforms.

Improved weight initialization and faster stochastic gradient descent can

also lead to general performance improvement. Recently, researchers discovered a

method to reduce computational and memory requirements for deep convolutional neural

networks by utilizing a three-stage method of pruning, weight quantization, and Huffman

coding. This method ultimately reduced the storage requirements of the neural network

by up to 49 times. The pruning process trims the network so that only

the truly important connections are learned. The weight quantization process enforced

weight sharing to reduce storage requirements. After the first two steps, the network is

retrained to fine tune the network and obtain “quantized centroids.” Lastly, Huffman

coding is applied. Pruning reduced space requirements by 9 to 13 times, while weight

quantization reduced the number of bits to represent each connection from 32 to 5 bits.

“On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x,

from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-

16 by 49x from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting

the model into on-chip SRAM cache rather than off-chip DRAM memory. Our

compression method also facilitates the use of complex neural networks in mobile

applications where application size and download bandwidth are constrained.

Benchmarked on CPU, GPU and mobile GPU, compressed network has 3x to 4x layer

wise speedup and 3x to 7x better energy efficiency.” (Han, Mao, Dally) This is the type of

research that will accelerate adoption rates of deep learning on mobile platforms.
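
To make the last two stages more concrete, here is a minimal sketch of weight sharing followed by Huffman coding. It is not the authors' implementation: it assumes a plain one-dimensional k-means with linearly spaced starting centroids for the weight sharing, and it skips the retraining step. The function names are hypothetical. For scale, the 5 bits per connection quoted above correspond to 2^5 = 32 shared values.

```python
import heapq
from collections import Counter


def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


def share_weights(weights, k, iters=20):
    """1-D k-means (k >= 2): return (centroids, centroid index per weight)."""
    lo, hi = min(weights), max(weights)
    # Linearly spaced starting centroids (an assumed initialization).
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        assign = [nearest(w, centroids) for w in weights]
        for j in range(k):
            members = [w for w, a in zip(weights, assign) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, [nearest(w, centroids) for w in weights]


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, unique tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Toy usage: six weights share three values, then the indices are coded.
weights = [0.1, 0.11, 0.09, -0.5, -0.52, 0.9]
centroids, indices = share_weights(weights, k=3)
lengths = huffman_code_lengths(indices)
total_bits = sum(lengths[i] for i in indices)
```

Each weight is then stored as a short index into the centroid table, and frequent indices get the shortest Huffman codes.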

Another group of researchers took a structured pruning approach to deep

convolutional neural networks to tackle the issue of high computational complexity and

frequent memory access. Pruning is a good approach; however, it often results in

inconsistent network connections that not only require extra representation efforts but

also don’t perform well for parallel computation. The group introduced structured

sparsity at various scales: kernel-wise, channel-wise, and intra-kernel-wise. They found

that this type of structured sparsity reduced computational resource

requirements. The researchers used a particle filtering

approach to gauge how important network connections and paths were. The weight of

each particle is determined by computing the misclassification rate with the associated

connection. The pruned network is then re-trained to compensate for the losses due to

pruning. While implementing convolutions as matrix products, the researchers were able

to show that intra-kernel strided sparsity with a simple constraint applied can reduce the

size of kernel and feature map matrices by a large amount. Finally, the pruned network is

fixed-point optimized with reduced word-length precision. (Anwar, Hwang, Sung) This

results in a significant reduction in the total storage size, providing benefits for on-chip

memory based and mobile platform implementations of deep neural networks.
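
Kernel-wise structured sparsity can be illustrated with a short sketch. Note the substitution: the researchers ranked connections with a particle filter, whereas this example ranks whole kernels by their L1 norm, a simpler stand-in chosen only to keep the example short. The layer layout `layer[o][i]` (output channel, input channel, then a 2-D kernel) and the function names are assumptions.

```python
def kernel_l1(kernel):
    """Sum of absolute values of a 2-D kernel."""
    return sum(abs(v) for row in kernel for v in row)


def prune_kernels(layer, keep_ratio):
    """Zero out whole kernels, keeping the fraction with the largest L1 norm."""
    scored = sorted(
        ((kernel_l1(k), o, i)
         for o, channel in enumerate(layer)
         for i, k in enumerate(channel)),
        reverse=True,
    )
    n_keep = max(1, int(len(scored) * keep_ratio))
    keep = {(o, i) for _, o, i in scored[:n_keep]}
    return [
        [k if (o, i) in keep else [[0.0] * len(row) for row in k]
         for i, k in enumerate(channel)]
        for o, channel in enumerate(layer)
    ]


# Toy usage: 2 output channels x 2 input channels of 2x2 kernels.
layer = [
    [[[1.0, 1.0], [1.0, 1.0]], [[0.1, 0.0], [0.0, 0.0]]],
    [[[0.0, 0.2], [0.0, 0.0]], [[2.0, 0.0], [0.0, 0.0]]],
]
pruned = prune_kernels(layer, keep_ratio=0.5)
```

Because entire kernels are removed, the remaining computation stays regular, which is what makes this kind of sparsity friendlier to parallel hardware than pruning individual weights.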

As neural networks become more complicated and the number of parameters

increases, it becomes increasingly necessary to limit neural network size. To improve

generalization performance and address the problem of overfitting, a strategy called

“Optimal Brain Damage” is utilized, which selectively deletes weights based on their

saliency. A simple approach to saliency is to rank the weights by magnitude. A more

advanced approach, which is the one taken in the research study, is to compute the

saliencies from the second derivatives of the objective function with respect to the

parameters. The objective function is approximated with a Taylor series and the

approximate saliencies are computed. The steps for “Optimal Brain Damage” are as follows:


1. Choose a reasonable network architecture.

2. Train the network until a reasonable solution is obtained.

3. Compute the second derivatives $h_{kk}$ for each parameter.

4. Compute the saliencies for each parameter: $s_k = h_{kk} u_k^2 / 2$.

5. Sort parameters by saliency and delete some low-saliency parameters.

6. Iterate to step 2. (Le Cun, Denker, Solla)
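
Steps 3 to 5 can be sketched on a toy objective. This is a minimal sketch under stated assumptions, not the paper's procedure: the second derivatives are estimated with central finite differences rather than computed by backpropagation, the quadratic `toy_objective` is made up for the example, and the retraining in steps 2 and 6 is omitted.

```python
def diag_second_derivatives(objective, params, eps=1e-3):
    """Estimate each h_kk with a central finite difference (an assumption)."""
    base = objective(params)
    h = []
    for k in range(len(params)):
        up = list(params)
        up[k] += eps
        down = list(params)
        down[k] -= eps
        h.append((objective(up) - 2.0 * base + objective(down)) / eps ** 2)
    return h


def saliencies(params, h):
    """Saliency of each parameter: s_k = h_kk * u_k^2 / 2."""
    return [hk * u * u / 2.0 for u, hk in zip(params, h)]


def delete_lowest(params, sal, n_delete):
    """Set the n_delete parameters with the lowest saliency to zero."""
    order = sorted(range(len(params)), key=lambda k: sal[k])
    pruned = list(params)
    for k in order[:n_delete]:
        pruned[k] = 0.0
    return pruned


def toy_objective(u):
    """Hypothetical objective with second derivatives 20, 0.2, and 2."""
    return 10.0 * u[0] ** 2 + 0.1 * u[1] ** 2 + 1.0 * u[2] ** 2


params = [0.3, 0.5, 1.0]
h = diag_second_derivatives(toy_objective, params)
sal = saliencies(params, h)
pruned = delete_lowest(params, sal, n_delete=1)
```

In this toy case the parameter with the lowest saliency is the second one, even though the first has the smallest magnitude, which is the difference between saliency-based and magnitude-based deletion.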

After applying this technique to a neural network, the researchers were able to reduce the

number of parameters by a factor of four. Neural network speed increased significantly

and recognition accuracy improved as well. This approach reaffirms the “Occam’s Razor”

principle, which states that the least complex explanation for data should be utilized

whenever possible. By reducing the number of parameters, and using a simpler network,

redundancies in data are removed and better generalization performance is obtained.


5. Challenges

Due to the nature of deep convolutional neural networks, large amounts of data

are required for optimal performance. Pre-training the deep convolutional networks

usually takes two steps. In the first step, a large amount of unlabeled data is used for

unsupervised learning. Then, to make final adjustments to hone the network to the

specific task, a smaller amount of labeled data is used for the second and final step of

pre-training. The large amount of data required for deep learning is one challenge that it

currently faces. However, with the amount of available data continually

growing, this should not remain a challenge for very much longer.

Another challenge for deep convolutional neural networks is the fact that complex

problems need larger networks to be able to perform optimally. The number of layers in

the neural network increases as well as the number of parameters for these complex

problems. The increase in the number of computations needed increases power, hardware,

and time costs. Because of this need for additional hardware, many complex neural

networks are currently run on distributed hardware. A different solution would be to

use improved hardware: the arrival of recent high-end graphics hardware, like the

NVIDIA Tegra K1, has enabled neural network training speeds to increase. In fact, Paul

Brasnett stated “Compared to mobile CPUs, PowerVR GPUs offer up to 3x higher

efficiency and up to 12x higher performance deployment for CNNs. Newer CNN

architectures with smaller fully connected layers help to make more efficient use of

compute resources.” (Brasnett) There are millions of PowerVR GPUs available on many

different SoCs (System on Chip) across different market segments. At the Embedded Vision Summit 2016, a

demo ran on a Google Nexus Player with an Intel Atom quad-core SoC, equipped

with a PowerVR G6430. The demo used a camera to identify the objects it was pointed at

and reported a confidence score for each identification. It is

important to note that this mobile GPU has several times the performance of a

traditional CPU running an identical network.

A last challenge for deep convolutional neural networks, and deep learning in

general, is that the reason it performs so well is not really

understood mathematically. The impact of the number of layers on final neural

network performance is not truly understood either. Finally, additional research is

needed to discover improved deep learning algorithms. If smaller models with lower

computation and space requirements could be used, it would be a great way to enhance

mobile platform based deep learning, as mobile platforms are inherently not as powerful

as desktop devices, and are limited by their battery power much of the time.


6. Deep Learning Frameworks and Suitability for Mobile Platforms

Caffe is a framework from the Berkeley Vision and Learning Center that

supports many CPUs and GPUs. Recently, it added support for the

NVIDIA Tegra K1. Torch is another framework recently ported to support mobile

devices. DeepLearning4J is another recent framework that attempts to solve the problem

of multi-platform support by utilizing Java for its libraries, which allows it to run on any

platform that can run Java. TensorFlow is Google’s open-source framework for deep

learning. It allows large-scale machine learning neural networks to be implemented on

distributed systems, including mobile devices.


7. Discussion

As we have discussed, the deep learning technique is superior to traditional

machine learning in nearly all aspects of image classification. Now, the question is whether or

not deep learning can be effectively ported to mobile devices, given the high computing

power needed and large datasets required for deep learning. In recent years, mobile

device GPU and CPU producers have in fact started producing units such as the NVIDIA

Tegra K1, PowerVR, and newer Snapdragon processors, which can handle running deep

learning neural networks. Currently, it is still too much work for a mobile device to fully

pre-train and train a deep learning model because of excessive computation, power, and

time requirements. However, it was successfully shown that a PowerVR chipset can handle

deep convolutional networks with ease, giving hope to the possibility of running deep

learning on mobile devices in the near future. I believe that in the coming decades, we

will see a boom in the number of devices that utilize deep learning or convolutional neural

networks, due to their superior performance compared to traditional machine learning.



8. Conclusion

With the large number of sensors that come on our mobile devices today, the

amount of data we are receiving, transmitting, and storing is continually increasing and

will likely increase even faster as we progress to the future. With camera, gyroscope,

thermometer, proximity, microphone, pressure, and other sensors giving us all types of data

about our environment and ourselves, we have a lot of untapped opportunity to leverage

deep learning to our advantage. Embedded vision is expanding into a wide range of uses,

including computational photography, gaming, VR/AR, robotics, smart cars, and drones.

“Jeff Dean from the Google Brain team pointed out that with neural networks, results get

better with more data, bigger models, and more computation.” (Dean) In fact, the latest

GoogLeNet Inception neural network performed better than a human in image

recognition. Google’s AlphaGo, which defeated a champion dan-ranked professional in a five-game match of

the Chinese game Go, is another example of where deep learning has excelled. I am

excited that deep learning has truly shown that machines can actually outperform humans

at nearly any task that requires a large amount of data, as long as the task can be learned.

I believe that the area of deep learning has just begun to blossom and show its true

potential. Perhaps in the future, there would be no more programmers, as machines could

learn how to program themselves; there would be no more architects, as

machines would learn how to design beautiful houses; there would be no more chefs, as

machines could learn to make new dishes that are delicious; there would be no more

teachers, as machines would know how to teach better than a professor; there would be

no real estate agents, as nearly all the paperwork and selling of the house could be

performed by a machine; there would be no more [occupations]...



9. Bibliography

1. “Deep Learning”, Wikipedia

2. “Deep Learning Neural Networks on Mobile Platforms”, Andreas Plienlinger

3. “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization, and Huffman Coding”, Song Han, Huizi Mao, William J. Dally

4. “Structured Pruning of Deep Convolutional Neural Networks”, Sajid Anwar, Kyuyeon Hwang, Wonyong Sung

5. “Deep Learning on Mobile Devices at the Embedded Vision Summit 2016”, Chris Longstaff

6. “Optimal Brain Damage”, Yann Le Cun, John S. Denker, Sara A. Solla

