Faster Style Transfer – PyTorch & CuDNN

In our previous blog, we showed how to create our own mini Deep Learning pipeline to train models using PyTorch. MNIST is great for rapidly prototyping and testing low-level ideas in Deep Learning, and it needs very little compute to train and test models. In this blog, let us move beyond the MNIST dataset and try solving another interesting challenge in Deep Learning, Style Transfer, and look at some stats on how we can achieve real-time inference on a desktop using only Python.

NOTE: If you would like to dive into the code right away, the code for this project is available at – LINK.

For solving Artistic Style Transfer with PyTorch, let us use a dataset of larger magnitude such as MS-COCO. Since there are many good blogs/articles online explaining Style Transfer, I would like to focus this blog on the tweaks needed to get efficient performance in both training and inference. The rest of this blog is organized as follows: we will quickly go through a naive definition of Style Transfer, take the code provided by the PyTorch examples and convert it into the pipeline we discussed in the ‘Intro To PyTorch’ blog, quickly train the model with minimal hyper-parameter tuning, and save the trained model. Finally, we will load the saved model in inference mode and use a webcam feed to perform (maybe) real-time style transfer.

What is Style Transfer?

Artistic Style Transfer

As the picture above shows, in a style transfer application we train a network to render the input (content) image in the desired style. If you are interested in learning more about this, please feel free to read the PyTorch tutorial on Neural Style.

Model Training

  • Clone the repo: StyleTransfer-PyTorch
  • Download the dataset from the MS-COCO website, put it in the data/ directory, and use styles of your choice.
  • Your directory structure should look like this before training.
.
├── main.py (Entry point to the application)
├── net.py  (Net class for init/train/test DL models)
├── models/ (Directory containing different DL arch.)
|   ├── __init__.py
|   ├── transformer_net.py (Style Transfer Network)
|   ├── vgg.py 
|   └── ...
├── loaders/ (Directory containing data/model loaders)
|   ├── __init__.py
|   ├── data_loader.py (DataLoader class for loading data)
|   └── model_loader.py (ModelLoader class for loading models)
├── data/ (Directory containing data)
|   ├── coco/
|   |   └──train2014/ (Content images!)
|   └── styles/ (Directory for styles)
├── LICENSE  (License of your choice)
└── README.md(Proper documentation for Setup, Running & Results)
  • Create the Anaconda environment using the instructions given in – Intro To PyTorch.
  • Run – python main.py --phase train, to train the model.

A few things to consider for faster training –

  • Batch Size: You can play with the train_batch_size argument for faster training.
  • Workers for Data Loading: You can increase the num_workers argument to raise the number of concurrent workers for data pre-processing (NOTE: this doesn’t work on Windows). See the DataLoader sketch after the CuDNN snippet below.
  • (Optional) Image Pre-processing backend: In PyTorch, we use the TorchVision module to ease image pre-processing, and by default TorchVision uses the Pillow backend. You can replace it with Pillow-SIMD to make image pre-processing faster (NOTE: this might involve a hectic setup on Windows; an alternative is to use OpenCV instead of Pillow).
  • CuDNN backend: Make sure to set your backend to CuDNN if you are running your training on an Nvidia GPU. Also, set the CuDNN benchmark flag to True for the best performance in both training and inference. The following code block will do the magic.
if torch.cuda.is_available():
    # Sanity check - Empty CUDA Cache
    torch.cuda.empty_cache()
    # Enforce CUDNN backend
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.enabled = True
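
As a rough illustration of the batch-size and worker tips above, here is a hedged sketch of a throughput-oriented DataLoader (the dataset path, transform, and values are placeholders rather than the repo’s exact code):

# Hypothetical sketch: a DataLoader tuned for throughput (values are illustrative).
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_dataset = datasets.ImageFolder('data/coco/', transform=transforms.Compose([
    transforms.Resize(256),       # resize the shorter side
    transforms.CenterCrop(256),   # square crop so images can be batched
    transforms.ToTensor(),
]))

train_loader = DataLoader(
    train_dataset,
    batch_size=6,        # the train_batch_size knob
    shuffle=True,
    num_workers=4,       # background pre-processing workers (keep 0 on Windows)
    pin_memory=True,     # speeds up host-to-GPU copies
)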

On an RTX 2060 GPU, training for 2 epochs with a batch size of 6 takes roughly 40 minutes.

Please note that the above metrics are for Windows 10. In my case, the bottleneck is data loading. You can speed up the DataLoader either by setting num_workers > 1 or by storing mini-batches of pre-processed intermediate tensor representations of the images in pickle or h5py format. Since the DataLoader topic is outside the scope of this blog, we can review it in the future.
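
If you want to try the caching idea, a minimal sketch (the file names and cache directory are made up for illustration) could pre-process each mini-batch once, save it, and reload it cheaply in later epochs:

# Hypothetical sketch: cache pre-processed mini-batches once, reload them cheaply later.
import torch

for i, (batch, _) in enumerate(train_loader):
    # batch is already a Float32 NCHW tensor after the transforms
    torch.save(batch, 'data/cache/batch_{:05d}.pt'.format(i))

# Later, during training, loading a cached batch skips decoding and resizing entirely:
batch = torch.load('data/cache/batch_00000.pt')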

Model Inference

Here comes my favorite part! Real-time inference of Deep Learning models is probably one of the least covered areas in open-source blogs; honestly, I have seen very little documentation on this topic online. Since this area deserves attention, given the growing adoption of Deep Learning models in production, I thought it would be a good idea to discuss it. In the following sections, we will start with a very simple inference pipeline and tweak every individual element of it to get the desired real-time experience. Since the topics covered here may be new to a lot of people, I will try to keep things concise and clear. Further, to keep it simple, I am implementing everything using only Python.

Rule #1: If you are using a laptop instead of a desktop or cloud VM, please make sure it is connected to a constant power source and running in High Performance mode. Running the laptop on battery may limit the power supplied to the GPU and can lead to GPU throttling.

Simple Inference Pipeline – Webcam

Let us write a simple inference pipeline using the webcam feed as input to the style transfer model. The inference pipeline for this scenario looks as follows:

Initialize Webcam
while True:
    FETCH frame from webcam
    PREPROCESS the frame
    INFERENCE by passing frame into DL Model
    RENDER results on screen

The final code for this section is available at – LINK.

1. Initial Inference Pipeline

# Load model in eval mode
model.eval()

# Setup content transform
content_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.mul(255))
])

# Initialize the camera
camera = cv2.VideoCapture(0)

with torch.no_grad():
    while True:
        # Fetch
        _, frame = camera.read()
        # Preprocess
        content_image = content_transform(frame)
        content_image = content_image.unsqueeze(0).cuda()
        # Predict
        output = model(content_image)
        # Postprocess
        output = output.cpu().detach()[0].clamp(0, 255).numpy().transpose(1,2,0).astype("uint8")
        
        # Render results
        cv2.imshow('Frame', output)
        k = cv2.waitKey(1)
        if k==27:
            break
    camera.release()

If you run the above block, you will achieve roughly 15.7 FPS (i.e., roughly 63.5 milliseconds per frame). There are a few ways to optimize this: either keep optimizing the model architecture by trying several types of layer combinations, layer fusions, etc., or first check how optimized your end-to-end pipeline is. In this blog, we will emphasize optimizing the end-to-end pipeline. A detailed analysis of all the experiments is provided in the Jupyter Notebook – LINK.
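
These numbers are easy to reproduce yourself; a minimal sketch of the measurement (the 100-frame window is arbitrary) looks like this:

# Hypothetical sketch: measure end-to-end FPS of the naive pipeline over a window of frames.
import time

num_frames = 100
start = time.perf_counter()
for _ in range(num_frames):
    _, frame = camera.read()
    content_image = content_transform(frame).unsqueeze(0).cuda()
    with torch.no_grad():
        output = model(content_image)
    output = output.cpu().detach()[0].clamp(0, 255).numpy().transpose(1, 2, 0).astype("uint8")
elapsed = time.perf_counter() - start
print("{:.1f} FPS ({:.1f} ms/frame)".format(num_frames / elapsed, 1000 * elapsed / num_frames))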

2. Optimize Preprocessing

In preprocessing, we convert the webcam frame from UInt8 HWC format to a tensor representation (Float32 NCHW format, where N is the number of examples, C the number of channels, H the height, and W the width of the image). In the original implementation, we use the TorchVision transforms with the Pillow backend to achieve this, which takes roughly 8.6 milliseconds per frame. We can speed up some of these ops by replacing Pillow with OpenCV + Numpy ops. Upon further investigation, we observed that converting UInt8 to Float32 is the most costly step in the preprocessing phase, and that transferring the UInt8 tensor to the GPU (a quarter of the data) is faster than transferring the Float32 one.

Rule #2: Minimize data transfers between the CPU and GPU. It is also worth noting that GPU clock speeds are lower than the CPU’s. So, design your computations wisely!

After testing a few ideas, we concluded that transferring the UInt8 tensor to the GPU and converting it to Float32 on the GPU is much faster than converting UInt8 to Float32 on the CPU and then transferring the Float32 tensor to the GPU. The following code block summarizes this idea.

# Preprocess
frame = frame.swapaxes(1, 2).swapaxes(0, 1) # HWC -> CHW
frame = frame[np.newaxis, :, :, :]          # CHW -> Numpy NCHW
content_image = torch.from_numpy(frame)     # Numpy -> Torch Tensor
content_image = content_image.cuda()        # CPU (UInt8) -> GPU (Byte)
content_image = content_image.type(torch.cuda.FloatTensor)

A detailed explanation of the above ops with runtimes can be found in the following notebook. In the next section, we will see how we optimized the post-processing phase of the pipeline.

3. Optimize Post-processing

Thanks to Python, our entire post-processing can be written in a single line of code: output = output.cpu().detach()[0].clamp(0, 255).numpy().transpose(1,2,0).astype("uint8"). Here is what this line does:

  • cpu() – Copy the Float32 tensor from GPU to CPU.
  • detach()[0] – Stop tracking gradients (refer this) and take the first element of the batch, leaving a CHW tensor.
  • clamp(0, 255) – Clamp the values to the valid image pixel range.
  • numpy() – Convert the Torch tensor to a Numpy array.
  • transpose(1,2,0) – A variant of swapaxes which converts CHW to HWC.
  • astype("uint8") – Numpy’s way of changing the data type of the array.

This initial implementation takes roughly 4 milliseconds. That might seem fast, but if we design the ops properly we can get the entire post-processing to run in 0.5 milliseconds. Here is how: clamp() is a per-element operation that can be trivially parallelized, so simply push this op to the GPU. From the preprocessing optimization we have seen that type conversions are faster on the GPU than on the CPU, so convert the tensor from Float to Byte on the GPU before transferring it to the CPU. Finally, do the rest of the ops on the CPU. Here is the summary of all this in a single line of code – output = output.clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach()[0].numpy().transpose(1,2,0).
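
One caveat if you time these variants yourself: CUDA calls are asynchronous, so bracket the op with torch.cuda.synchronize() to get honest numbers. A rough sketch:

# Hypothetical sketch: time the GPU-side clamp + cast correctly by synchronizing around it.
import time

torch.cuda.synchronize()                                  # wait for pending GPU work
start = time.perf_counter()
out = output.clamp(0, 255).type(torch.cuda.ByteTensor)   # clamp + cast on the GPU
torch.cuda.synchronize()                                  # wait for this op to finish
print("GPU clamp + cast: {:.3f} ms".format(1000 * (time.perf_counter() - start)))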

4. Async Webcam Frame Extraction

Know your hardware limits! A typical consumer webcam provides frames at only 30 FPS; there is no way to speed that up. But by carefully moving webcam frame extraction onto a separate thread, you can remove the overhead of waiting for frames from the webcam. That can save us up to 33.3 milliseconds per frame, which can be spent on other costly operations such as model inference. You can look at VideoCaptureAsync to understand the implementation. There are many other ways to implement asynchronous webcam frame extraction, but this article, LINK, does a really good job of explaining it.
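
The idea behind VideoCaptureAsync is simply to keep camera.read() running on its own thread and hand the latest frame to the main loop on demand; a minimal sketch of that idea (not the exact implementation linked above) could look like this:

# Hypothetical sketch of an asynchronous webcam reader (not the exact linked implementation).
import threading
import cv2

class VideoCaptureAsync:
    def __init__(self, src=0):
        self.cap = cv2.VideoCapture(src)
        self.grabbed, self.frame = self.cap.read()
        self.running = False
        self.lock = threading.Lock()

    def start(self):
        self.running = True
        self.thread = threading.Thread(target=self._update, daemon=True)
        self.thread.start()
        return self

    def _update(self):
        # Keep fetching frames in the background so the main loop never waits on the camera.
        while self.running:
            grabbed, frame = self.cap.read()
            with self.lock:
                self.grabbed, self.frame = grabbed, frame

    def read(self):
        # Return a copy of the most recent frame without blocking on the camera.
        with self.lock:
            return self.grabbed, self.frame.copy()

    def release(self):
        self.running = False
        self.thread.join()
        self.cap.release()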

Rule #3: Try to keep data loading on a separate thread. You can use that time for heavy compute such as model inference!

5. Putting it all together

Here is the final reference implementation combining everything we discussed above. It speeds up our inference from 15.7 FPS to 21.3 FPS, which might not seem like a big speed-up, but we have now optimized almost every phase of the inference pipeline except the model architecture itself. I consider model optimization a more advanced topic and am leaving it for future blogs.

# Load model in eval mode
model.eval()

# Setup content transform
content_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.mul(255))
])

# Initialize the camera - Async
camera = VideoCaptureAsync(0)
camera.start()

with torch.no_grad():
    while True:
        # Fetch
        _, frame = camera.read()
        # Preprocess (Optimized)
        frame = frame.swapaxes(1, 2).swapaxes(0, 1)
        frame = frame[np.newaxis, :, :, :]
        content_image = torch.from_numpy(frame)
        content_image = content_image.cuda()
        content_image = content_image.type(torch.cuda.FloatTensor)
        # Predict
        output = model(content_image)
        # Postprocess - Optimized
        output = output.clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach()[0].numpy().transpose(1,2,0)
        
        # Render results
        cv2.imshow('Frame', output)
        k = cv2.waitKey(1)
        if k==27:
            break
    camera.release()

You can further optimize this implementation by using CUDA Streams or asyncio, but I am leaving those topics for the reader to explore. Also, if you design your model architecture in a way that gets the most benefit out of CuDNN, you can achieve upwards of 47 FPS. I am leaving this as an exercise for the readers to experiment with. The following video shows the style transfer application running in real time. Please switch to “HD” before playing, as WordPress’s default video encoding degrades the quality.

Real-time Style Transfer (640×480 @ 47 FPS) [NOTE: Play in HD for better experience]

Thank you for reading this post, if you want to stay up-to-date with my future articles, you can subscribe by entering your email below.


The source code for this blog is available at LINK.


Intro To PyTorch

2018 has been a revolutionary year for the field of Deep Learning, especially with the release of new libraries and numerous features in existing ones. Let us quickly go through some current Deep Learning libraries: TensorFlow, PyTorch, Apache MXNet, Chainer, TuriCreate, and CNTK. There are also wrappers written around these libraries to simplify the use and creation of deep learning architectures, such as Keras and fast.ai.

My favorite part is the comfort these libraries provide: they simplify deep learning architecture design and multi-GPU/distributed training, ease the creation of custom layers and custom loss functions, ship a model zoo of pre-trained models, and support converting trained models from library-specific formats to platform-specific ones {iOS, Android, and Raspberry Pi}.

It looks like we have a lot to cover to keep ourselves on our toes and keep pace with current market trends in Deep Learning, and I, after a long gap, would like to start with an introduction to PyTorch. The contents of this blog are as follows:

  • Environment Setup
  • Introductory (modular) code as a skeleton for DL applications.
  • Data Loading
  • Model Training and Testing
  • Visualization
  • Model Saving and Loading

For simplicity, we are covering only a subset of things in this blog to get your wheels rolling in the field, aiming for a high-level overview that can eventually help you design your own Deep Learning toolkit.

Environment Setup

There are many ways to set up your Deep Learning environment –

  • Cloud VMs – Microsoft Azure, Google Cloud or Amazon’s AWS provide specialized VMs with pre-configured tools and libraries that can help you with your DL/ML journey.
  • Custom Setup – You can either have a local machine with decent GPUs or a cloud VM (with OS of your choice), and setup your deep learning environment by yourself using either virtual-env or Anaconda.

My personal choice for environment setup is Anaconda (for both Cloud VM and Custom Setup). You can download and install Anaconda from the following link – https://www.anaconda.com/distribution/. After installing Anaconda, you can set up your working environment using the following commands –

$ conda create -n oml python=3.6
$ conda activate oml
(oml) $ conda install pytorch torchvision -c pytorch
(oml) $ conda install tensorflow-gpu
(oml) $ conda install tensorboardX

Skeleton Code for DL applications

Long gone are the days when we created a single Python script for our Deep Learning models. To efficiently train large architectures on larger datasets, it is far less painful to follow a modular code pattern. A lot of open-source contributors already follow a specific pattern, which can look overwhelming to newcomers. For simplicity, I would like to use this section to describe how I organize my Deep Learning projects. Please note that it is up to the individual to organize their projects; I am just providing the structure that works best for my workflow. Here is a skeleton of my project structure –

.
├── main.py (Entry point to the application)
├── net.py  (Net class for init/train/test DL models)
├── models/ (Directory containing different DL arch.)
|   ├── __init__.py
|   ├── <model_name>.py
|   └── ...
├── loaders/ (Directory containing data/model loaders)
|   ├── __init__.py
|   ├── data_loader.py (DataLoader class for loading data)
|   └── model_loader.py (ModelLoader class for loading model)
├── LICENSE  (License of your choice)
└── README.md (Proper documentation for Setup, Running & Results)

The code for this repo can be seen at – LINK. main.py contains the entry point for the entire pipeline, with the arguments needed to train/infer the Deep Learning models. net.py contains the Net class, which uses one of those arguments to load the dataset and model structure for further computations.
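
To make the entry point concrete, here is a hedged sketch of what main.py could look like; the argument names mirror the ones mentioned in this blog, but the real file in the repo may differ:

# Hypothetical sketch of main.py: parse arguments and hand off to the Net class.
import argparse

from net import Net

def parse_args():
    parser = argparse.ArgumentParser(description='Mini Deep Learning pipeline')
    parser.add_argument('--phase', type=str, default='train', choices=['train', 'test'])
    parser.add_argument('--data_name', type=str, default='mnist')
    parser.add_argument('--data_dir', type=str, default='./data')
    parser.add_argument('--train_batch_size', type=int, default=64)
    parser.add_argument('--test_batch_size', type=int, default=1000)
    parser.add_argument('--continue_train', type=int, default=0)
    parser.add_argument('--multi_gpu', type=int, default=0)
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_args()
    net = Net(args)
    if args.phase == 'train':
        net.train()
    else:
        net.test()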

Data Loading

Loading data efficiently for training and testing used to be a big hassle. In PyTorch, loading and handling data has become easy thanks to torch.utils.data.DataLoader, and torchvision.datasets makes a lot of popular datasets easy to load. Sample code for loading the MNIST data can be written as follows –

def loadMNIST(self, args):
    self.train_loader = torch.utils.data.DataLoader(datasets.MNIST(args.data_dir, 
                                                    train=True, download=True,
                                                    transform=transforms.Compose([
                                                    transforms.ToTensor(),
                                                    transforms.Normalize((0.1307,), (0.3081,))
                                                    ])), batch_size=args.train_batch_size, shuffle=True, **self.kwargs)
    
    self.test_loader = torch.utils.data.DataLoader(datasets.MNIST(args.data_dir, 
                                                    train=False, 
                                                    transform=transforms.Compose([
                                                    transforms.ToTensor(),
                                                    transforms.Normalize((0.1307,), (0.3081,))
                                                    ])), batch_size=args.test_batch_size, shuffle=True, **self.kwargs)

Model Training and Testing

net.py contains the helper code for training and testing the initialized model. This is where we initialize the required model, load the datasets needed for training and testing, load the necessary optimizer, and load/save models. One interesting thing to look at is the _build_model(self) method. For training, we can either load a pretrained model or start training from scratch, and while loading the model we can choose to run training/inference on the GPU or CPU. If multiple GPUs are available, PyTorch’s nn.DataParallel can help with easy multi-GPU training.

def _build_model(self):
    # Load the model
    _model_loader = ModelLoader(self.args)
    self.model = _model_loader.model

    # If continue_train, load the pre-trained model
    if self.args.phase == 'train':
        if self.args.continue_train:
            self.load_model()

    # If multiple GPUs are available, automatically include DataParallel
    if self.args.multi_gpu and torch.cuda.device_count() > 1:
        self.model = nn.DataParallel(self.model)
    self.model.to(self.device)
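
For completeness, a single training epoch inside net.py could be as simple as the following sketch; the optimizer, the nll_loss choice (suited to an MNIST-style classifier), and the logging cadence are assumptions, not the repo’s exact code:

# Hypothetical sketch of one training epoch inside the Net class
# (assumes `import torch.nn.functional as F` and that self.optimizer was created earlier).
def train_one_epoch(self, epoch):
    self.model.train()
    for batch_idx, (data, target) in enumerate(self.train_loader):
        data, target = data.to(self.device), target.to(self.device)
        self.optimizer.zero_grad()
        output = self.model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        self.optimizer.step()
        if batch_idx % 100 == 0:
            print('Epoch {} [{}/{}]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(self.train_loader.dataset), loss.item()))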

Visualization

While training large deep neural networks, it helps to visualize the loss, accuracy, and other important metrics so we can debug our networks. TensorBoard really comes in handy for this, and we can integrate TensorBoard into our PyTorch pipeline with the help of TensorBoardX. The code for integrating it is as easy as –

# Initialize summary writer
self.writer = SummaryWriter('runs/{}'.format(self.args.data_name))

# Add the values to Summary Writer
self.writer.add_scalar('train/loss', loss.item(), self.args.iter_count)

You can start the TensorBoard session and run the training by using the following command –

(oml) $ tensorboard --logdir=./runs/ --host 0.0.0.0 --port 6007 & python main.py --phase train --continue_train 0

# Go to http://localhost:6007 to see the results.

Save/Load Models

It is important to save your models periodically during training. Long training runs may hit unexpected OS errors or out-of-memory errors, and saving trained models periodically can save a lot of the time and resources invested in the training phase. PyTorch model saving and loading is as easy as –

# Save the model
torch.save(self.model.state_dict(), model_filename)

# Load the (state_dict to) model
self.model.load_state_dict(torch.load(model_filename))
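
To save periodically rather than only at the end, one simple pattern (the checkpoint interval, directory, and attribute names are assumptions here) is:

# Hypothetical sketch: checkpoint every few epochs so a crash costs only a little progress.
for epoch in range(1, self.args.epochs + 1):
    self.train_one_epoch(epoch)
    if epoch % 5 == 0:
        model_filename = 'checkpoints/model_epoch_{:03d}.pth'.format(epoch)
        torch.save(self.model.state_dict(), model_filename)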

Source Code – LINK

The source code for this blog is open-source so that other DL enthusiasts can use it as a primer for their PyTorch projects.

Computer Vision in iOS – Object Detection

What is Object Detection?

In the past, I wrote a blog post on ‘Object Recognition’ and how to implement it in real time on an iPhone (Computer Vision in iOS – Object Recognition). So what is this blog about? In this blog, I will be discussing Object Detection.

In the object recognition problem, a deep neural network is trained to recognise what objects are present in an image. In object detection, the network not only recognises what objects are present in the image, but also detects where they are located in it. If object detection can be applied in real time, many problems in autonomous driving can be solved very easily.

In this blog, let us take a sneak peek into how we can use Apple’s CoreML to implement an Object Detection app on an iPhone 7. We will start by becoming familiar with coremltools; this might be a little confusing at first, but follow through and you shall reap the reward.

Apple has done an amazing job of giving us a tool to easily convert a model from the library of our choice to a CoreML model. So, before starting this blog, I would like to thank the CoreML team for turning a task that I once felt would take a year into a one-weekend project.

The theme of this blog is an object detection pipeline that runs in real time (on an iPhone 7) and can be embedded into a driving application. When I say I want to detect objects that might appear in front of my car, that could mean cars, pedestrians, trucks, etc.; among them, I only want to detect cars today. Thanks to Udacity’s Self-Driving Car Nanodegree, some students of the program have open-sourced their amazing projects on GitHub.

Object Detection has received a lot of attention in deep learning research, and as a result some amazing papers have been published on the topic. While some papers focused solely on accuracy, others focused on real-time performance. If you want to explore this area further, you can read the papers on: R-CNN, Fast R-CNN, Faster R-CNN, YOLO, YOLO-9000, SSD, MobileNet SSD. 🙂

What is the best network?

I have mentioned seven different networks above! Which of them is the best one to implement? 🤔 Huh! It is a very difficult question, especially when you don’t want to waste your time implementing every network by brute force and checking the results. So, let us do some analysis to select the best network to implement. What actually happens in an object detection pipeline?

  • Pre-processing: Fetch frame from the camera, and do some image processing (scale and resize) operations before sending it into the network.
  • Processing: This is the inference stage. We will pass the pre-processed image into the CoreML model and fetch the results generated by it.
  • Post-processing: The results generated by the CoreML model will be in MLMultiArray format, and we need to do some processing on that array of doubles to get the bounding box locations and their class predictions and confidences.

Since I am targeting a mobile phone, I should concentrate on real-time networks. So I can stop considering all the networks that do not run at a decent FPS even on a GPU-equipped machine.

Rule 1:

If a network can run real-time on computer (GPU or CPU), then it is worth giving it a shot.

This rule strikes R-CNN and Fast R-CNN from the above list. Although five networks remain, they can be broadly classified into Faster R-CNN, YOLO, and SSD. Both YOLO and SSD show better run-time performance than Faster R-CNN (check their papers for the run times of these models). Now we are left with two major options: YOLO and SSD. I started this object detection project when Apple’s coremltools v0.3.0 was on the market. There wasn’t extensive documentation, and it only supported Keras 1.2.2 (with TensorFlow v1.0 & v1.1 as backend) and Caffe v1.0 for neural networks. The CoreML team is constantly releasing updates to coremltools, and it is currently at v0.4.0 with Keras 2.0.4 support. So, I should wisely choose a network with simple layers (only convolutions) and no fancy operations such as deconvolutions, dilated convolutions, or depth-wise convolutions. In this way, YOLO won over my personal favourite – SSD.

YOLO – You Only Look Once:

In this blog, I will be implementing Tiny YOLO v1, and I am keeping the YOLO v2 implementation for some other time in the future. Let us familiarise ourselves with the network that we are going to use 😉 Tiny YOLO v1 consists of 9 convolutional layers followed by 3 fully connected layers, summing to ~45 million parameters.

[Figure: Tiny YOLO v1 architecture]

It is quite big compared to Tiny YOLO v2, which has only ~15 million parameters. The input to this network is a 448 x 448 x 3 RGB image and the output is a vector of length 1470. The vector is divided into three parts: class probabilities, confidence, and box coordinates. Each part is in turn divided into 49 small regions, which correspond to predictions at each cell of a 7 x 7 grid on the image.

[Figure: structure of the 1470-length output vector]
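
To make those numbers concrete, here is a hedged sketch of how the 1470-length vector can be split; the offsets follow the standard YOLO v1 layout (7×7 grid, 20 classes, 2 boxes per cell), so treat the exact slicing as an assumption to verify against the repo you use:

# Hypothetical sketch: split the 1470-length YOLO v1 output into its three parts.
import numpy as np

def split_yolo_output(output):
    # 7x7 grid, 20 classes, 2 boxes per cell -> 980 + 98 + 392 = 1470 values
    class_probs = np.reshape(output[0:980], (7, 7, 20))     # per-cell class probabilities
    confidences = np.reshape(output[980:1078], (7, 7, 2))   # per-box objectness confidence
    boxes = np.reshape(output[1078:], (7, 7, 2, 4))         # per-box (x, y, w, h)
    return class_probs, confidences, boxes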

Enough theory; now let us get our hands dirty with some coding.

Pre-requisites: If you are reading this blog for the first time, please visit my previous blog – Computer Vision in iOS – CoreML+Keras+MNIST – for setting up the working environment on your Mac. As training the YOLO network takes a lot of time and effort, we are going to use pre-trained network weights to design the CoreML Tiny YOLO v1 model. You can download the weights from the following link.

After downloading the weights, create a master directory with the name of your choice and move the downloaded weights file into it.

  • Let us import some necessary libraries.
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import cv2

import keras
from keras.models import Sequential
from keras.layers.convolutional import Conv2D, MaxPooling2D, Convolution2D  # Convolution2D (alias of Conv2D) is used in yoloP1P2P3 below
from keras.layers.advanced_activations import LeakyReLU
from keras.layers.core import Flatten, Dense, Activation, Reshape, Permute  # Permute is used in YOLO Part 2 below
  • Define the Tiny YOLO v1 model.
def yolo(shape):
    model = Sequential()
    model.add(Conv2D(16,(3,3),strides=(1,1),input_shape=shape,padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(32,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
    model.add(Conv2D(64,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
    model.add(Conv2D(128,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
    model.add(Conv2D(256,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
    model.add(Conv2D(512,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
    model.add(Conv2D(1024,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Conv2D(1024,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Conv2D(1024,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Flatten())
    model.add(Dense(256))
    model.add(Dense(4096))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Dense(1470))
    return model
  • Now let us write a helper function to load the weights from the ‘yolo-tiny.weights’ file into the model.
# Helper function to load weights from weights-file into YOLO model
def load_weights(model,yolo_weight_file):
    data = np.fromfile(yolo_weight_file,np.float32)
    data=data[4:]

    index = 0
    for layer in model.layers:
        shape = [w.shape for w in layer.get_weights()]
        if shape != []:
            kshape,bshape = shape
            bia = data[index:index+np.prod(bshape)].reshape(bshape)
            index += np.prod(bshape)
            ker = data[index:index+np.prod(kshape)].reshape(kshape)
            index += np.prod(kshape)
            layer.set_weights([ker,bia])
  • The pre-trained weights follow Theano’s image dimension ordering, so we have to set the image dimension ordering to Theano before building the model and loading the weights into it.
# Load the initial model
keras.backend.set_image_dim_ordering('th')
shape = (3,448,448)
model = yolo(shape)
print "Theano mode summary: \n",model.summary()
load_weights(model,'./yolo-tiny.weights')

Layer dimensions:

I mentioned above that the model follows Theano’s image dimension ordering. If you wonder what this is all about, then let me introduce you to 3D visualisations! A general 1D signal is represented using vectors or 1D arrays, and its size is simply the length of the array/vector. Images are 2D signals, represented as 2D arrays or matrices; the size of an image is given by the width x height of the matrix. When we talk about convolutional layers, we are talking about 3D data structures: a convolutional layer’s activations are 2D matrices stacked behind one another, and this new dimension is called the depth of the layer. By analogy, an RGB image is the combination of three 2D matrices placed behind one another (width x height x 3); in images we call these channels, and in convolutional layers, depth. Ok, this makes sense, but why are we discussing image dimension ordering 😐 ? The two major libraries used for Deep Learning are Theano and TensorFlow, and Keras is a wrapper built over both of them that gives us the flexibility to use either. Apple’s coremltools supports Keras with the TensorFlow backend. The image dimension ordering of TensorFlow is height x width x depth, while that of Theano is depth x height x width. So, if we want to convert our model from Keras-with-Theano to a CoreML model, we first need to convert it to Keras-with-TensorFlow. Understanding the dimensions of the weight matrices is harder than understanding image dimensions, but transforming those weights from one model to the other is taken care of inside Keras 2.0.4.

The real challenge is that converting this model from the Theano backend to the TensorFlow backend is not straightforward! Fortunately, I found helper code that can transfer the weights from a Theano layer to a TensorFlow layer. But if we observe the model closely, it is a combination of convolutional layers and fully connected layers, and before moving into the fully connected layers we flatten the output of the convolutional layers. This can become tricky and cannot be automated unless our brain can easily visualise the 3D and 4D dimensions. For simplicity and easy debugging, let us break this single model into 3 separate chunks.

  • Part 1 – Consists of all convolutional layers.
  • Part 2 – Consists of the operations required for flattening the output of Part 1.
  • Part 3 – Consists of fully connected layers applied on the output of Part 2.

Let us first define the models for Part 1 and Part 3.

# YOLO Part 1
def yoloP1(shape):
    model = Sequential()
    model.add(Conv2D(16,(3,3),strides=(1,1),input_shape=shape,padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(32,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
    model.add(Conv2D(64,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
    model.add(Conv2D(128,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
    model.add(Conv2D(256,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
    model.add(Conv2D(512,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
    model.add(Conv2D(1024,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Conv2D(1024,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Conv2D(1024,(3,3),padding='same'))
    model.add(LeakyReLU(alpha=0.1))
    return model

# YOLO Part 3
def yoloP3():
    model = Sequential()
    model.add(Dense(256,input_shape=(50176,)))
    model.add(Dense(4096))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Dense(1470))
    return model
  • Let us initialise three networks in Keras with TensorFlow as the backend: YOLO Part 1 (model_p1), YOLO Part 3 (model_p3), and YOLO Full (model_full). Also, to test whether these networks work correctly or not, let us initialise the Part 1 and Part 3 models with the Theano backend as well.
# Let us get Theano backend edition of Yolo Parts 1 and 3
model_p1_th = yoloP1(shape)
model_p3_th = yoloP3()
# Tensorflow backend edition
keras.backend.set_image_dim_ordering('tf')
shape = (448,448,3)
model_p1 = yoloP1(shape)
model_p3 = yoloP3()
model_full = yolo(shape)
  • In our earlier discussion I mentioned that the dimensions of convolutional layers differ between Theano and TensorFlow. So let us write a program to convert the weights from Theano’s ‘model’ to TensorFlow’s ‘model_full’.
# Transfer weights from Theano model to TensorFlow model_full
for th_layer,tf_layer in zip(model.layers,model_full.layers):
    if th_layer.__class__.__name__ == 'Convolution2D':
        kernel, bias = th_layer.get_weights()
        kernel = np.transpose(kernel,(2,3,1,0))
        tf_layer.set_weights([kernel,bias])
    else:
        tf_layer.set_weights(th_layer.get_weights())
  • Before moving to the next phase, let us do a simple test to find out whether the outputs of Theano’s model and TensorFlow’s model_full match. To do this, let us read an image, pre-process it, and pass it into both models to predict the output.
# Read an image and pre-process it
im = cv2.imread('test1.jpg')
plt.imshow(im[:,:,::-1])
plt.show()
im = cv2.resize(im,(448,448))
im = 2*im.astype(np.float32)/255. - 1
im = np.reshape(im,(1,448,448,3))
im_th = np.transpose(im,(0,3,1,2))

# Theano
output_th = model.predict(im_th)
# TensorFlow
output_tf = model_full.predict(im)

# Distance between two predictions
print 'Difference between two outputs:\nSum of Difference =', np.sum(output_th-output_tf),'\n2-norm of difference =',np.linalg.norm(output_th-output_tf)
  • By running the above code, I found that for a given image the outputs of Theano and TensorFlow vary a lot. If the outputs matched, the ‘Sum of Difference’ and ‘2-norm of difference’ would both be 0.
  • Since direct conversion does not help, let us move to part-based model design. First, let us start with Tiny YOLO Part 1.
# Theano
model_p1_th.layers[0].set_weights(model.layers[0].get_weights())
model_p1_th.layers[3].set_weights(model.layers[3].get_weights())
model_p1_th.layers[6].set_weights(model.layers[6].get_weights())
model_p1_th.layers[9].set_weights(model.layers[9].get_weights())
model_p1_th.layers[12].set_weights(model.layers[12].get_weights())
model_p1_th.layers[15].set_weights(model.layers[15].get_weights())
model_p1_th.layers[18].set_weights(model.layers[18].get_weights())
model_p1_th.layers[20].set_weights(model.layers[20].get_weights())
model_p1_th.layers[22].set_weights(model.layers[22].get_weights())

# TensorFlow
model_p1.layers[0].set_weights(model_full.layers[0].get_weights())
model_p1.layers[3].set_weights(model_full.layers[3].get_weights())
model_p1.layers[6].set_weights(model_full.layers[6].get_weights())
model_p1.layers[9].set_weights(model_full.layers[9].get_weights())
model_p1.layers[12].set_weights(model_full.layers[12].get_weights())
model_p1.layers[15].set_weights(model_full.layers[15].get_weights())
model_p1.layers[18].set_weights(model_full.layers[18].get_weights())
model_p1.layers[20].set_weights(model_full.layers[20].get_weights())
model_p1.layers[22].set_weights(model_full.layers[22].get_weights())

# Theano
output_th = model_p1_th.predict(im_th)
# TensorFlow
output_tf = model_p1.predict(im)

# Dimensions of output_th and output_tf are different, so apply transpose on output_th
output_thT = np.transpose(output_th,(0,2,3,1))

# Distance between two predictions
print 'Difference between two outputs:\nSum of Difference =', np.sum(output_thT-output_tf),'\n2-norm of difference =',np.linalg.norm(output_thT-output_tf)
  • By running the above code, we can see that the outputs of both models match exactly, so we have successfully completed Part 1 of our model! Now let us move to Part 3. By carefully observing the model summaries of model_p3 and model_p3_th, it is quite obvious that both models are similar; hence, for a given input, both should give us the same output. But what is the input to these models? Ideally it should come from YOLO Part 2, but YOLO Part 2 is just a flatten() layer, meaning that given any multi-dimensional input, the output will be a serialised 1D vector. Assuming we have serialised the output from model_p1_th, both model_p3 and model_p3_th should give us similar results.
# Theano
model_p3_th.layers[0].set_weights(model.layers[25].get_weights())
model_p3_th.layers[1].set_weights(model.layers[26].get_weights())
model_p3_th.layers[3].set_weights(model.layers[28].get_weights())

# TensorFlow
model_p3.layers[0].set_weights(model_full.layers[25].get_weights())
model_p3.layers[1].set_weights(model_full.layers[26].get_weights())
model_p3.layers[3].set_weights(model_full.layers[28].get_weights())

# Design the input
input_p3 = np.reshape(np.ndarray.flatten(output_th),(1,50176))

# Theano
output_th = model_p3_th.predict(input_p3)
# TensorFlow
output_tf = model_p3.predict(input_p3)

# Distance between two predictions
print 'Difference between two outputs:\nSum of Difference =', np.sum(output_th-output_tf),'\n2-norm of difference =',np.linalg.norm(output_th-output_tf)
  • We can observe that we get exactly the same results for both model_p3 and model_p3_th by running the above code.
  • Where is YOLO Part 2? Be patient, we are going to design it now 😅. Before designing YOLO Part 2, let us discuss dimensions a bit more. I have already mentioned that YOLO Part 2 is nothing but a simple flatten layer. What makes it so hard? If you remember, the whole network was designed with Theano as the backend, and we are just using those weights in our model with the TensorFlow backend. To understand the operation of the flatten layer, I am adding some code for you to play with. By running the code below, you can find out why our model_full gives weird results compared to model.
# Let us build a simple 3(width) x 3(height) x 3(depth) matrix and assume it as an output from Part 1
A = np.reshape(np.asarray([i for i in range(1,10)]),(3,3))
B = A + 10
C = A + 100

print 'A =\n',A,'\n\nB =\n',B,'\n\nC =\n',C

part1_output_tf = np.dstack((A,B,C))
print '\n\nTensorFlow\'s model_p1 output (assume) = \n',part1_output_tf

part1_output_th = np.transpose(part1_output_tf,(2,0,1))
print '\n\nTheano\'s model_p1_th output (assume) = \n',part1_output_th

print '\n\nDesired input for model_p3 =\n', part1_output_th.flatten()
print '\n\nActual input for model_p3 =\n', part1_output_tf.flatten()
  • Now we understand that applying the flatten layer is not as easy as expected. Here are a few ideas on how we can implement it:
    • Idea 1 – Fetch the output from Part 1 as an MLMultiArray and apply a custom flatten operation on the CPU in the iOS app. Too costly an operation!
    • Idea 2 – Design a model with a Permute layer + Flatten layer using Keras and convert it to a CoreML model. This can be done and, if it succeeds, everything can be designed as one single model.
    • Idea 3 – See what coremltools’ Neural Network Builder has to offer and try to implement the flatten layer with it. There is enough documentation to implement the flatten layer, but the three models can’t be combined into one single pipeline with the current documentation support. For each image frame, there would be three fetches of memory from GPU to CPU and three passes from CPU to GPU – not an efficient implementation.
  • One interesting thing I observed with Apple’s CoreML is that, even though CoreML supports only Keras with the TensorFlow backend, the output dimensions of an MLMultiArray look similar to the image dimensions used by Theano. That means the MLMultiArray holding YOLO Part 1’s output will be 1024 x 7 x 7 instead of 7 x 7 x 1024. This observation can be used while designing the Permute layer of Part 2.
# Keras equivalent of YOLO Part 2
def yoloP2():
    model = Sequential()
    model.add(Permute((2,3,1),input_shape=(7,7,1024)))
    model.add(Flatten())
    return model

model_p2 = yoloP2()
  • With this, we have all the three parts that can be combined to form one complete network. So, let us re-write the Tiny YOLO v1 network.
def yoloP1P2P3(shape):
    model = Sequential()
    model.add(Convolution2D(16, 3, 3,input_shape=shape,border_mode='same',subsample=(1,1)))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Convolution2D(32,3,3 ,border_mode='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),border_mode='valid'))
    model.add(Convolution2D(64,3,3 ,border_mode='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),border_mode='valid'))
    model.add(Convolution2D(128,3,3 ,border_mode='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),border_mode='valid'))
    model.add(Convolution2D(256,3,3 ,border_mode='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),border_mode='valid'))
    model.add(Convolution2D(512,3,3 ,border_mode='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(MaxPooling2D(pool_size=(2, 2),border_mode='valid'))
    model.add(Convolution2D(1024,3,3 ,border_mode='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Convolution2D(1024,3,3 ,border_mode='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Convolution2D(1024,3,3 ,border_mode='same'))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Permute((2,3,1)))
    model.add(Flatten())
    model.add(Dense(256))
    model.add(Dense(4096))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Dense(1470))
    return model

model_p1p2p3 = yoloP1P2P3(shape)

# TensorFlow
model_p1p2p3.layers[0].set_weights(model_full.layers[0].get_weights())
model_p1p2p3.layers[3].set_weights(model_full.layers[3].get_weights())
model_p1p2p3.layers[6].set_weights(model_full.layers[6].get_weights())
model_p1p2p3.layers[9].set_weights(model_full.layers[9].get_weights())
model_p1p2p3.layers[12].set_weights(model_full.layers[12].get_weights())
model_p1p2p3.layers[15].set_weights(model_full.layers[15].get_weights())
model_p1p2p3.layers[18].set_weights(model_full.layers[18].get_weights())
model_p1p2p3.layers[20].set_weights(model_full.layers[20].get_weights())
model_p1p2p3.layers[22].set_weights(model_full.layers[22].get_weights())
model_p1p2p3.layers[26].set_weights(model_full.layers[25].get_weights())
model_p1p2p3.layers[27].set_weights(model_full.layers[26].get_weights())
model_p1p2p3.layers[29].set_weights(model_full.layers[28].get_weights())
  • If we go back to our conversation about the three tasks that need to be done (pre-processing, processing, post-processing), converting the model from Keras to CoreML covers the processing part. How are we going to do the pre-processing then? Pre-processing consists of fetching the frame from the camera, resizing the image, changing its format to CVPixelBuffer, scaling the intensity values from 0–255 to -1 to 1, and passing it into the model. The scaling of the intensity values can be done directly inside the CoreML model, so let us include it during our conversion.
scale = 2/255.
coreml_model_p1p2p3 = coremltools.converters.keras.convert(model_p1p2p3,
                                                       input_names = 'image',
                                                       output_names = 'output',
                                                       image_input_names = 'image',
                                                       image_scale = scale,
                                                       red_bias = -1.0,
                                                       green_bias = -1.0,
                                                       blue_bias = -1.0)

coreml_model_p1p2p3.author = 'Sri Raghu Malireddi'
coreml_model_p1p2p3.license = 'MIT'
coreml_model_p1p2p3.short_description = 'Yolo - Object Detection'
coreml_model_p1p2p3.input_description['image'] = 'Images from camera in CVPixelBuffer'
coreml_model_p1p2p3.output_description['output'] = 'Output to compute boxes during Post-processing'
coreml_model_p1p2p3.save('TinyYOLOv1.mlmodel')
  • With this step, our Tiny YOLO v1 model is ready. The general computation of this model runs at an average of 17.8 FPS on an iPhone 7, and the output of the network is a vector of size 1470. I adopted some techniques from the references cited below and used the power of GCD and session queues in iOS to make the post-processing real-time.

Source Code & Results:

The whole source code for this project can be found at the following GitHub link. All the necessary files for converting the model, creating the environment, and a step-by-step tutorial are available. I also provided the iOS app in case you are interested in testing it on your iPhone. Here are some results:


This app wouldn’t have been completed without the following wonderful previous works:

  1. https://github.com/xslittlegrass/CarND-Vehicle-Detection
  2. https://github.com/cvjena/darknet
  3. https://pjreddie.com/darknet/yolo/
  4. http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf

Though the app gives decent results at a reasonable speed, there is always room to improve its performance, and if you have any suggestions, please feel free to comment your thoughts. 🙂

Computer Vision in iOS – CoreML 2.0 + Keras + MNIST

NOTE: This blog has been updated to CoreML 2.0 and Vision API.

Hello world! It has been quite a while since my last blog on object recognition on the iPhone – Computer Vision in iOS – Object Recognition. I have been experimenting a lot with a YOLO implementation on the iPhone 7 and got lost in time. I will be discussing how to implement YOLO (Object Detection) in my next blog, but this blog, though just number recognition, will help you understand how to write your own custom network from scratch using Keras and convert it to a CoreML model. Since you will be learning and experimenting with a lot of new things, I felt it is better to stick with a simple network with predictable results than to work with deep(errrr….) networks.

Problem Statement:

Given a 28×28 image of a handwritten digit, find a model that can predict the digit with high accuracy.

Pipeline Setup:

Before reading this blog further, you need a machine with macOS 10.14, iOS 12, and Xcode 10.

We need to set up a working environment on our machines for training, testing, and converting custom deep learning models to CoreML models. If you read the coremltools documentation – link – they suggest using virtualenv; I personally recommend Anaconda over virtualenv. If you prefer Anaconda, check this past blog of mine, which walks through a step-by-step process of setting up a conda environment for deep learning on Mac machines – TensorFlow Diaries – Intro and Setup. At present, Apple’s coremltools requires Python 3.6 for the environment setup. Open Terminal and type the following commands to set up the environment.

$ conda create -n coreml python=3.6
$ source activate coreml
(coreml) $ conda install pandas matplotlib jupyter notebook scipy scikit-learn opencv
(coreml) $ pip install tensorflow==1.5.0
(coreml) $ pip install keras==2.1.6
(coreml) $ pip install h5py
(coreml) $ pip install coremltools

Designing & Training the network:

For this part of the code, you can either create a Python file and follow along or check the Jupyter notebook I wrote for code + documentation.

  • First let us import some necessary libraries and make sure that the Keras backend is TensorFlow.
import numpy as np

import keras

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.utils import np_utils

# (Making sure) Set backend as tensorflow
from keras import backend as K
K.set_image_dim_ordering('tf')
  • Now let us prepare the dataset for training and testing.
# Define some variables
num_rows = 28
num_cols = 28
num_channels = 1
num_classes = 10

# Import data
(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(X_train.shape[0], num_rows, num_cols, num_channels).astype(np.float32) / 255
X_test = X_test.reshape(X_test.shape[0], num_rows, num_cols, num_channels).astype(np.float32) / 255

y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
  • Design the model for training.
# Model
model = Sequential()

model.add(Conv2D(32, (5, 5), input_shape=(28, 28, 1), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(128, (1, 1), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  • Train the model.
# Training
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=200, verbose=2)
  • Prepare model for inference by removing dropout layers.
# Prepare model for inference
for k in list(model.layers):  # iterate over a copy so removing items doesn't skip layers
    if type(k) is keras.layers.Dropout:
        model.layers.remove(k)
  • Finally save the model.
model.save('mnistCNN.h5')

Keras to CoreML:

To convert your model from Keras to CoreML, we need a few more steps. Our deep learning model expects a 28×28 normalised grayscale image as input and gives the class probabilities as output. Also, let us add a little more information to our model, such as the license, author, etc.

import coremltools

output_labels = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
scale = 1/255.
coreml_model = coremltools.converters.keras.convert('./mnistCNN.h5',
                                                   input_names='image',
                                                   image_input_names='image',
                                                   output_names='output',
                                                   class_labels=output_labels,
                                                   image_scale=scale)

coreml_model.author = 'Sri Raghu Malireddi'
coreml_model.license = 'MIT'
coreml_model.short_description = 'Model to classify hand written digit'

coreml_model.input_description['image'] = 'Grayscale image of hand written digit'
coreml_model.output_description['output'] = 'Predicted digit'

coreml_model.save('mnistCNN.mlmodel')
  • By executing the above code, you should see a file named ‘mnistCNN.mlmodel’ in your current directory.

Congratulations! You have designed your first CoreML model. With this information, you can design any custom model using Keras and convert it into CoreML model.
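
Before moving on to the app, you can sanity-check the converted model from Python on your Mac (prediction through coremltools only runs on macOS); the test image name here is just a placeholder:

# Hypothetical sketch: quick sanity check of the converted CoreML model on macOS.
import coremltools
from PIL import Image

mlmodel = coremltools.models.MLModel('mnistCNN.mlmodel')
digit = Image.open('sample_digit.png').convert('L').resize((28, 28))  # placeholder test image
prediction = mlmodel.predict({'image': digit})
print(prediction['classLabel'])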

iOS app:

Most of the content from here on is focused on app development, and I will explain only a few important things. If you want to go through a step-by-step process of setting up the pipeline for using CoreML in an iOS app, then I suggest you visit my previous blog – Computer Vision in iOS – Object Recognition – before reading further. The whole code is available online at – github repo. Similar to the Object Recognition app, I added a custom view named DrawView for writing digits through finger swipes (most of the code for this view was inspired by Apple’s Metal example projects). I added two buttons named ‘Clear’ and ‘Detect’, whose names reflect their functionality. As we discussed in the previous blog, CoreML requires the image in CVPixelBuffer format, so I added helper code that converts it into the required format. If you use Apple’s Vision API, it can take care of all these complex conversions between image formats automatically, but that consumes an additional 20% CPU compared to the method I propose. This 20% CPU usage will matter when you are designing a heavy, ML-oriented, real-time application 😛. Here are the results of the working prototype of my app –


Source Code:

If you like this blog and want to play with the app, the code for this app is available here – iOS-CoreML-MNIST.

 

Computer Vision in iOS – Object Recognition

NOTE: This blog has been updated to CoreML 2.0 and Vision API.

Problem Statement: Given an image, can a machine accurately predict what is there in that image?

Why is this so hard? If I show an image to a human and ask what is in it, he or she can say exactly what objects are present, where the picture was taken, what is special about it, and (if people are present) what they are doing and what they are going to do next. For a computer, a picture is nothing but a bunch of numbers, so it can’t understand the semantics of the image the way a human does. If, even after this explanation, the question – why is it so hard? – is still ringing in your head, then let me ask you to write an algorithm to detect (just) a cat!

Start with some basic assumptions – every cat has two ears, an oval face with whiskers on it, a cylindrical body, four legs, and a curvy tail! Perfect 🙂 We have our initial assumptions to start writing code! Assume we have written the code (say, 50 lines of if-else statements) to find primitives in an image which, when combined, form a cat that looks roughly as shown in the figure below (PS: Don’t laugh 😛 )

[Figure: a cat drawn from primitive shapes]

Ok, let us test the performance on some real-world images. Can our algorithm accurately predict the cat in this picture?

[Image: photo of a tabby cat]

If you think the answer is yes, I would suggest you think again. If you carefully observe the cat image made of primitive shapes, we have actually coded to find a cat that is turning towards its left. Ok! No worries! Write the exact same if-else conditions for a cat turning towards its right 😎 – just an extra 50 lines of conditions. Good! Now we have the cat detector! Can we detect the cat in this image? 😛

[Image: a cat in an unusual pose]

Well, the answer is no 😦 . So, to tackle these types of problems we move from basic conditionals to Machine Learning/Deep Learning. Machine Learning is a field where machines learn how to do specific tasks that only humans were capable of doing before. Deep Learning is a subset of Machine Learning in which we train very deep neural network architectures. A lot of researchers have already solved this problem, and there are some popular neural network architectures which do this specific task.

The real problem lies in porting such a network to a mobile architecture and making it run in real time. This is not an easy task: the convolutions in a CNN are costly, and then there is the size of the network itself (forget about it 😛 ). Companies like Google and Apple, along with a few research labs, have put heavy focus on optimizing the size and performance of neural networks, and we finally have some decent results that make neural networks run at reasonable speed on mobile phones. Still, there is a lot of exciting research left to be done in this field. After Apple's WWDC '17 keynote, building an app that solves this particular problem went from a year-long effort to a single-night effort. Enough theory and facts, let us dive into the code!

To follow this blog from here, you need to have the following things ready:

  1. MacOS 10.14 (a.k.a MacOS Mojave)
  2. Xcode 10
  3. iOS 12 on your iPhone/iPad.
  4. Download pre-trained Inception-v3 model from Apple’s developer website – https://developer.apple.com/machine-learning/
  5. (Optional) Follow my previous blog to setup camera in your app – Computer Vision in iOS – Core Camera

Once you have satisfied all the above requirements, let us move on to adding the Machine Learning model to our app.

  • First of all, create a new Xcode 'Single View App' project, select 'Swift' as the language, set your project name, and wait for Xcode to create the project.
  • In this particular project, I am moving from my traditional CameraBuffer pipeline to a newer one that runs object recognition asynchronously at a constant 30 FPS. We use this approach to make sure the user won't feel any lag in the system (hence, a better user experience!). First, add a new Swift file named 'PreviewView.swift' and add the following code to it.
import UIKit
import AVFoundation

class PreviewView: UIView {
    var videoPreviewLayer: AVCaptureVideoPreviewLayer {
        return layer as! AVCaptureVideoPreviewLayer
    }

    var session: AVCaptureSession? {
        get {
            return videoPreviewLayer.session
        }
        set {
            videoPreviewLayer.session = newValue
        }
    }

    override class var layerClass: AnyClass {
        return AVCaptureVideoPreviewLayer.self
    }
}
  • Now let us add camera functionality to our app. If you followed my previous blog from the optional pre-requisites, most of the content here will look pretty obvious and easy. First, go to Main.storyboard and add a 'View' as a child object of the existing View.

Screenshot 2018-07-08 at 11.46.23 AM

  • After dragging and dropping it into the existing View, go to 'Show the Identity Inspector' in the right-side inspector of Xcode and, under 'Custom Class', change the class from UIView to 'PreviewView'. If you recall, PreviewView is nothing but the new Swift class we added in one of the previous steps, which subclasses UIView.

Screenshot 2018-07-08 at 11.44.55 AM

  • Make the View full screen, set its content mode to 'Aspect Fill', and add two labels under it as children to visualise the MLModel outputs: one for the predicted class and one for the corresponding confidence. Add IBOutlets for both the View and the labels in the ViewController.swift file.
  • Your current ViewController.swift file should look like this –
import UIKit

class ViewController: UIViewController {

    @IBOutlet weak var previewView: PreviewView!
    @IBOutlet weak var classLabel: UILabel!
    @IBOutlet weak var confidenceLabel: UILabel!

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }

}
  • Let us initialise some parameters for the session. The session should use frames from the camera, start running when the view appears, and stop running when the view disappears. We also need to make sure we have permission to use the camera and, if permission has not been granted, ask for it before the session starts. Hence, we should make the following changes to our code!
import UIKit
import AVFoundation

class ViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {

    @IBOutlet weak var previewView: PreviewView!
    @IBOutlet weak var classLabel: UILabel!
    @IBOutlet weak var confidenceLabel: UILabel!

    // Session - Initialization
    private let session = AVCaptureSession()
    private var isSessionRunning = false
    private let sessionQueue = DispatchQueue(label: "Camera Session Queue", attributes: [], target: nil)
    private var permissionGranted = false

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.

        // Set some features for PreviewView
        self.previewView.videoPreviewLayer.videoGravity = .resizeAspectFill
        self.previewView.session = session

        // Check for permissions
        self.checkPermission()

        // Configure Session in session queue
        self.sessionQueue.async { [unowned self] in
            self.configureSession()
        }
    }

    // Check for camera permissions
    private func checkPermission() {
        switch AVCaptureDevice.authorizationStatus(for: .video) {
        case .authorized:
            self.permissionGranted = true
        case .notDetermined:
            self.requestPermission()
        default:
            self.permissionGranted = false
        }
    }

    // Request permission if not given
    private func requestPermission() {
        sessionQueue.suspend()
        AVCaptureDevice.requestAccess(for: .video) { [unowned self] granted in
            self.permissionGranted = granted
            self.sessionQueue.resume()
        }
    }

    // Start session
    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)

        sessionQueue.async {
            self.session.startRunning()
            self.isSessionRunning = self.session.isRunning
        }
    }

    // Stop session
    override func viewWillDisappear(_ animated: Bool) {
        sessionQueue.async { [unowned self] in
            if self.permissionGranted {
                self.session.stopRunning()
                self.isSessionRunning = self.session.isRunning
            }
        }
        super.viewWillDisappear(animated)
    }

    // Configure session properties
    private func configureSession() {
        guard permissionGranted else { return }

        self.session.beginConfiguration()
        self.session.sessionPreset = .hd1280x720

        guard let captureDevice = AVCaptureDevice.default(.builtInWideAngleCamera, for: AVMediaType.video, position: .back) else { return }
        guard let captureDeviceInput = try? AVCaptureDeviceInput(device: captureDevice) else { return }
        guard self.session.canAddInput(captureDeviceInput) else { return }
        self.session.addInput(captureDeviceInput)

        let videoOutput = AVCaptureVideoDataOutput()

        videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "sample buffer"))
        videoOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String : kCVPixelFormatType_32BGRA]
        videoOutput.alwaysDiscardsLateVideoFrames = true
        guard self.session.canAddOutput(videoOutput) else { return }
        self.session.addOutput(videoOutput)

        self.session.commitConfiguration()
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }

    // Do per-image frame executions here
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    // TODO: Do ML Here

    }

}

  • Don't forget to add the 'Privacy – Camera Usage Description' key in Info.plist and run the app on your device. The app should show camera frames on screen with just 3% CPU usage 😉 Not bad! Now, let us add the Inception v3 model to our app.
  • If you didn't download the Inception v3 model yet, download it from the link provided above. At this point, you should have a file named 'Inceptionv3.mlmodel'.

Screen Shot 2017-06-12 at 11.01.47 AM

  • Drag and drop the 'Inceptionv3.mlmodel' file into your Xcode project. After importing the model, click on it; this is how your '*.mlmodel' file looks in Xcode.

Screenshot 2018-07-08 at 11.38.56 AM

  • What information does the '*.mlmodel' file convey? At the top you can see some metadata such as the file name, its size, the author and license information, and a description of the network. Then come the 'Model Evaluation Parameters', which tell us what the input of the model should be and what the output looks like. Now let us set up our ViewController.swift file to send images into the model for predictions.
  • Apple has made Machine Learning very easy through its CoreML framework. All we have to do is import CoreML (and Vision, since we wrap the model in a VNCoreMLModel) and initialise the model variable from the class that Xcode generates for the '*.mlmodel' file.
import UIKit
import AVFoundation
import CoreML
import Vision

class ViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {

    @IBOutlet weak var previewView: PreviewView!
    @IBOutlet weak var classLabel: UILabel!
    @IBOutlet weak var confidenceLabel: UILabel!

    // Session - Initialization
    private let session = AVCaptureSession()
    private var isSessionRunning = false
    private let sessionQueue = DispatchQueue(label: "session queue", attributes: [], target: nil)
    private var permissionGranted = false

    // Model
    let model = try? VNCoreMLModel(for: Inceptionv3().model)

    override func viewDidLoad() { //...
  • The fun part begins now 🙂 . If we consider every Machine Learning/Deep Learning model as a black box (i.e., we don't know what is happening inside), then all we should care about is: given certain inputs to the black box, are we getting the desired outputs? (PC: Wikipedia). But we can't send just any type of input to the model and expect the desired output. If the model was trained on 1D signals, the input should be reshaped to 1D before being sent into the model; if it was trained on 2D signals (e.g. CNNs), the input should be a 2D signal. In short, the dimensions and size of the input must match the model's input parameters (a small sketch after the figure below shows one way to check this).

Blackbox3D-withGraphs
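
To make this concrete, here is a small sketch (my own illustration, assuming the Xcode-generated Inceptionv3 class and an image input named "image") showing how you can ask the model what input it expects:

import CoreML

// Illustrative sketch: inspect the model's expected input size before feeding frames to it.
// Assumes the Xcode-generated Inceptionv3 class and an image input named "image".
func printExpectedInput() {
    let description = Inceptionv3().model.modelDescription
    if let constraint = description.inputDescriptionsByName["image"]?.imageConstraint {
        print("Model expects a \(constraint.pixelsWide) x \(constraint.pixelsHigh) pixel buffer")
    }
}

For Inception v3 this reports a 299 x 299 colour image, which is why we cannot hand it raw 1280x720 camera frames without resizing or cropping them first.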

  • Luckily, for models that take images as input, Apple's Vision API has made things very easy. To pass images from the camera stream into the model, we make the following edits to the captureOutput function at the end of the ViewController.swift file.
    // Do per-image-frame executions here!!!
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        // TODO: Do ML Here
        guard let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        let request = VNCoreMLRequest(model: model!) {
            (finishedReq, err) in
            guard let results = finishedReq.results as? [VNClassificationObservation] else { return }
            guard let firstObservation = results.first else { return }

            DispatchQueue.main.async {
                self.classLabel.text = firstObservation.identifier
                self.confidenceLabel.text = NSString(format: "%.4f", firstObservation.confidence) as String
            }
        }
        try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])

    }
  • What is actually happening in the above code block? Every frame captured by the camera arrives at the above function as a CMSampleBuffer. We first extract its CVPixelBuffer and pass it to a Vision request through a VNImageRequestHandler. The VNCoreMLRequest's completion handler fetches the outputs and displays them on the screen, and the request itself takes care of all the image transformations that need to be applied to the input before it is passed into the model (such a life saver when you don't want to juggle image formats and image-processing ops and would rather focus on model performance 😎 ). A minimal sketch of how to tweak that resize/crop behaviour follows below.
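
One small knob worth knowing about here: VNCoreMLRequest lets you choose how the full (non-square) camera frame is fitted to the model's square input. This is my own addition, not something the app above changes, but a minimal sketch looks like this:

import Vision

// Illustrative sketch: configure how Vision scales/crops frames before inference.
func makeClassificationRequest(model: VNCoreMLModel,
                               completionHandler: @escaping VNRequestCompletionHandler) -> VNCoreMLRequest {
    let request = VNCoreMLRequest(model: model, completionHandler: completionHandler)
    // .centerCrop (the default) crops away the long sides of the frame;
    // .scaleFill stretches the whole frame into the model's input size instead.
    request.imageCropAndScaleOption = .centerCrop
    return request
}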
  • Here are some results of the app running on iPhone 7.

  • The results look convincing, but I should not judge them since the network was not trained by me. What I care about is the performance of the app on the phone! With the current implementation of the pipeline, profiling shows the CPU usage of the app staying below 30%. Thanks to CoreML, the whole Deep Learning computation has been moved to the GPU; the CPU's only tasks are basic image processing, handing the image to the GPU, and fetching predictions back. There is still a lot of scope to improve the coding style of the app, and I welcome any suggestions/advice. 🙂

Source Code:

If you like this blog and want to play with the app, the code for this app is available here – iOS-CoreML-Inceptionv3.

Wanna say thanks?

Like this blog? Found it useful and feel that you learnt something by the end? Feel free to buy me a coffee 🙂 A lot of these blogs wouldn't have been completed without the caffeine in my veins 😎


Computer Vision in iOS – Swift+OpenCV

Hello all, I realised that it has been quite a while since I posted my last blog – Computer Vision in iOS – Core Camera. In that blog, I discussed how we can set up the camera in our app without using OpenCV. Since the app was designed in Swift 3, it is very easy for budding iOS developers to understand what is going on in the code. I thought of going a step further and designing some basic image processing algorithms from scratch. After designing a few, I realised that it is quite hard to explain even a simple RGB to grayscale conversion without scaring the readers. So I took a few steps back and decided to integrate OpenCV into the Swift version of our Computer Vision app, in the hope that it helps readers rapidly prototype proofs-of-concept. Many people have already written about how to integrate OpenCV into Swift-based apps; the main purpose of this post is to introduce you to the data structure of an image and to explain why we implement certain things the way we do.

Before starting this blog, it is advised that you read this blog on setting up Core Camera using Swift.

  • Start by creating a new Xcode Project, select Single View Application. Name your project and organisation, set language as Swift.
  • To remove some UI/UX constraints (most real-time vision apps fix themselves to either Portrait or Landscape Left/Right orientation throughout their usage), go to General -> Deployment Info and uncheck all unnecessary orientations for the app.

Screen Shot 2017-06-04 at 1.25.13 PM

  • Go to Main.storyboard and add the Image View to your app by drag-and-drop from the following menu to the storyboard.

Screen Shot 2017-06-04 at 1.29.39 PM

  • Go to “Show the Size Inspector” on the top-right corner and make the following changes.

Screen Shot 2017-06-04 at 1.35.40 PM

  • Now add some constraints to the Image View.

Screen Shot 2017-06-04 at 1.37.37 PM

  • After the above settings, you can observe that the Image View fills the whole screen on the app. Now go to ‘Show the attributes inspector’ on the top right corner and change ‘Content Mode’ from Scale To Fill to ‘Aspect Fill’.

Screen Shot 2017-06-04 at 1.40.18 PM

  • Now add an IBOutlet for the ImageView in the ViewController.swift file. Also add a new Swift file named 'CameraBuffer.swift', copy-paste the code shown in the previous blog, and change your ViewController.swift file as shown there too. If you now run the app, you will see a portrait-mode camera app running at ~30 FPS. (Note: don't forget to add the camera usage permission in Info.plist.)

  • Let us dive into adding OpenCV to our app. First, add the OpenCV framework to the project. If you have been following my blogs from the start, this should be easy for you.
  • Let us get into some theoretical discussion. (Disclaimer: it is totally fine to skip this bullet point if you only want the app working.) What is an image? From a signals-and-systems perspective, an image is a 2D discrete signal where each pixel holds a value between 0 and 255 representing a specific gray level (0 is black and 255 is white). To understand this better, refer to the picture shown below (PC: Link). Now you might be wondering what adds colour to the image if each pixel stores only a gray value. In any documentation online you will see that a colour image is referred to as an RGB or RGBA image. The R, G and B in an RGB image refer to the Red, Green and Blue channels, where each channel is a 2D grayscale signal with values between 0 and 255. The A channel in an RGBA image is the alpha channel, i.e. the opacity of each pixel. In OpenCV, an image is generally represented as a matrix in BGR or BGRA format. In our code, we get access to every frame captured by the camera as a UIImage. Hence, to do any image processing on these frames, we have to convert them from UIImage to cv::Mat, do the processing we need, and convert them back to UIImage to display on screen.

lincoln_pixel_values


  • Add a new file -> 'Cocoa Touch Class', name it 'OpenCVWrapper' and set the language to Objective-C. Click Next and select Create. When prompted to create a bridging header, click the 'Create Bridging Header' button. You will now see three files created: OpenCVWrapper.h, OpenCVWrapper.m, and -Bridging-Header.h. Open '-Bridging-Header.h' and add the following line: #import "OpenCVWrapper.h"
  • Go to the 'OpenCVWrapper.h' file and add the following lines of code. In this tutorial, let us do a simple RGB to grayscale conversion.
#import <Foundation/Foundation.h>
#import <UIKit/UIKit.h>

@interface OpenCVWrapper : NSObject

- (UIImage *) makeGray: (UIImage *) image;

@end

  • Rename OpenCVWrapper.m to “OpenCVWrapper.mm” for C++ support and add the following code.
#import "OpenCVWrapper.h"

// import necessary headers
#import <opencv2/core.hpp>
#import <opencv2/imgcodecs/ios.h>
#import <opencv2/imgproc/imgproc.hpp>

using namespace cv;

@implementation OpenCVWrapper

- (UIImage *) makeGray: (UIImage *) image {
    // Convert UIImage to cv::Mat
    Mat inputImage; UIImageToMat(image, inputImage);
    // If input image has only one channel, then return image.
    if (inputImage.channels() == 1) return image;
    // UIImageToMat gives an RGBA Mat; convert it to grayscale.
    Mat gray; cvtColor(inputImage, gray, CV_RGBA2GRAY);
    // Convert the GrayScale OpenCV Mat to UIImage and return it.
    return MatToUIImage(gray);
}

@end
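
If you are curious what cvtColor actually computes here, the grayscale conversion is just a weighted sum of the colour channels (the standard BT.601 weights that OpenCV uses). Purely for illustration, here is the same per-pixel math as a minimal Swift sketch:

// Illustrative sketch of per-pixel grayscale conversion.
// OpenCV's RGB-to-gray conversion uses the same weighting: Y = 0.299R + 0.587G + 0.114B.
func grayValue(r: UInt8, g: UInt8, b: UInt8) -> UInt8 {
    let y = 0.299 * Double(r) + 0.587 * Double(g) + 0.114 * Double(b)
    return UInt8(min(max(y.rounded(), 0), 255))
}

// Example: a pure green pixel (0, 255, 0) maps to a mid-bright gray of about 150.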

  • Now make some final changes to ViewController.swift to see the grayscale image on screen.
import UIKit

class ViewController: UIViewController, CameraBufferDelegate {

    var cameraBuffer: CameraBuffer!
    let opencvWrapper = OpenCVWrapper();
    @IBOutlet weak var imageView: UIImageView!

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.
        cameraBuffer = CameraBuffer()
        cameraBuffer.delegate = self
    }

    func captured(image: UIImage) {
        imageView.image = opencvWrapper.makeGray(image)
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }

}
  • Here are the final screenshots of the working app. Hope you enjoyed this blog post. 🙂

Computer Vision in iOS – Core Camera

Computer Vision on mobile is fun! Here are a few reasons why I personally love computer vision on mobile compared to traditional desktop-based systems.

  1. You do not have to buy a webcam or a high-resolution camera that must be connected to a computer through a USB cable.
  2. Since a webcam is generally connected through a USB cable, the application you are designing can only be tested inside the circumference of the circle whose radius == length of the cable 😛 .
  3. If you want your system to be portable, you might have to buy a Raspberry Pi or Arduino and connect your webcam to it to do some processing on the frames it fetches. (My roommates & besties during my bachelors have done some extensive coding on microprocessors and microcontrollers, and I know exactly how hard it is.)
  4. If I want to skip the step above and still make my system portable, I literally have to carry the CPU with me 😛

While going through the disadvantages of running CV algorithms on traditional desktop systems, you might already be inferring the advantages of mobile-based pipelines. Mobiles are easily portable, fully equipped with a CPU, GPU and various DSP modules that can be utilised depending on the application, and they come with a high-resolution camera 😉 The only disadvantage with current mobile computer vision is that you can't directly take an algorithm that runs nearly real-time on a computer, drop it onto a mobile, and expect the same results. Optimisation plays a key role in mobile computer vision. The mobile battery is limited, so the energy usage of your algorithm matters! If you are designing a heavy CV-based system, you can't schedule all the operations on the CPU; you may need new strategies to reduce CPU usage!

Halting the discussion I started for no specific reason 🙄 , let us get into the topic this blog is actually dedicated to 😀 .

In this blog, I will design an application in Swift and initialise the camera without using OpenCV. The main idea is inspired by the following article by Boris Ohayon. Here I am building on his idea and customising it for the applications that I will be designing in the future. At any point, if you are clueless about the camera pipeline, you can read that article (link provided above) and then follow along with this tutorial.

  • Without wasting any more time create a new ‘Single View Application’ with your desired product name and set the language as ‘Swift’.
  • Add an Image View in the Main.storyboard and reference it in ViewController.swift.
  • Create a new file named CameraBuffer.swift and add the following code
import UIKit
import AVFoundation

protocol CameraBufferDelegate: class {
    func captured(image: UIImage)
}

class CameraBuffer: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    // Initialise some variables
    private var permissionGranted = false
    private let sessionQueue = DispatchQueue(label: "session queue")

    private var position = AVCaptureDevicePosition.back
    private let quality = AVCaptureSessionPreset640x480
    private let captureSession = AVCaptureSession()
    private let context = CIContext()

    weak var delegate: CameraBufferDelegate?

    override init() {
        super.init()
        checkPermission()
        sessionQueue.async { [unowned self] in
            self.configureSession()
            self.captureSession.startRunning()
        }
    }

    private func checkPermission() {
        switch AVCaptureDevice.authorizationStatus(forMediaType: AVMediaTypeVideo) {
        case .authorized:
            permissionGranted = true
        case .notDetermined:
            requestPermission()
        default:
            permissionGranted = false
        }
    }

    private func requestPermission() {
        sessionQueue.suspend()
        AVCaptureDevice.requestAccess(forMediaType: AVMediaTypeVideo) { [unowned self] granted in
            self.permissionGranted = granted
            self.sessionQueue.resume()
        }
    }

    private func configureSession() {
        guard permissionGranted else { return }
        captureSession.sessionPreset = quality
        guard let captureDevice = selectCaptureDevice() else { return }
        guard let captureDeviceInput = try? AVCaptureDeviceInput(device: captureDevice) else { return }
        guard captureSession.canAddInput(captureDeviceInput) else { return }
        captureSession.addInput(captureDeviceInput)

        do {
            var finalFormat = AVCaptureDeviceFormat()
            var maxFps: Double = 0
            let maxFpsDesired: Double = 0 //Set it at own risk of CPU Usage
            for vFormat in captureDevice.formats {
                var ranges      = (vFormat as AnyObject).videoSupportedFrameRateRanges as!  [AVFrameRateRange]
                let frameRates  = ranges[0]
                
                if frameRates.maxFrameRate >= maxFps && frameRates.maxFrameRate <= maxFpsDesired {
                    maxFps = frameRates.maxFrameRate
                    finalFormat = vFormat as! AVCaptureDeviceFormat
                }
            }
            if maxFps != 0 {
                let timeValue = Int64(1200.0 / maxFps)
                let timeScale: Int32 = 1200
                try captureDevice.lockForConfiguration()
                captureDevice.activeFormat = finalFormat
                captureDevice.activeVideoMinFrameDuration = CMTimeMake(timeValue, timeScale)
                captureDevice.activeVideoMaxFrameDuration = CMTimeMake(timeValue, timeScale)
                captureDevice.focusMode = AVCaptureFocusMode.autoFocus
                captureDevice.unlockForConfiguration()
            }
            print(maxFps)
        }
        catch {
            print("Something was wrong")
        }
        
        let videoOutput = AVCaptureVideoDataOutput()
        videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "sample buffer"))
        guard captureSession.canAddOutput(videoOutput) else { return }
        captureSession.addOutput(videoOutput)
        guard let connection = videoOutput.connection(withMediaType: AVFoundation.AVMediaTypeVideo) else { return }
        guard connection.isVideoOrientationSupported else { return }
        guard connection.isVideoMirroringSupported else { return }
        connection.videoOrientation = .portrait
        connection.isVideoMirrored = position == .front
    }
    
    private func selectCaptureDevice() -> AVCaptureDevice? {
        return AVCaptureDevice.defaultDevice(withDeviceType: .builtInWideAngleCamera, mediaType: AVMediaTypeVideo, position: position)
    }
    
    private func imageFromSampleBuffer(sampleBuffer: CMSampleBuffer) -> UIImage? {
        guard let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return nil }
        let ciImage = CIImage(cvPixelBuffer: imageBuffer)
        guard let cgImage = context.createCGImage(ciImage, from: ciImage.extent) else { return nil }
        return UIImage(cgImage: cgImage)
    }
    
    func captureOutput(_ captureOutput: AVCaptureOutput!, didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, from connection: AVCaptureConnection!) {
        guard let uiImage = imageFromSampleBuffer(sampleBuffer: sampleBuffer) else { return }
        DispatchQueue.main.async { [unowned self] in
            self.delegate?.captured(image: uiImage)
        }
    }
}
  • And the ViewController.swift file should look like this:
import UIKit

class ViewController: UIViewController, CameraBufferDelegate {

    var cameraBuffer: CameraBuffer!

    @IBOutlet weak var imageView: UIImageView!

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.
        cameraBuffer = CameraBuffer()
        cameraBuffer.delegate = self
    }

    func captured(image: UIImage) {
        imageView.image = image
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }

}
  • After this, you can build and run the app on your mobile and see that it works like a charm.
  • So what new thing(s) did I implement in the above code? I converted the code to Swift 3.0, added a block through which you can raise the FPS from 30 to as high as 240, and did rigorous tests to make sure the camera pipeline never goes beyond 10% CPU usage on the iPhone for any realistic application.
  • If your application needs a higher FPS, you can set it by changing the variable 'maxFpsDesired'. But change it only if you need an FPS greater than 30. By default, the FPS fluctuates between 24 and 30; if you force the FPS to a fixed number, it won't be exactly the number you set, and the CPU usage increases drastically. But if the application you want to try doesn't have any other costly computations, you can play with a higher FPS.
  • How do you count the FPS of your app? You can go fancy and code an FPS counter into the app (a minimal sketch follows below), but I would suggest running the app in profiling mode and choosing 'Core Animation' in Instruments to check the FPS 😉
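
In case you do want an in-app counter anyway, here is a minimal sketch of one (my own illustration, not part of the original project). Call tick() once from captured(image:) and display fps wherever you like:

import QuartzCore

// Illustrative FPS counter: count delegate callbacks and refresh the reading about once per second.
class FPSCounter {
    private var lastTimestamp = CACurrentMediaTime()
    private var frameCount = 0
    private(set) var fps: Double = 0

    func tick() {
        frameCount += 1
        let now = CACurrentMediaTime()
        let elapsed = now - lastTimestamp
        if elapsed >= 1.0 {
            fps = Double(frameCount) / elapsed
            frameCount = 0
            lastTimestamp = now
        }
    }
}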