# What is Object Detection?

In the past, I wrote a blog post on ‘Object Recognition’ and how to implement it in real-time on an iPhone (Computer Vision in iOS – Object Recognition). So what is this blog about? In this blog, I will be discussing Object Detection.

In an object recognition problem, a deep neural network is trained to recognise which objects are present in an image. In object detection, the network not only recognises which objects are present in the image, but also detects where they are located. If object detection can be applied in real time, it becomes a useful building block for many problems in autonomous driving.

In this blog, let us take a sneak peek into how we can use Apple’s CoreML to implement an Object Detection app on an iPhone 7. We will start by becoming familiar with coremltools. This might be a little confusing at first, but follow through and you shall reap the reward.

Apple has done an amazing job in giving us a tool that easily converts a model from the library of our choice into a CoreML model. So, before starting this blog, I would like to thank the CoreML team for turning a task that I once felt would take a year into a weekend project.

The theme for this blog is an object detection pipeline that runs in real time (on iPhone 7) and can be embedded into a driving application. The objects that might appear in front of my car include cars, pedestrians, trucks, etc. Among them, I only want to detect cars for today. Thanks to Udacity’s Self Driving Car Nanodegree, some students of the program have open-sourced their amazing projects on GitHub.

Object Detection has attracted a lot of attention in deep-learning research, and as a result some amazing papers have been published on this topic. While some papers focus solely on accuracy, others focus on real-time performance. If you want to explore this area further, you can read the papers on: R-CNN, Fast R-CNN, Faster R-CNN, YOLO, YOLO-9000, SSD, MobileNet SSD. 🙂

# What is the best network?

I have mentioned nearly 7 different networks above! Which of them is the best one to implement? 🤔 Huh! It is a very difficult question, especially when you don’t want to waste your time implementing every network by brute force and checking the results. So, let us do some analysis to select the best network to implement. What actually happens in an object detection pipeline?

- **Pre-processing:** Fetch a frame from the camera and do some image processing (scale and resize) before sending it into the network.
- **Processing:** This is the inference stage. We pass the pre-processed image into the CoreML model and fetch the results generated by it.
- **Post-processing:** The results generated by the CoreML model are in MLMultiArray format, and we need to do some processing on that array of doubles to get the bounding box location and its class prediction + accuracy.

When I am targeting a mobile phone, I should concentrate on real-time networks. So, I can stop considering all those networks that do **not** run at a decent FPS on a GPU-equipped machine.

### Rule 1:

If a network can run in real time on a computer (GPU or CPU), then it is worth giving it a shot.

This rule strikes out R-CNN and Fast R-CNN from the above list. Though there are 5 networks left, they can be broadly classified into Faster R-CNN, YOLO and SSD. Both YOLO and SSD showed better performance than Faster R-CNN (check their papers for the run-times of those models). Now we are left with two major options: YOLO and SSD.

I started this object detection project when Apple’s coremltools v0.3.0 was the current release. There was no extensive documentation, and for neural networks it supported only Keras 1.2.2 (with TensorFlow v1.0 & v1.1 as backend) and Caffe v1.0. The CoreML team is constantly releasing new updates of coremltools, and it is currently at v0.4.0 with Keras 2.0.4 support. So, I should wisely choose a network with simple layers (only convolutions) and no fancy operations such as deconvolutions, dilated convolutions and depth-wise convolutions. In this way, YOLO won over my personal favourite – SSD.

# YOLO – You Only Look Once:

In this blog, I will be implementing Tiny YOLO v1, and I am keeping the YOLO v2 implementation for some other time in the future. Let us familiarise ourselves with the network that we are going to use 😉 Tiny YOLO v1 consists of 9 convolutional layers followed by 3 fully connected layers, summing to ~45 million parameters.

It is quite big compared to Tiny YOLO v2, which has only ~15 million parameters. The input to this network is a 448 x 448 x 3 RGB image and the output is a vector of length 1470. The vector is divided into three parts: probability, confidence and box coordinates. Each of these three parts is again divided into 49 small regions, which correspond to the predictions at each cell of the image.
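To make the layout of that 1470-vector concrete, here is a minimal NumPy sketch of how it could be split into its three parts. The values `S = 7`, `B = 2` and `C = 20` (grid size, boxes per cell, number of classes) are my assumptions, chosen because 7 x 7 x (20 + 2 + 2 x 4) = 1470; the function name `split_yolo_output` is hypothetical and is not part of the project code.

```python
import numpy as np

# Assumed YOLO v1 settings: 7x7 grid, 2 boxes per cell, 20 classes
S, B, C = 7, 2, 20

def split_yolo_output(vec):
    """Split a flat 1470-vector into class probabilities, confidences and boxes."""
    vec = np.asarray(vec)
    n_prob = S * S * C   # 980 class probabilities
    n_conf = S * S * B   # 98 box confidences
    assert vec.size == n_prob + n_conf + S * S * B * 4
    probs = vec[:n_prob].reshape(S * S, C)
    confs = vec[n_prob:n_prob + n_conf].reshape(S * S, B)
    boxes = vec[n_prob + n_conf:].reshape(S * S, B, 4)
    return probs, confs, boxes

# A dummy vector stands in for a real network output
probs, confs, boxes = split_yolo_output(np.arange(1470, dtype=np.float32))
print(probs.shape, confs.shape, boxes.shape)  # (49, 20) (49, 2) (49, 2, 4)
```

Each of the 49 rows corresponds to one grid cell, which matches the "49 small regions" described above.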

Enough theory, now let us get our hands dirty with some coding.

**Pre-requisites:** If you are reading this blog for the first time, please visit my previous blog – Computer Vision in iOS – CoreML+Keras+MNIST – to set up a working environment on your Mac. As training the YOLO network takes a lot of time and effort, we are going to use pre-trained network weights to build the CoreML Tiny YOLO v1 model. You can download the weights from the following **link**.

After you have downloaded the weights, create a master directory with a name of your choice, and move the downloaded weights file into that directory.

- Let us import some necessary libraries.

    # Import necessary libraries
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    import cv2
    import keras
    from keras.models import Sequential
    from keras.layers.convolutional import Conv2D, MaxPooling2D
    from keras.layers.advanced_activations import LeakyReLU
    from keras.layers.core import Flatten, Dense, Activation, Reshape

- Define the Tiny YOLO v1 model.

    def yolo(shape):
        model = Sequential()
        model.add(Conv2D(16,(3,3),strides=(1,1),input_shape=shape,padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Conv2D(32,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(64,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(128,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(256,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(512,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(1024,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Conv2D(1024,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Conv2D(1024,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Flatten())
        model.add(Dense(256))
        model.add(Dense(4096))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Dense(1470))
        return model
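As a sanity check on the ~45 million figure quoted earlier, the parameter count can be worked out by hand from the layer sizes in the model definition above. This is a small arithmetic sketch in plain Python; the variable names are mine, and it assumes every convolution and dense layer carries a bias term (the Keras default).

```python
# (in_channels, out_channels) for the nine 3x3 convolutions
conv_channels = [(3, 16), (16, 32), (32, 64), (64, 128), (128, 256),
                 (256, 512), (512, 1024), (1024, 1024), (1024, 1024)]
conv_params = sum(cin * cout * 3 * 3 + cout for cin, cout in conv_channels)

# Six 2x2 max-pools shrink 448 down to 7, so Flatten yields 7 * 7 * 1024 values
side = 448 // 2 ** 6
flat = side * side * 1024

# (inputs, outputs) for the three fully connected layers
dense_sizes = [(flat, 256), (256, 4096), (4096, 1470)]
dense_params = sum(nin * nout + nout for nin, nout in dense_sizes)

total = conv_params + dense_params
print(flat)   # 50176
print(total)  # 45089374, i.e. roughly 45 million
```

The value 50176 is also the input size that the fully connected part of the network expects after flattening.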

- Now let us write a helper function to load the weights from the ‘yolo-tiny.weights’ file into the model.

    # Helper function to load weights from weights-file into YOLO model
    def load_weights(model,yolo_weight_file):
        data = np.fromfile(yolo_weight_file,np.float32)
        data=data[4:]
        index = 0
        for layer in model.layers:
            shape = [w.shape for w in layer.get_weights()]
            if shape != []:
                kshape,bshape = shape
                bia = data[index:index+np.prod(bshape)].reshape(bshape)
                index += np.prod(bshape)
                ker = data[index:index+np.prod(kshape)].reshape(kshape)
                index += np.prod(kshape)
                layer.set_weights([ker,bia])
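The helper above skips the first four values of the file and then, for every layer with weights, reads the bias before the kernel. Here is a toy sketch of that reading order using a synthetic file. The layer shapes, the file name `toy.weights` and the zero-filled header are made up for illustration and say nothing about the contents of the real weights file.

```python
import os
import tempfile
import numpy as np

# Hypothetical toy layer: kernel of shape (2, 3) and bias of shape (3,)
kshape, bshape = (2, 3), (3,)
header = np.zeros(4, dtype=np.float32)  # four leading values that get skipped
bias = np.array([0.1, 0.2, 0.3], dtype=np.float32)
kernel = np.arange(6, dtype=np.float32)

# Write header, then bias, then kernel, as one flat float32 stream
path = os.path.join(tempfile.mkdtemp(), 'toy.weights')
np.concatenate([header, bias, kernel]).tofile(path)

# Read it back the same way load_weights does
data = np.fromfile(path, np.float32)[4:]
index = 0
n_bias = int(np.prod(bshape))
bia = data[index:index + n_bias].reshape(bshape)
index += n_bias
n_ker = int(np.prod(kshape))
ker = data[index:index + n_ker].reshape(kshape)
index += n_ker

print(bia, ker.shape, index == data.size)
```

If `index` does not land exactly on `data.size` after the last layer, the model definition and the weights file do not agree.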

- The pre-trained weights we are using come from a model that follows Theano’s image dimension ordering. So, we have to set the Theano ordering first and then load the weights into the model.

    # Load the initial model
    keras.backend.set_image_dim_ordering('th')
    shape = (3,448,448)
    model = yolo(shape)
    print("Theano model summary:")
    model.summary()  # prints the summary itself and returns None
    load_weights(model,'./yolo-tiny.weights')

# Layer dimensions:

I mentioned above that the model follows Theano’s image dimension ordering. If you wonder what this is all about, then let me introduce you to 3D visualisations! A general 1D signal is represented using a vector or 1D array, and its size is nothing but the length of the array/vector. Images are nothing but 2D signals, which are represented as 2D arrays or matrices. The size of the image is given by the width x height of the matrix. When we talk about convolutional layers, we are talking about 3D data structures. The output of a convolutional layer is nothing but a set of 2D matrices stacked behind one another. This new dimension is called the depth of the layer. By analogy, an RGB image is a combination of three 2D matrices placed behind one another (width x height x 3). In images we refer to them as channels, and in convolutional layers this is called depth.

Ok, this makes sense, but why are we discussing image dimension ordering 😐? The two major libraries used for Deep Learning are Theano and TensorFlow. Keras is a wrapper built over both of them and gives us the flexibility to use either library. Apple’s coremltools supports Keras with the TensorFlow backend. The image dimension ordering of TensorFlow puts the depth last (height x width x depth), while that of Theano puts it first (depth x height x width). So, if we want to convert our model from Keras with Theano backend to a CoreML model, we first need to convert it to Keras with TensorFlow backend. Understanding the dimensions of the weight matrices is harder than understanding image dimensions, but transforming those weights from one model to another is taken care of inside Keras 2.0.4.
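To make the two orderings tangible, here is a tiny NumPy sketch with a made-up 2 x 3 image that has 3 channels. It only illustrates the axis order; the array and variable names are hypothetical.

```python
import numpy as np

# A hypothetical image with height 2, width 3 and 3 channels,
# stored with the depth last (TensorFlow style)
img_tf = np.arange(2 * 3 * 3).reshape(2, 3, 3)

# Move the channel axis to the front to get the Theano style layout
img_th = np.transpose(img_tf, (2, 0, 1))

print(img_tf.shape)  # (2, 3, 3): height, width, depth
print(img_th.shape)  # (3, 2, 3): depth, height, width

# The same value is addressed with reordered indices
assert img_tf[1, 2, 0] == img_th[0, 1, 2]
```

The data is identical in both arrays; only the order in which the axes are indexed changes.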

The **real challenge** is that converting this model from the Theano backend to the TensorFlow backend is not straightforward! Fortunately, I found some helper code that can transfer the weights from a Theano layer to a TensorFlow layer. But if we closely observe the model, it is a combination of convolutional layers and fully connected layers. So, before moving into the fully connected layers, we flatten the output of the last convolutional layer. This can become tricky and cannot be automated unless our brain can visualise 3D and 4D dimensions very easily. For simplicity and for easy debugging, let us break this single model into 3 separate chunks.

- **Part 1** – Consists of all the convolutional layers.
- **Part 2** – Consists of the operations required for flattening the output of Part 1.
- **Part 3** – Consists of the fully connected layers applied to the output of Part 2.

Let us first define the models for Part 1 and Part 3.

    # YOLO Part 1
    def yoloP1(shape):
        model = Sequential()
        model.add(Conv2D(16,(3,3),strides=(1,1),input_shape=shape,padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Conv2D(32,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(64,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(128,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(256,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(512,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(1024,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Conv2D(1024,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Conv2D(1024,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        return model

    # YOLO Part 3
    def yoloP3():
        model = Sequential()
        model.add(Dense(256,input_shape=(50176,)))
        model.add(Dense(4096))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Dense(1470))
        return model

- Let us initialise three networks in Keras with TensorFlow as backend. They are: YOLO Part 1 (model_p1), YOLO Part 3 (model_p3), and YOLO Full (model_full). Also, to test whether these networks are working correctly or not, let us initialise the Part 1 and Part 3 models with the Theano backend as well.

    # Let us get Theano backend edition of Yolo Parts 1 and 3
    model_p1_th = yoloP1(shape)
    model_p3_th = yoloP3()

    # Tensorflow backend edition
    keras.backend.set_image_dim_ordering('tf')
    shape = (448,448,3)
    model_p1 = yoloP1(shape)
    model_p3 = yoloP3()
    model_full = yolo(shape)

- In our earlier discussion I mentioned that the dimensions of convolutional layers differ in Theano and TensorFlow. So let us write a program to convert weights from Theano’s ‘model’ to TensorFlow’s ‘model_full’.

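    # Transfer weights from Theano model to TensorFlow model_full
    for th_layer,tf_layer in zip(model.layers,model_full.layers):
        # Note: this check matches layers whose class is named 'Convolution2D';
        # under Keras 2 the class name is 'Conv2D', in which case the else
        # branch copies the weights unchanged.
        if th_layer.__class__.__name__ == 'Convolution2D':
            kernel, bias = th_layer.get_weights()
            # (filters, depth, rows, cols) -> (rows, cols, depth, filters)
            kernel = np.transpose(kernel,(2,3,1,0))
            tf_layer.set_weights([kernel,bias])
        else:
            tf_layer.set_weights(th_layer.get_weights())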

- Before moving into next phase, let us do a simple test to find out whether the outputs of Theano’s model and TensorFlow’s model_full are matching or not. For doing this, let us read an image, pre-process it and pass it into the models to predict the output.

    # Read an image and pre-process it
    im = cv2.imread('test1.jpg')
    plt.imshow(im[:,:,::-1])
    plt.show()
    im = cv2.resize(im,(448,448))
    im = 2*im.astype(np.float32)/255. - 1
    im = np.reshape(im,(1,448,448,3))
    im_th = np.transpose(im,(0,3,1,2))

    # Theano
    output_th = model.predict(im_th)
    # TensorFlow
    output_tf = model_full.predict(im)

    # Distance between two predictions
    print('Difference between two outputs:')
    print('Sum of Difference = {}'.format(np.sum(output_th-output_tf)))
    print('2-norm of difference = {}'.format(np.linalg.norm(output_th-output_tf)))
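A small caveat on the comparison above: a signed sum of differences can cancel out to zero even when two arrays differ, so the 2-norm (or a maximum absolute difference) is the more reliable of the two numbers. Here is a minimal sketch of a stricter check. The helper name `compare_outputs` and the tolerance `atol=1e-4` are my own choices, not something used in the project.

```python
import numpy as np

def compare_outputs(a, b, atol=1e-4):
    """Return (max absolute difference, whether the arrays match within atol)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    max_abs = float(np.max(np.abs(a - b)))
    return max_abs, bool(np.allclose(a, b, atol=atol))

# Hypothetical outputs whose signed differences cancel out
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 3.0])
print(np.sum(x - y))          # 0.0, although the arrays differ
print(compare_outputs(x, y))  # (1.0, False)
```

With real network outputs, tiny floating-point differences are expected, which is why a tolerance is used instead of an exact comparison.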

- By running the above code, I found that for a given image, the outputs of the Theano and TensorFlow models differ a lot. If the outputs matched, the ‘Sum of Difference’ and the ‘2-norm of difference’ would both be equal to 0.
- Since the direct conversion is not helping, let us move to a parts-based model design. First, let us start with Tiny YOLO Part 1.

    # Indices of the convolutional layers in Part 1
    conv_ids = [0,3,6,9,12,15,18,20,22]

    # Theano
    for i in conv_ids:
        model_p1_th.layers[i].set_weights(model.layers[i].get_weights())
    # TensorFlow
    for i in conv_ids:
        model_p1.layers[i].set_weights(model_full.layers[i].get_weights())

    # Theano
    output_th = model_p1_th.predict(im_th)
    # TensorFlow
    output_tf = model_p1.predict(im)

    # Dimensions of output_th and output_tf are different, so apply transpose on output_th
    output_thT = np.transpose(output_th,(0,2,3,1))

    # Distance between two predictions
    print('Difference between two outputs:')
    print('Sum of Difference = {}'.format(np.sum(output_thT-output_tf)))
    print('2-norm of difference = {}'.format(np.linalg.norm(output_thT-output_tf)))

- By running the above code, we can see that the outputs of both models match exactly. So we have successfully completed Part 1 of our model! Now let us move to Part 3. By carefully observing the model summaries of model_p3 and model_p3_th, it is easy to see that both models are similar. Hence, for a given input, both models should give us the same output. But what is the input for these models? Ideally, it should come from YOLO Part 2. But YOLO Part 2 is just a flatten() layer, which means that given any multi-dimensional input, the output will be a serialised 1D vector. Assuming we have serialised the output from model_p1_th, both model_p3 and model_p3_th should give us similar results.

    # Theano
    model_p3_th.layers[0].set_weights(model.layers[25].get_weights())
    model_p3_th.layers[1].set_weights(model.layers[26].get_weights())
    model_p3_th.layers[3].set_weights(model.layers[28].get_weights())
    # TensorFlow
    model_p3.layers[0].set_weights(model_full.layers[25].get_weights())
    model_p3.layers[1].set_weights(model_full.layers[26].get_weights())
    model_p3.layers[3].set_weights(model_full.layers[28].get_weights())

    # Design the input
    input_p3 = np.reshape(np.ndarray.flatten(output_th),(1,50176))

    # Theano
    output_th = model_p3_th.predict(input_p3)
    # TensorFlow
    output_tf = model_p3.predict(input_p3)

    # Distance between two predictions
    print('Difference between two outputs:')
    print('Sum of Difference = {}'.format(np.sum(output_th-output_tf)))
    print('2-norm of difference = {}'.format(np.linalg.norm(output_th-output_tf)))

- By running the above code, we can observe that we get exactly the same results for both model_p3 and model_p3_th.
- Where is YOLO Part 2? Be patient, we are going to design it now 😅. Before designing YOLO Part 2, let us discuss dimensions some more. I have already mentioned above that YOLO Part 2 is nothing but a simple flatten layer. What makes this so hard? If you remember, the whole network was designed with Theano as backend, and we are just using those weights for our model with the TensorFlow backend. To help you understand the operation of the flatten layer, I am adding some code for you to play with. By running the code below, you can find out why our model_full gives weird results when compared to model.

    # Let us build a simple 3(width) x 3(height) x 3(depth) matrix and assume it as an output from Part 1
    A = np.reshape(np.asarray([i for i in range(1,10)]),(3,3))
    B = A + 10
    C = A + 100
    print('A =\n{}\n\nB =\n{}\n\nC =\n{}'.format(A,B,C))
    part1_output_tf = np.dstack((A,B,C))
    print('\n\nTensorFlow\'s model_p1 output (assume) = \n{}'.format(part1_output_tf))
    part1_output_th = np.transpose(part1_output_tf,(2,0,1))
    print('\n\nTheano\'s model_p1_th output (assume) = \n{}'.format(part1_output_th))
    print('\n\nDesired input for model_p3 =\n{}'.format(part1_output_th.flatten()))
    print('\n\nActual input for model_p3 =\n{}'.format(part1_output_tf.flatten()))

- Now we understand that applying the flatten layer is not as easy as expected. There are a few ideas on how we can implement this flatten layer:
  - **Idea 1** – Fetch the output from Part 1 as an MLMultiArray and apply a custom flatten operation on the CPU in the iOS app. Too costly an operation!
  - **Idea 2** – Design a model with a Permute layer + Flatten layer using Keras and convert it to a CoreML model. This can be done and, if it succeeds, everything can be combined into one single model.
  - **Idea 3** – See what coremltools’ Neural Network Builder has to offer and try to implement the flatten layer with it. There is enough documentation to implement the flatten layer, but the three models can’t be combined into one single pipeline with the current documentation support. For each image frame, there would be three fetches of memory from GPU to CPU and three passes from CPU to GPU. Not an effective implementation.

- One interesting thing that I observed with Apple’s CoreML is that, although CoreML supports only Keras with the TensorFlow backend, the output dimensions of an MLMultiArray look similar to the image dimensions used by Theano. That means the MLMultiArray dimensions of YOLO Part 1’s output will be 1024 x 7 x 7 instead of 7 x 7 x 1024. This observation can be used while designing the Permute layer of Part 2.

    # Keras equivalent of YOLO Part 2
    from keras.layers.core import Permute  # Permute is not imported earlier

    def yoloP2():
        model = Sequential()
        model.add(Permute((2,3,1),input_shape=(7,7,1024)))
        model.add(Flatten())
        return model

    model_p2 = yoloP2()
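The idea behind Part 2 (reorder the axes first, then flatten) can be checked outside Keras with a toy tensor. The sketch below is plain NumPy with made-up sizes; it only shows which element ordering the Theano-trained fully connected layers expect, and it says nothing about how Keras or CoreML number the axes inside a Permute layer.

```python
import numpy as np

# Hypothetical Part 1 output in Theano layout: depth=3, height=2, width=2
out_th = np.arange(12).reshape(3, 2, 2)
desired = out_th.flatten()  # the order the dense weights were trained on

# The same data as a TensorFlow-ordered tensor: height, width, depth
out_tf = np.transpose(out_th, (1, 2, 0))

# Flattening directly gives a different element order
naive = out_tf.flatten()

# Moving the depth axis to the front before flattening restores the order
fixed = np.transpose(out_tf, (2, 0, 1)).flatten()

print(desired)  # [ 0  1  2  3  4  5  6  7  8  9 10 11]
print(naive)    # [ 0  4  8  1  5  9  2  6 10  3  7 11]
assert np.array_equal(fixed, desired)
```

This is the same mismatch that the earlier A, B, C example prints, with the reordering step added.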

- With this, we have all the three parts that can be combined to form one complete network. So, let us re-write the Tiny YOLO v1 network.

    def yoloP1P2P3(shape):
        model = Sequential()
        model.add(Conv2D(16,(3,3),strides=(1,1),input_shape=shape,padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Conv2D(32,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(64,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(128,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(256,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(512,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(MaxPooling2D(pool_size=(2, 2),padding='valid'))
        model.add(Conv2D(1024,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Conv2D(1024,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Conv2D(1024,(3,3),padding='same'))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Permute((2,3,1)))
        model.add(Flatten())
        model.add(Dense(256))
        model.add(Dense(4096))
        model.add(LeakyReLU(alpha=0.1))
        model.add(Dense(1470))
        return model

    model_p1p2p3 = yoloP1P2P3(shape)

    # TensorFlow: the convolutional layers keep their indices
    for i in [0,3,6,9,12,15,18,20,22]:
        model_p1p2p3.layers[i].set_weights(model_full.layers[i].get_weights())
    # The dense layers are shifted by one because of the extra Permute layer
    for dst,src in [(26,25),(27,26),(29,28)]:
        model_p1p2p3.layers[dst].set_weights(model_full.layers[src].get_weights())

- If we go back to our conversation on the three tasks that need to be done (**pre-processing, processing, post-processing**), converting the model from Keras to CoreML covers the **processing** part. How are we going to do pre-processing then? The pre-processing of the image consists of fetching the frame from the camera, resizing the image, changing its format to CVPixelBuffer, scaling the intensity values from 0-255 to the range -1 to 1, and passing it into the model. But the scaling of the intensity values can be done directly inside the CoreML model. So, let us include it during our conversion.

    import coremltools

    scale = 2/255.
    coreml_model_p1p2p3 = coremltools.converters.keras.convert(model_p1p2p3,
                                                               input_names = 'image',
                                                               output_names = 'output',
                                                               image_input_names = 'image',
                                                               image_scale = scale,
                                                               red_bias = -1.0,
                                                               green_bias = -1.0,
                                                               blue_bias = -1.0)

    coreml_model_p1p2p3.author = 'Sri Raghu Malireddi'
    coreml_model_p1p2p3.license = 'MIT'
    coreml_model_p1p2p3.short_description = 'Yolo - Object Detection'
    coreml_model_p1p2p3.input_description['image'] = 'Images from camera in CVPixelBuffer'
    coreml_model_p1p2p3.output_description['output'] = 'Output to compute boxes during Post-processing'
    coreml_model_p1p2p3.save('TinyYOLOv1.mlmodel')
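The scale and bias values passed to the converter above are meant to reproduce the Python pre-processing used earlier (`2*im/255. - 1`). Assuming the converter applies them per channel as scale x pixel + bias, a quick NumPy check shows that the two forms agree. The sample pixel values and variable names are hypothetical.

```python
import numpy as np

pixels = np.array([0.0, 127.5, 255.0])

# Pre-processing as done in Python earlier in this post
manual = 2 * pixels / 255. - 1

# The same mapping written as scale * pixel + bias
image_scale, bias = 2 / 255., -1.0
converted = image_scale * pixels + bias

print(manual)  # [-1.  0.  1.]
assert np.allclose(manual, converted)
```

Moving this step into the model means the iOS app only has to resize the frame and hand over a CVPixelBuffer.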

- With this step, our Tiny YOLO v1 model is ready. The general computation of this model runs at an average of 17.8 FPS on iPhone 7. And the output of this network is a vector of size 1470. I adopted some techniques stated in some references cited below and used the power of GCD and Session Queues inside iOS to make the post-processing real-time.

# Source Code & Results:

The whole source code for this project can be found at the following **Github Link**. All the necessary files for converting the model, creating the environment, and a step-by-step tutorial are available. I also provided the iOS app in case you are interested in testing it on your iPhones. Here are some results:

This app wouldn’t have been completed without the following wonderful previous works:

- https://github.com/xslittlegrass/CarND-Vehicle-Detection
- https://github.com/cvjena/darknet
- https://pjreddie.com/darknet/yolo/
- http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf

Though the app is giving decent results at a reasonable speed, there is always room to improve its performance. If you have any suggestions, please feel free to comment your thoughts. 🙂