Computer Vision in iOS – Object Recognition

Problem Statement: Given an image, can a machine accurately predict what is in it?

Why is this so hard? If I show an image to a person and ask what is in it, they can tell me exactly which objects are present, where the picture was taken, what is special about it, and (if people appear in it) what they are doing and what they are likely to do next. For a computer, a picture is nothing but a bunch of numbers, so it cannot grasp the semantics the way a human does. If the question – why is it so hard? – is still ringing in your head, let me ask you to write an algorithm that detects (just) a cat!

Let us start with some basic assumptions – every cat has two ears, an oval face with whiskers on it, a cylindrical body, four legs and a curvy tail! Perfect 🙂 We have our initial assumptions, so we can start writing code. Assume we have written the code (say, 50 lines of if-else statements) to find primitives in an image which, when combined, form a cat that looks roughly like the figure below (PS: Don’t laugh 😛 )

[Figure: a cat sketched out of primitive shapes]

Ok, let us test the performance on some real-world images. Can our algorithm accurately predict the cat in this picture?

[Image: a photo of a tabby cat]

If you think the answer is yes, I would suggest you think again. If you look carefully at the primitive-shape cat, we have actually coded a detector for a cat turned towards its left. Ok! No worries! Write the exact same if-else conditions for a cat turned towards its right 😎 . Just an extra 50 lines of conditions. Good! Now we have our cat detector! Can we detect the cat in this image? 😛

[Image: another cat photo]

Well, the answer is no 😦 . So, to tackle these types of problems we move from hand-written conditionals to Machine Learning/Deep Learning. Machine Learning is a field where machines learn to perform specific tasks that previously only humans could do. Deep Learning is a subset of Machine Learning in which we train very deep neural network architectures. Researchers have already tackled this problem, and there are several popular neural network architectures built for exactly this task.

The real problem lies in porting such a network to a mobile device and making it run in real time. This is not an easy task: the convolutions in a CNN are costly, and the sheer size of the network does not help either (forget about it 😛 ). Companies like Google and Apple, along with a few research labs, have put heavy focus on optimizing the size and performance of neural networks, and we finally have networks running at decent speed on mobile phones. Still, there is a lot of amazing research left to be done in this field. After Apple’s WWDC ’17 keynote, building an app for this particular problem has gone from a year-long effort to a single-night effort. Enough theory and facts, let us dive into the code!

To follow this blog from here, you need to have the following things ready:

  1. macOS 10.13 (a.k.a. macOS High Sierra)
  2. Xcode 9
  3. iOS 11 on your iPhone/iPad.
  4. Download the pre-trained Inception v3 model from Apple’s developer website – https://developer.apple.com/machine-learning/
  5. (Optional) Follow my previous blog to setup camera in your app – Computer Vision in iOS – Core Camera

Once you have satisfied all the above requirements, let us move on to adding the Machine Learning model to our app.

  • First of all, create a new Xcode ‘Single View App’ project, select ‘Swift’ as the language, set your project name and wait for Xcode to create the project. Then go to the app’s Build Settings and change Swift Compiler – Language – Swift Language Version from Swift 4 to Swift 3.2 (the AVFoundation snippets below use the pre-Swift-4 constant names such as AVMediaTypeVideo, so they compile cleanly under Swift 3.2).

[Screenshot: Build Settings – Swift Language Version set to Swift 3.2]

  • In this particular project, I am moving from my traditional CameraBuffer pipeline to a newer one so that object detection runs asynchronously at a constant 30 FPS. This makes sure the user never feels any lag in the system (hence, a better user experience!). First add a new Swift file named ‘PreviewView.swift’ and put the following code in it.
import UIKit
import AVFoundation

class PreviewView: UIView {
    var videoPreviewLayer: AVCaptureVideoPreviewLayer {
        return layer as! AVCaptureVideoPreviewLayer
    }

    var session: AVCaptureSession? {
        get {
            return videoPreviewLayer.session
        }
        set {
            videoPreviewLayer.session = newValue
        }
    }

    override class var layerClass: AnyClass {
        return AVCaptureVideoPreviewLayer.self
    }
}
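  • A quick note on why this tiny class works: overriding layerClass tells UIKit to back this view with an AVCaptureVideoPreviewLayer instead of a plain CALayer, so assigning a capture session to the view (via the session property above) is enough to get the camera feed rendered, with no manual layer management on our side.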
  • Now let us add camera functionality to our app. If you followed my previous blog from the optional prerequisite, most of the content here will look pretty obvious and easy. First go to Main.storyboard and add a ‘View’ as a child of the existing View.

[Screenshot: dragging a View onto the existing View in Main.storyboard]

  • After dragging and dropping it onto the existing View, open the Identity Inspector on the right side of Xcode and, under ‘Custom Class’, change the class from UIView to ‘PreviewView’. If you recall, PreviewView is just the new Swift file we added a few steps ago – a UIView subclass backed by an AVCaptureVideoPreviewLayer.

[Screenshot: Identity Inspector – Custom Class set to PreviewView]

  • Make the View full-screen, set its content mode to ‘Aspect Fill’, and add a Label under it as a child to display the predicted classes. Add IBOutlets for both the view and the label in the ViewController.swift file.
  • Your current ViewController.swift file should look like this –
import UIKit

class ViewController: UIViewController {

    @IBOutlet weak var previewView: PreviewView!
    @IBOutlet weak var predictLabel: UILabel!

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }

}
  • Let us initialise some parameters for the session. The session should take frames from the camera, start running when the view appears and stop running when the view disappears. We also need to make sure that we have permission to use the camera; if permission has not been given yet, we should ask for it before the session starts. Hence, we make the following changes to our code!
import UIKit
import AVFoundation

class ViewController: UIViewController {

    @IBOutlet weak var previewView: PreviewView!
    @IBOutlet weak var predictLabel: UILabel!

    // Session - Initialization
    private let session = AVCaptureSession()
    private var isSessionRunning = false
    private let sessionQueue = DispatchQueue(label: "session queue", attributes: [], target: nil)
    private var permissionGranted = false

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.

        // Set some features for PreviewView
        previewView.videoPreviewLayer.videoGravity = AVLayerVideoGravityResizeAspectFill
        previewView.session = session

        // Check for permissions
        checkPermission()

        // Configure Session in session queue
        sessionQueue.async { [unowned self] in
            self.configureSession()
        }
    }

    // Check for camera permissions
    private func checkPermission() {
        switch AVCaptureDevice.authorizationStatus(forMediaType: AVMediaTypeVideo) {
        case .authorized:
            permissionGranted = true
        case .notDetermined:
            requestPermission()
        default:
            permissionGranted = false
        }
    }

    // Request permission if not given
    private func requestPermission() {
        sessionQueue.suspend()
        AVCaptureDevice.requestAccess(forMediaType: AVMediaTypeVideo) { [unowned self] granted in
            self.permissionGranted = granted
            self.sessionQueue.resume()
        }
    }

    // Start session
    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)

        sessionQueue.async {
            self.session.startRunning()
            self.isSessionRunning = self.session.isRunning
        }
    }

    // Stop session
    override func viewWillDisappear(_ animated: Bool) {
        sessionQueue.async { [unowned self] in
            if self.permissionGranted {
                self.session.stopRunning()
                self.isSessionRunning = self.session.isRunning
            }
        }
        super.viewWillDisappear(animated)
    }

    // Configure session properties
    private func configureSession() {
        guard permissionGranted else { return }

        session.beginConfiguration()
        session.sessionPreset = AVCaptureSessionPreset1280x720

        guard let captureDevice = AVCaptureDevice.defaultDevice(withDeviceType: .builtInWideAngleCamera, mediaType: AVMediaTypeVideo, position: .back) else { return }
        guard let captureDeviceInput = try? AVCaptureDeviceInput(device: captureDevice) else { return }
        guard session.canAddInput(captureDeviceInput) else { return }
        session.addInput(captureDeviceInput)

        let videoOutput = AVCaptureVideoDataOutput()
        videoOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String : kCVPixelFormatType_32BGRA]
        videoOutput.alwaysDiscardsLateVideoFrames = true
        guard session.canAddOutput(videoOutput) else { return }
        session.addOutput(videoOutput)

        session.commitConfiguration()
        videoOutput.setSampleBufferDelegate(self, queue: sessionQueue)
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }

}

extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput!, didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, from connection: AVCaptureConnection!) {

    }
}
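  • Two small design choices in configureSession() are worth pointing out: alwaysDiscardsLateVideoFrames = true means AVFoundation simply drops frames that arrive while the delegate is still busy with the previous one, so a slow prediction can never build up a backlog and lag the preview; and both the session configuration and the sample-buffer callbacks run on sessionQueue, keeping the main thread free for the UI.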
  • Don’t forget to add the ‘Privacy – Camera Usage Description’ key to Info.plist, then run the app on your device. The app should show camera frames on screen with just 5% CPU usage 😉 Not bad! Now, let us add the Inception v3 model to our app.
  • If you haven’t downloaded the Inception v3 model yet, download it from the link provided above. After this step, you will have a file named ‘Inceptionv3.mlmodel’.

[Screenshot: the downloaded Inceptionv3.mlmodel file]

  • Drag and drop the ‘Inceptionv3.mlmodel’ file into your Xcode project. After importing the model, click on it; this is how a ‘*.mlmodel’ file looks in Xcode.

[Screenshot: Inceptionv3.mlmodel opened in Xcode]

  • What information does a ‘*.mlmodel’ file convey? At the top you can see some metadata such as the file’s name and size, the author and license information, and a description of the network. Then come the ‘Model Evaluation Parameters’, which explain what the model expects as input and what its output looks like. Now let us set up our ViewController.swift file to send images into the model for predictions.
  • Apple has made Machine Learning very easy through its CoreML framework. All we have to do is ‘import CoreML’ and initialise a model variable from the class Xcode generates for the ‘*.mlmodel’ file.
import UIKit
import AVFoundation
import CoreML

class ViewController: UIViewController {

    @IBOutlet weak var previewView: PreviewView!
    @IBOutlet weak var predictLabel: UILabel!

    // Session - Initialization
    private let session = AVCaptureSession()
    private var isSessionRunning = false
    private let sessionQueue = DispatchQueue(label: "session queue", attributes: [], target: nil)
    private var permissionGranted = false

    // Model
    let model = Inceptionv3()
    // Inception v3 expects a 299x299 input image (used later when resizing camera frames)
    let modelInputSize = CGSize(width: 299, height: 299)

    override func viewDidLoad() { //...
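  • Before wiring camera frames to the model, it helps to peek at what Xcode actually generated for us from ‘Inceptionv3.mlmodel’: a prediction(image:) method that takes a CVPixelBuffer and returns an output object which, for the Apple-provided model, exposes a classLabel string and a classLabelProbs dictionary. A minimal sketch of how the generated class can be queried (the helper name topLabel and the ready-made 299x299 pixel buffer are assumptions for illustration; building that buffer is covered in the next steps):
// Sketch: asking the generated Inceptionv3 class for its best guess.
// Assumes `pixelBuffer` is already a 299x299 BGRA CVPixelBuffer (built in the steps below).
func topLabel(for pixelBuffer: CVPixelBuffer, using model: Inceptionv3) -> String? {
    // prediction(image:) runs the network; it throws if the input doesn't match the model.
    guard let output = try? model.prediction(image: pixelBuffer) else { return nil }
    // For the Apple-provided model, the output exposes classLabel (the best guess)
    // and classLabelProbs (a probability for every label).
    return output.classLabel
}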
  • The fun part begins now 🙂 . If we consider every Machine Learning/Deep Learning model as a black box (i.e., we don’t know what is happening inside), then all we care about is: given certain inputs to the black box, do we get the desired outputs? (PC: Wikipedia). But we can’t feed just any input to the model and expect the desired output. If the model was trained on 1D signals, the input has to be reshaped to 1D before it is sent in; if it was trained on 2D inputs (e.g., CNNs), the input should be a 2D signal. The dimensions and size of the input must match the model’s input parameters.

[Figure: a model treated as a black box mapping inputs to outputs]

  • The Inception v3 model takes as input a 3-channel RGB image of size 299x299x3. So, we should resize our image before passing it into the model. Add the following code at the end of the ViewController.swift file; it resizes an image to the desired dimensions 😉 .
extension UIImage {
    func resize(_ size: CGSize)-> UIImage? {
        UIGraphicsBeginImageContext(size)
        draw(in: CGRect(x: 0, y: 0, width: size.width, height: size.height))
        let image = UIGraphicsGetImageFromCurrentImageContext()
        UIGraphicsEndImageContext()
        return image
    }
}
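  • One thing to be aware of with this helper: draw(in:) stretches the image to exactly the requested size, so a 16:9 camera frame gets squashed into the square 299x299 input rather than being centre-cropped. For classification this is usually acceptable, but if you want to preserve the aspect ratio you would crop the frame to a square before resizing.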
  • In order to pass the image into the CoreML model, we need to convert it from UIImage to CVPixelBuffer. To do that, I am adding some Objective-C code and linking it to the Swift code using a bridging header. If you have no clue about bridging headers or combining Objective-C with Swift code, I would suggest checking out this blog – Computer Vision in iOS – Swift+OpenCV
  • ImageConverter.h
#import <Foundation/Foundation.h>
#import <AVFoundation/AVFoundation.h>

@interface ImageConverter : NSObject

+ (CVPixelBufferRef) pixelBufferFromImage: (CGImageRef) image;

@end
  • ImageConverter.m
#import "ImageConverter.h"

@implementation ImageConverter
+ (CVPixelBufferRef)pixelBufferFromImage:(CGImageRef)image {

    CGSize frameSize = CGSizeMake(CGImageGetWidth(image), CGImageGetHeight(image));
    CVPixelBufferRef pixelBuffer = NULL;
    CVReturn status = CVPixelBufferCreate(kCFAllocatorDefault, frameSize.width, frameSize.height, kCVPixelFormatType_32BGRA, nil, &pixelBuffer);
    if (status != kCVReturnSuccess) {
        return NULL;
    }

    CVPixelBufferLockBaseAddress(pixelBuffer, 0);
    void *data = CVPixelBufferGetBaseAddress(pixelBuffer);
    CGColorSpaceRef rgbColorSpace = CGColorSpaceCreateDeviceRGB();
    CGContextRef context = CGBitmapContextCreate(data, frameSize.width, frameSize.height, 8, CVPixelBufferGetBytesPerRow(pixelBuffer), rgbColorSpace, (CGBitmapInfo) kCGBitmapByteOrder32Little | kCGImageAlphaPremultipliedFirst);
    CGContextDrawImage(context, CGRectMake(0, 0, CGImageGetWidth(image), CGImageGetHeight(image)), image);

    CGColorSpaceRelease(rgbColorSpace);
    CGContextRelease(context);
    CVPixelBufferUnlockBaseAddress(pixelBuffer, 0);

    return pixelBuffer;
}
@end
  • iOS-CoreML-Inceptionv3-Bridging-Header.h
//
//  Use this file to import your target's public headers that you would like to expose to Swift.
//

#import "ImageConverter.h"
  • Now make the final changes to ViewController.swift. In this step, we resize the input image, convert it into a CVPixelBuffer and pass it to the CoreML model to get predictions.
extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput!, didOutputSampleBuffer sampleBuffer: CMSampleBuffer!, from connection: AVCaptureConnection!) {
        guard let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        let ciImage = CIImage(cvImageBuffer: imageBuffer)
        guard let uiImage = UIImage(ciImage: ciImage).resize(modelInputSize),
            let cgImage = uiImage.cgImage,
            let pixelBuffer = ImageConverter.pixelBuffer(from: cgImage)?.takeRetainedValue(),
            let output = try? model.prediction(image: pixelBuffer) else {
                return
        }
        DispatchQueue.main.async {
            self.predictLabel.text = output.classLabel
        }
    }
}
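  • If you also want to show how confident the model is, the output’s classLabelProbs dictionary (exposed by the Apple-provided model) holds the probability for every label. An optional tweak to the dispatch block above, shown as a sketch:
        // Optional: display the top label together with its probability.
        DispatchQueue.main.async {
            let confidence = output.classLabelProbs[output.classLabel] ?? 0
            self.predictLabel.text = String(format: "%@ (%.0f%%)", output.classLabel, confidence * 100)
        }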
  • Here are some results of the app running on an iPhone 7.
  • The results look convincing, but I shouldn’t judge them since I didn’t train the network. What I do care about is the performance of the app on the phone! With the current pipeline, profiling shows the app’s CPU usage always stays below 30%. Thanks to CoreML, the heavy Deep Learning computation is moved to the GPU; the CPU only does some basic image processing, hands the image to the GPU and fetches the predictions back. There is still a lot of scope to improve the coding style of the app, and I welcome any suggestions/advice from you. 🙂 

    Source code:

    If you like this blog and want to play with the app, its source code is available here – iOS-CoreML-Inceptionv3
