Computer Vision in iOS – Object Recognition

NOTE: This blog has been updated for CoreML 2.0 and the Vision API.

Problem Statement: Given an image, can a machine accurately predict what is in that image?

Why is this so hard? Show an image to a human and ask what is in it, and he or she can tell you exactly which objects are present, where the picture was taken, what makes it special, and (if people are in it) what they are doing and what they are likely to do next. To a computer, a picture is nothing but a bunch of numbers, so it cannot understand the semantics of the image the way a human does. If, even after this explanation, the question – why is it so hard? – is still ringing in your head, then let me ask you to write an algorithm that detects (just) a cat!

Let us start with some basic assumptions – every cat has two ears, an oval face with whiskers on it, a cylindrical body, four legs and a curvy tail! Perfect 🙂 We have our initial assumptions, so we can start writing code! Assume we have written the code (say, 50 lines of if-else statements) to find primitives in an image which, when combined, form a cat that looks roughly like the figure below (PS: don’t laugh 😛 )

[Image: a cat sketched from primitive shapes]

OK, let us test the performance on some real-world images. Can our algorithm accurately detect the cat in this picture?

[Image: a photo of a tabby cat]

If you think the answer is yes, I suggest you think again. If you look carefully at the primitive-shapes image, we have actually coded a detector for a cat facing to its left. OK, no worries! Write the exact same if-else conditions for a cat facing to its right 😎 Just an extra 50 lines of conditions. Good, now we have a cat detector! Can we detect the cat in this image? 😛

[Image: a cat in a pose the hand-written rules cannot handle]

Well, the answer is no 😦 So, to tackle these types of problems, we move from basic conditionals to Machine Learning/Deep Learning. Machine Learning is a field in which machines learn to do specific tasks that previously only humans could do. Deep Learning is a subset of Machine Learning in which we train very deep neural network architectures. Many researchers have already solved this problem, and there are some popular neural network architectures that do this specific task.

The real problem lies in importing such a network onto a mobile device and making it run in real time, which is not an easy task. Convolutions in a CNN are a costly step, and the size of the network is another problem entirely (forget about it 😛 ). Companies like Google and Apple, along with a few research labs, have put heavy focus on optimizing the size and performance of neural networks, and at last we have networks running at decent speed on mobile phones. Still, a lot of amazing research remains to be done in this field. After Apple’s WWDC ’17 keynote, building an app that solves this particular problem went from a year-long effort to a single-night effort. Enough theory and facts, let us dive into the code!

To follow the rest of this blog, you need to have the following things ready:

  1. macOS 10.14 (a.k.a. macOS Mojave)
  2. Xcode 10
  3. iOS 12 on your iPhone/iPad
  4. The pre-trained Inception-v3 model, downloaded from Apple’s developer website – https://developer.apple.com/machine-learning/
  5. (Optional) Follow my previous blog to set up the camera in your app – Computer Vision in iOS – Core Camera

Once you have satisfied all the above requirements, let us move on to adding the Machine Learning model to our app.

  • First of all, create a new Xcode ‘Single View App’ project, select ‘Swift’ as the language, set your project name and wait for Xcode to create the project.
  • In this project, I am moving from my traditional CameraBuffer pipeline to a newer one so that object recognition runs asynchronously at a constant 30 FPS. This approach makes sure the user won’t feel any lag in the system (hence, a better user experience!). First, add a new Swift file named ‘PreviewView.swift’ and add the following code to it.
import UIKit
import AVFoundation

class PreviewView: UIView {
    var videoPreviewLayer: AVCaptureVideoPreviewLayer {
        return layer as! AVCaptureVideoPreviewLayer
    }

    var session: AVCaptureSession? {
        get {
            return videoPreviewLayer.session
        }
        set {
            videoPreviewLayer.session = newValue
        }
    }

    override class var layerClass: AnyClass {
        return AVCaptureVideoPreviewLayer.self
    }
}
  • Now let us add camera functionality to our app. If you followed my previous blog from the optional prerequisite, most of the content here will look pretty obvious and easy. First, go to Main.storyboard and add a ‘View’ as a child of the existing View.

[Screenshot: adding a child View in Main.storyboard]

  • After dragging and dropping it onto the existing View, go to ‘Show the Identity Inspector’ in the right-side inspector of Xcode and, under ‘Custom Class’, change the class from UIView to ‘PreviewView’. If you recall, PreviewView is the UIView subclass we added in one of the previous steps.

[Screenshot: setting the Custom Class to PreviewView in the Identity Inspector]

  • Make the View full screen with its content mode set to ‘Aspect Fill’, and add two labels under it as children, one for the predicted class and one for the corresponding prediction confidence, to visualise the MLModel outputs. Add IBOutlets for the View and both labels in the ViewController.swift file.
  • Your current ViewController.swift file should look like this –
import UIKit

class ViewController: UIViewController {

    @IBOutlet weak var previewView: PreviewView!
    @IBOutlet weak var classLabel: UILabel!
    @IBOutlet weak var confidenceLabel: UILabel!

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }

}
  • Let us initialise some parameters for the session. The session should use frames from the camera, start running when the view appears and stop running when the view disappears. We also need to make sure we have permission to use the camera; if permission has not been granted, we should ask for it before the session starts. Hence, we make the following changes to our code!
import UIKit
import AVFoundation

class ViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {

    @IBOutlet weak var previewView: PreviewView!
    @IBOutlet weak var classLabel: UILabel!
    @IBOutlet weak var confidenceLabel: UILabel!

    // Session - Initialization
    private let session = AVCaptureSession()
    private var isSessionRunning = false
    private let sessionQueue = DispatchQueue(label: "Camera Session Queue", attributes: [], target: nil)
    private var permissionGranted = false

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.

        // Set some features for PreviewView
        self.previewView.videoPreviewLayer.videoGravity = .resizeAspectFill
        self.previewView.session = session

        // Check for permissions
        self.checkPermission()

        // Configure Session in session queue
        self.sessionQueue.async { [unowned self] in
            self.configureSession()
        }
    }

    // Check for camera permissions
    private func checkPermission() {
        switch AVCaptureDevice.authorizationStatus(for: .video) {
        case .authorized:
            self.permissionGranted = true
        case .notDetermined:
            self.requestPermission()
        default:
            self.permissionGranted = false
        }
    }

    // Request permission if not given
    private func requestPermission() {
        sessionQueue.suspend()
        AVCaptureDevice.requestAccess(for: .video) { [unowned self] granted in
            self.permissionGranted = granted
            self.sessionQueue.resume()
        }
    }

    // Start session
    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)

        sessionQueue.async {
            self.session.startRunning()
            self.isSessionRunning = self.session.isRunning
        }
    }

    // Stop session
    override func viewWillDisappear(_ animated: Bool) {
        sessionQueue.async { [unowned self] in
            if self.permissionGranted {
                self.session.stopRunning()
                self.isSessionRunning = self.session.isRunning
            }
        }
        super.viewWillDisappear(animated)
    }

    // Configure session properties
    private func configureSession() {
        guard permissionGranted else { return }

        self.session.beginConfiguration()
        self.session.sessionPreset = .hd1280x720

        guard let captureDevice = AVCaptureDevice.default(.builtInWideAngleCamera, for: AVMediaType.video, position: .back) else { return }
        guard let captureDeviceInput = try? AVCaptureDeviceInput(device: captureDevice) else { return }
        guard self.session.canAddInput(captureDeviceInput) else { return }
        self.session.addInput(captureDeviceInput)

        let videoOutput = AVCaptureVideoDataOutput()

        videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "sample buffer"))
        videoOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String : kCVPixelFormatType_32BGRA]
        videoOutput.alwaysDiscardsLateVideoFrames = true
        guard self.session.canAddOutput(videoOutput) else { return }
        self.session.addOutput(videoOutput)

        self.session.commitConfiguration()
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }

    // Do per-image frame executions here
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        // TODO: Do ML Here
    }

}

  • Don’t forget to add ‘Privacy - Camera Usage Description’ (the NSCameraUsageDescription key) to Info.plist, then run the app on your device. The app should show camera frames on screen with just about 3% CPU usage 😉 Not bad! Now, let us add the Inception v3 model to our app.
  • If you haven’t downloaded the Inception v3 model yet, grab it from the link provided above. After this step, you will have a file named ‘Inceptionv3.mlmodel’.

[Image: the downloaded Inceptionv3.mlmodel file]

  • Drag and drop the ‘Inceptionv3.mlmodel’ file into your Xcode project. After importing the model, click on it; this is how the ‘*.mlmodel’ file looks in Xcode.

[Screenshot: Inceptionv3.mlmodel opened in Xcode]

  • What information does the ‘*.mlmodel’ file convey? At the top of the file view, you can see some metadata such as the file name, its size, the author and license information, and a description of the network. Then come the ‘Model Evaluation Parameters’, which tell us what the input of the model should be and what its output looks like; the sketch just below shows how to read the same information at runtime.
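  • If you prefer to check these evaluation parameters in code rather than in the Xcode model viewer, CoreML also exposes them at runtime through the model’s modelDescription. The snippet below is a minimal sketch (not part of the app’s pipeline), assuming the Xcode-generated ‘Inceptionv3’ class from the model we just imported; it simply prints the same input/output information that the model page shows.
import CoreML

// Minimal sketch: inspect the model's expected inputs and outputs at runtime.
// Assumes the Xcode-generated `Inceptionv3` class from the imported .mlmodel file.
func printModelEvaluationParameters() {
    let mlModel = Inceptionv3().model
    let description = mlModel.modelDescription

    // Each entry pairs a feature name with its expected type,
    // e.g. an image of a particular size or a dictionary of class probabilities.
    for (name, feature) in description.inputDescriptionsByName {
        print("Input  \(name): \(feature)")
    }
    for (name, feature) in description.outputDescriptionsByName {
        print("Output \(name): \(feature)")
    }
}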
  • Now let us set up our ViewController.swift file to send images into the model for predictions. Apple has made Machine Learning very easy through its CoreML framework: all we have to do is import CoreML (plus Vision, since we will wrap the model in a Vision request) and initialise a model variable from the generated ‘*.mlmodel’ class. As an aside after the snippet, I also show how the generated class can be called directly, without Vision.
import UIKit
import AVFoundation
import CoreML
import Vision

class ViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {

    @IBOutlet weak var previewView: PreviewView!
    @IBOutlet weak var classLabel: UILabel!
    @IBOutlet weak var confidenceLabel: UILabel!

    // Session - Initialization
    private let session = AVCaptureSession()
    private var isSessionRunning = false
    private let sessionQueue = DispatchQueue(label: "session queue", attributes: [], target: nil)
    private var permissionGranted = false

    // Model
    let model = try? VNCoreMLModel(for: Inceptionv3().model)

    override func viewDidLoad() { //...
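  • As a side note, you don’t strictly need Vision to run the model: the Xcode-generated class can be called directly, as long as you hand it a pixel buffer in exactly the format the model expects (299×299 for Inception v3). The sketch below illustrates this; the input name ‘image’ and the output names ‘classLabel’/‘classLabelProbs’ are taken from the model page shown above, so double-check them against your own ‘*.mlmodel’ if they differ.
import CoreML
import CoreVideo

// Minimal sketch: run Inceptionv3 directly through the generated class.
// The pixel buffer must already be a 299x299 image; unlike Vision,
// plain CoreML will not resize or crop the input for you.
func classifyDirectly(pixelBuffer: CVPixelBuffer) {
    let inception = Inceptionv3()
    guard let output = try? inception.prediction(image: pixelBuffer) else { return }

    // Top-1 label plus its probability, looked up in the class-probability dictionary.
    let label = output.classLabel
    let confidence = output.classLabelProbs[label] ?? 0
    print("\(label): \(confidence)")
}
  • Having to prepare that exact pixel buffer yourself is precisely the kind of busywork the Vision API removes, which is why we wrapped the model in VNCoreMLModel above.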
  • The fun part begins now 🙂 If we consider every Machine Learning/Deep Learning model as a black box (i.e., we don’t know what is happening inside), then all we should care about is: given certain inputs to the black box, do we get the desired outputs? (Image credit: Wikipedia.) But we can’t send just any type of input to the model and expect the desired output. If the model was trained on 1D signals, the input should be reshaped to 1D before being sent in; if the model was trained on 2D inputs (e.g. CNNs), the input should be a 2D signal. The dimensions and size of the input must match the model’s input parameters.

[Image: a machine learning model viewed as a black box, with inputs going in and predictions coming out]

  • Luckily, for models that take images as input, Apple’s Vision API has made things very easy. To pass images from the camera stream into the model, we need to make the following edits to the captureOutput function at the end of the ViewController.swift file.
    // Do per-image-frame executions here!!!
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        // TODO: Do ML Here
        guard let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        let request = VNCoreMLRequest(model: model!) {
            (finishedReq, err) in
            guard let results = finishedReq.results as? [VNClassificationObservation] else { return }
            guard let firstObservation = results.first else { return }

            DispatchQueue.main.async {
                self.classLabel.text = firstObservation.identifier
                self.confidenceLabel.text = NSString(format: "%.4f", firstObservation.confidence) as String
            }
        }
        try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])

    }
  • What is actually happening in the above code block? Every frame captured by the camera arrives at this function as a CMSampleBuffer. We first extract its CVPixelBuffer and pass it to a Vision request through a VNImageRequestHandler. The VNCoreMLRequest takes care of getting the outputs and displaying them on the screen, and it also applies whatever image transformations (resizing, cropping, format conversion) the input needs before it is passed into the model. Such a life saver when you want to focus on model performance instead of juggling image formats and image-processing ops 😎
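  • Because the request handles scaling and cropping for us, it is worth knowing that this behaviour is configurable. The sketch below is one possible refinement rather than part of the app above: it creates the model and request once instead of on every frame, picks the crop-and-scale option explicitly, and passes the frame orientation to the handler. The Vision property and initializer names are standard API; the FrameClassifier wrapper and its onResult callback are just hypothetical names for this illustration.
import Foundation
import ImageIO
import CoreML
import Vision

// A sketch of a reusable classifier. The VNCoreMLModel is created once by the
// caller (e.g. from the Xcode-generated Inceptionv3 class), and the hypothetical
// `onResult` callback receives the top label and its confidence on the main queue.
final class FrameClassifier {
    private let request: VNCoreMLRequest

    init(model: VNCoreMLModel, onResult: @escaping (String, VNConfidence) -> Void) {
        request = VNCoreMLRequest(model: model) { request, _ in
            guard let results = request.results as? [VNClassificationObservation],
                  let best = results.first else { return }
            DispatchQueue.main.async { onResult(best.identifier, best.confidence) }
        }
        // Center-crop each frame to the model's input size instead of squashing it.
        request.imageCropAndScaleOption = .centerCrop
    }

    func classify(pixelBuffer: CVPixelBuffer) {
        // Tell Vision how the buffer is oriented relative to the sensor
        // (.right is the usual value for the back camera held in portrait).
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                            orientation: .right,
                                            options: [:])
        try? handler.perform([request])
    }
}
  • Creating the request once avoids re-wrapping the model on every frame, which is a small but easy win when frames arrive at 30 FPS.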
  • Here are some results of the app running on iPhone 7.
  • The results look convincing, but I should not judge them, as the network was not trained by me. What I care about is the performance of the app on the phone! With the current pipeline, while profiling the application, the CPU usage of the app stays below 30%. Thanks to CoreML, the whole Deep Learning computation has been moved to the GPU; the CPU only does some basic image processing, passes the image to the GPU and fetches the predictions back. There is still a lot of scope to improve the coding style of the app, and I welcome any suggestions/advice from you. 🙂

    Source code:

    If you like this blog and want to play with the app, the code for this app is available here – iOS-CoreML-Inceptionv3

Wanna say thanks?

Like this blog? Found this blog useful and you feel that you learnt something at the end? Feel free to buy me a coffee 🙂 A lot of these blogs wouldn’t have been completed without the caffeine in my veins 😎

