Faster Style Transfer – PyTorch & CuDNN

In our previous blog, we showed how to create our own mini Deep Learning pipeline to train models using PyTorch. MNIST is great for rapidly prototyping and testing low-level ideas in Deep Learning, and it needs very little compute to train and test models. In this blog, let us move on from the MNIST dataset and try solving another interesting challenge in Deep Learning, Style Transfer, and look at some stats on how we can achieve real-time inference on a desktop using only Python.

NOTE: If you would like to dive into the code right away, the code for this project is available at – LINK.

For solving Artistic Style Transfer with PyTorch, let us use a larger dataset such as MS-COCO. Since there are many interesting blogs/articles online explaining Style Transfer, I would like to focus this blog on the tweaks that make performance efficient in terms of both training and inference. The rest of this blog is organized as follows – we will quickly go through a naive definition of Style Transfer, take the code provided by the PyTorch examples and convert it into the pipeline we discussed in the ‘Intro To PyTorch’ blog, quickly train the model with minimal hyper-parameter tuning, and save the trained model. Finally, we load this saved model in inference mode and use a webcam feed to perform (maybe) real-time style transfer.

What is Style Transfer?

Artistic Style Transfer

As the picture above says it all, in the style transfer application we train a network to render the input (content) image in the desired style. If you are interested in learning more about this, please feel free to read the PyTorch tutorial on Neural Style.

Model Training

  • Clone the repo: StyleTransfer-PyTorch
  • Download the dataset from the MS-COCO website, put it in the data/ directory, and add style images of your choice under data/styles/.
  • Your directory structure should look like this before training.
.
├── main.py (Entry point to the application)
├── net.py  (Net class for init/train/test DL models)
├── models/ (Directory containing different DL arch.)
|   ├── __init__.py
|   ├── transformer_net.py (Style Transfer Network)
|   ├── vgg.py 
|   └── ...
├── loaders/ (Directory containing data/model loaders)
|   ├── __init__.py
|   ├── data_loader.py (DataLoader class for loading data)
|   └── model_loader.py (ModelLoader class for loading models)
├── data/ (Directory containing data)
|   ├── coco/
|   |   └──train2014/ (Content images!)
|   └── styles/ (Directory for styles)
├── LICENSE  (License of your choice)
└── README.md(Proper documentation for Setup, Running & Results)
  • Create the Anaconda environment using the instructions given in – Intro To PyTorch.
  • Run python main.py --phase train to train the model.

A few things to consider for faster training –

  • Batch Size: You can play with the train_batch_size argument for faster training.
  • Workers for Data Loading: (NOTE: this doesn’t work on Windows.) You can increase the num_workers argument to raise the number of concurrent workers for data pre-processing.
  • (Optional) Image Pre-processing backend: (NOTE: this might involve a hectic setup on Windows; an alternative is to use OpenCV instead of Pillow.) In PyTorch, we use the TorchVision module to ease image pre-processing, and by default TorchVision uses the Pillow backend. You can replace Pillow with Pillow-SIMD to make your image pre-processing faster.
  • CuDNN backend: Make sure to use the CuDNN backend if you are running your training on an Nvidia GPU, and set the CuDNN benchmark flag to True for the best performance in both training and inference. The following code block does the magic.
if torch.cuda.is_available():
    # Sanity check - empty the CUDA cache
    torch.cuda.empty_cache()
    # Enable the CuDNN backend
    torch.backends.cudnn.enabled = True
    # Let CuDNN benchmark and pick the fastest algorithms for your layer shapes
    torch.backends.cudnn.benchmark = True

On an RTX 2060 GPU, training for 2 epochs with a batch size of 6 takes roughly 40 minutes.

Please note that the above metrics are for Windows 10. In my case, the bottleneck is data loading. You can easily speed up the DataLoader either by setting num_workers > 1 (a minimal sketch follows) or by storing mini-batches of pre-processed intermediate tensor representations of the images in pickle or h5py format. Since the DataLoader topic is outside the scope of this blog, we can review it in the future.
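
For illustration, here is a minimal sketch of a multi-worker DataLoader – the dummy dataset, batch size, and pin_memory choice are mine, not from the repo:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the MS-COCO content images
dataset = TensorDataset(torch.randn(100, 3, 256, 256))

loader = DataLoader(
    dataset,
    batch_size=6,      # matches the batch size used in the training run above
    shuffle=True,
    num_workers=4,     # >1 parallelizes pre-processing; keep it 0 on Windows
    pin_memory=True,   # speeds up subsequent CPU -> GPU copies
)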

Model Inference

Here comes my favorite part! Real-time inference for Deep Learning models is probably one of the least covered areas in open-source blogs; honestly, I have seen very little documentation around this topic online. Since this area deserves attention, considering the growing adoption of Deep Learning models in production, I thought it would be a good idea to discuss it. In the following sections, we will start with a very simple inference pipeline and tweak every individual element of it to get the desired real-time experience. Since the topics covered may be new to a lot of people, I will try to keep things concise and clear. Further, to keep things simple, I am implementing everything using only Python.

Rule #1: If you are using a laptop instead of a desktop or cloud VM, please make sure it is connected to a constant power source and running in High Performance mode. Running the laptop on battery can starve the GPU of power and lead to GPU throttling.

Simple Inference Pipeline – Webcam

Let us write a simple inference pipeline using the webcam feed as input to the style transfer model. The inference pipeline for this scenario looks as follows:

Initialize Webcam
while True:
    FETCH frame from webcam
    PREPROCESS the frame
    INFERENCE by passing frame into DL Model
    RENDER results on screen

The final code for this section is available at – LINK.

1. Initial Inference Pipeline

import cv2
import torch
from torchvision import transforms

# `model` is the trained style transfer network, loaded beforehand
# Load model in eval mode
model.eval()

# Setup content transform
content_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.mul(255))
])

# Initialize the camera
camera = cv2.VideoCapture(0)

with torch.no_grad():
    while True:
        # Fetch
        _, frame = camera.read()
        # Preprocess
        content_image = content_transform(frame)
        content_image = content_image.unsqueeze(0).cuda()
        # Predict
        output = model(content_image)
        # Postprocess
        output = output.cpu().detach()[0].clamp(0, 255).numpy().transpose(1,2,0).astype("uint8")
        
        # Render results
        cv2.imshow('Frame', output)
        k = cv2.waitKey(1)
        if k == 27:  # ESC key
            break
    camera.release()

If you run the above block, you will achieve roughly 15.7 FPS (i.e., roughly 63.5 milliseconds per frame). There are a few ways to optimize this: either keep optimizing the model architecture by trying different layer combinations, layer fusions, etc., or first check how optimized your end-to-end pipeline is. In this blog, we will focus on optimizing the end-to-end pipeline. A detailed analysis of all the experiments is provided in the Jupyter Notebook – LINK.
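
If you want to reproduce such numbers yourself, here is a minimal timing sketch – the harness and the averaging window are my own choices, not from the repo:

import time
import torch

# torch.cuda.synchronize() ensures pending GPU work finishes before the
# clock is read; otherwise the numbers look unrealistically fast
num_frames = 100
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(num_frames):
    pass  # fetch + preprocess + inference + render, as in the loop above
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print("{:.1f} ms/frame -> {:.1f} FPS".format(
    1000 * elapsed / num_frames, num_frames / elapsed))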

2. Optimize Preprocessing

In preprocessing, we convert the webcam frame from UInt8 HWC format to a tensor representation (Float32 NCHW format, where N is the number of examples, H the height of the image, W the width, and C the number of channels). In the original implementation, we use the TorchVision transforms with the Pillow backend to achieve this, which takes roughly 8.6 milliseconds per frame. We can speed up some of these Ops by replacing Pillow with OpenCV+Numpy Ops. Upon further investigation, we observed that converting UInt8 to Float32 is the most costly step in the preprocessing phase, yet transferring Float32 to the GPU is faster than transferring UInt8.

Rule #2: Minimize the data transfers between CPU and GPU. Note that GPU clock speeds are lower than the CPU’s, so design your computations wisely!

Upon testing a few ideas, we concluded that transferring the UInt8 to the GPU and converting it to Float32 on the GPU is much faster overall than converting UInt8 to Float32 on the CPU and then transferring the Float32 to the GPU. The following code block summarizes this idea.

# Preprocess
frame = frame.swapaxes(1, 2).swapaxes(0, 1) # HWC -> CHW
frame = frame[np.newaxis, :, :, :]          # CHW -> Numpy NCHW
content_image = torch.from_numpy(frame)     # Numpy -> Torch Tensor
content_image = content_image.cuda()        # CPU (UInt8) -> GPU (Byte)
content_image = content_image.type(torch.cuda.FloatTensor)  # Byte -> Float32, on GPU
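
To convince yourself of this, here is a small benchmark sketch comparing the two orderings – the frame size and iteration count are my own choices:

import time
import torch

frame = torch.randint(0, 256, (1, 3, 480, 640), dtype=torch.uint8)

def bench(fn, iters=200):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return 1000 * (time.perf_counter() - start) / iters

# Option A: convert to Float32 on the CPU, then transfer
cpu_first = bench(lambda: frame.float().cuda())
# Option B: transfer UInt8, then convert to Float32 on the GPU
gpu_first = bench(lambda: frame.cuda().float())
print("CPU convert + transfer: %.2f ms | transfer + GPU convert: %.2f ms"
      % (cpu_first, gpu_first))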

A detailed explanation of the above ops with runtimes can be seen in the following notebook. In the next section, we will see how we optimized the post-processing phase of the pipeline.

3. Optimize Post-processing

Thanks to Python, our entire post-processing can be written in a single line of code: output = output.cpu().detach()[0].clamp(0, 255).numpy().transpose(1,2,0).astype("uint8"). Here is what this line does:

  • cpu() – Copy Float32 Tensor from GPU to CPU
  • detach()[0] – detach() stops tracking gradients (refer this), and [0] drops the batch dimension, leaving a CHW tensor.
  • clamp(0, 255) – Clamp the values to the valid image pixel range.
  • numpy() – Convert the Torch Tensor to a Numpy Array.
  • transpose(1,2,0) – Numpy’s variant of swapaxes; converts CHW to HWC.
  • astype("uint8") – Numpy’s way of changing the data-type of the Array.

This initial implementation takes roughly 4 milliseconds. That might seem fast, but if we design the ops properly, we can get the entire post-processing to run in 0.5 milliseconds. Here is how – clamp() is a per-element operation which can be trivially parallelized, so simply push that Op to the GPU. From the preprocessing optimization, we have seen that type conversions are faster on the GPU than on the CPU, so convert the tensor from Float to Byte on the GPU before transferring it to the CPU. Finally, do the rest of the ops on the CPU. Here is the summary in a single line of code – output = output.clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach()[0].numpy().transpose(1,2,0).
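
For readability, here is the same chain of ops broken out inside a small helper – purely a restatement of the one-liner above; the function name is my own:

import torch

def postprocess(output):
    # `output` is the model's NCHW Float32 tensor, still on the GPU
    output = output.clamp(0, 255)                # per-element clamp, on the GPU
    output = output.type(torch.cuda.ByteTensor)  # Float32 -> UInt8, on the GPU
    output = output.cpu().detach()               # transfer a quarter of the bytes
    return output[0].numpy().transpose(1, 2, 0)  # drop batch dim, CHW -> HWC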

4. Async Webcam Frame Extraction

Know your hardware limits! A typical consumer webcam can only provide frames at 30 FPS; there is no way to speed that up. But by carefully moving the webcam frame extraction onto a separate thread, you can remove the overhead of fetching frames from the webcam. That can save us up to 33.3 milliseconds per frame, which can be spent on other costly operations such as model inference. You can look at VideoCaptureAsync to understand the implementation; there are many other ways to implement asynchronous webcam frame extraction, and this article, LINK, does a really good job of explaining them.

Rule #3: Try to keep data loading on a separate thread. You can use that time for heavy compute such as model inference!
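
For reference, here is a minimal sketch of what a threaded capture class like VideoCaptureAsync can look like – the actual implementation linked above may differ in details:

import threading
import cv2

class VideoCaptureAsync:
    """Minimal sketch of a threaded webcam reader: the grab loop runs in the
    background so the main loop never blocks on the ~33 ms frame interval."""

    def __init__(self, src=0):
        self.cap = cv2.VideoCapture(src)
        self.grabbed, self.frame = self.cap.read()
        self.running = False
        self.lock = threading.Lock()

    def start(self):
        self.running = True
        self.thread = threading.Thread(target=self._update, daemon=True)
        self.thread.start()
        return self

    def _update(self):
        while self.running:
            grabbed, frame = self.cap.read()
            with self.lock:
                self.grabbed, self.frame = grabbed, frame

    def read(self):
        with self.lock:
            return self.grabbed, self.frame.copy()

    def release(self):
        self.running = False
        self.thread.join()
        self.cap.release()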

5. Putting it all together

Here is the final reference implementation combining everything we discussed above. It speeds up our inference from 15.7 FPS to 21.3 FPS. That might not seem like a big speed-up, but we have now optimized almost every phase of the inference pipeline except the model architecture itself; I consider model optimization a more advanced topic and am leaving it for future blogs.

import cv2
import numpy as np
import torch

# `model` is the trained style transfer network, loaded beforehand, and
# VideoCaptureAsync is the threaded capture class sketched above.
# No TorchVision content transform here - the optimized preprocessing
# below works directly on the Numpy frame.

# Load model in eval mode
model.eval()

# Initialize the camera - Async
camera = VideoCaptureAsync(0)
camera.start()

with torch.no_grad():
    while True:
        # Fetch
        _, frame = camera.read()
        # Preprocess (Optimized)
        frame = frame.swapaxes(1, 2).swapaxes(0, 1)  # HWC -> CHW
        frame = frame[np.newaxis, :, :, :]           # CHW -> NCHW
        content_image = torch.from_numpy(frame)      # Numpy -> Torch Tensor
        content_image = content_image.cuda()         # CPU (UInt8) -> GPU (Byte)
        content_image = content_image.type(torch.cuda.FloatTensor)  # Byte -> Float32, on GPU
        # Predict
        output = model(content_image)
        # Postprocess - Optimized
        output = output.clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach()[0].numpy().transpose(1,2,0)
        
        # Render results
        cv2.imshow('Frame', output)
        k = cv2.waitKey(1)
        if k == 27:  # ESC key
            break
    camera.release()

You can further optimize this implementation using either CUDA Streams or asyncio, but I am leaving those topics for the reader to explore. Also, if you design your model architecture in a way that gets the most benefit out of CuDNN, you can easily achieve upwards of 47 FPS; I am leaving this as an exercise for the readers to experiment with. The following video shows the style transfer application running in real time. Please switch to “HD” before playing, as WordPress’s default video encoding deteriorates the quality.

Real-time Style Transfer (640×480 @ 47 FPS) [NOTE: Play in HD for better experience]

Thank you for reading this post, if you want to stay up-to-date with my future articles, you can subscribe by entering your email below.


The source code for this blog is available at LINK.


Intro To PyTorch

2018 has been a revolutionary year for the field of Deep Learning, especially with the release of new libraries and numerous features in existing ones. Let us quickly go through some current Deep Learning libraries: TensorFlow, PyTorch, Apache MXNet, Chainer, TuriCreate, and CNTK. There are also wrappers written around these libraries to simplify the use and creation of deep learning architectures, such as Keras and fast.ai.

My favorite part is the comfort these libraries provide: easy deep learning architecture design, multi-GPU/distributed training, straightforward creation of custom layers and custom loss functions, model zoos with pools of pre-trained models, and support for converting trained models from library-specific formats to platform-specific ({iOS, Android, and Raspberry Pi}) models.

It looks like we have a lot to cover to keep ourselves on our toes and meet the pace of current market trends in Deep Learning, and I, after a long gap, would like to start with an introduction to PyTorch. The contents of this blog will be as follows:

  • Environment Setup
  • Introductory (modular) code as a skeleton for DL applications.
  • Data Loading
  • Model Training and Testing
  • Visualization
  • Model Saving and Loading

For simplicity, we are covering only a subset of things in this blog to get your wheels rolling in the field, aiming for a high-level overview that can eventually help you design your own Deep Learning toolkit.

Environment Setup

There are many ways to set up your Deep Learning environment –

  • Cloud VMs – Microsoft Azure, Google Cloud or Amazon’s AWS provide specialized VMs with pre-configured tools and libraries that can help you with your DL/ML journey.
  • Custom Setup – You can either have a local machine with decent GPUs or a cloud VM (with the OS of your choice), and set up your deep learning environment yourself using either virtual-env or Anaconda.

My personal choice is to use Anaconda (for both Cloud VMs and Custom Setups). You can download and install Anaconda from the following link – https://www.anaconda.com/distribution/. After installing Anaconda, you can set up your working environment using the following commands –

$ conda create -n oml python=3.6
$ conda activate oml
(oml) $ conda install pytorch torchvision -c pytorch
(oml) $ conda install tensorflow-gpu
(oml) $ conda install tensorboardX

Skeleton Code for DL applications

Long gone are the days when we would create a single Python script for our Deep Learning models. To efficiently train large architectures on larger datasets, it is far less painful to follow a modular code pattern. A lot of open-source contributors already follow specific patterns that can look overwhelming to newcomers, so I would like to use this section to show how I organize my Deep Learning projects. Please note that it is up to the individual how to organize their projects; I am just providing the structure that works best for my workflow. Here is a skeleton of my project structure –

.
├── main.py (Entry point to the application)
├── net.py  (Net class for init/train/test DL models)
├── models/ (Directory containing different DL arch.)
|   ├── __init__.py
|   ├── <model_name>.py
|   └── ...
├── loaders/ (Directory containing data/model loaders)
|   ├── __init__.py
|   ├── data_loader.py (DataLoader class for loading data)
|   └── model_loader.py (ModelLoader class for loading model)
├── LICENSE  (License of your choice)
└── README.md(Proper documentation for Setup, Running & Results)

The code for this repo can be seen at – LINK. The main.py file contains the entry point for the entire pipeline, with the necessary arguments to train/infer the Deep Learning models; the net.py file has the Net class, which uses one of those arguments to load the dataset and model structure for further computations.
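
As an illustration, the argument handling in main.py might look something like the sketch below – the exact flags and the Net interface are my guesses based on the commands used elsewhere in this blog:

import argparse

from net import Net  # the Net class described above; its interface is assumed here

def parse_args():
    parser = argparse.ArgumentParser(description='Entry point for the DL pipeline')
    # --phase and --continue_train appear in the commands later in this blog;
    # the remaining arguments are illustrative defaults, not the repo's exact set
    parser.add_argument('--phase', type=str, default='train', choices=['train', 'test'])
    parser.add_argument('--continue_train', type=int, default=0)
    parser.add_argument('--data_dir', type=str, default='data/')
    parser.add_argument('--train_batch_size', type=int, default=64)
    parser.add_argument('--test_batch_size', type=int, default=1000)
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_args()
    net = Net(args)  # hand everything to the Net class
    net.train() if args.phase == 'train' else net.test()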

Data Loading

Loading data efficiently for training and testing used to be a big hassle once upon a time. In PyTorch, loading and handling data has become easy with torch.utils.data.DataLoader, and TorchVision supports many popular datasets for easy loading through torchvision.datasets. Sample code for loading the MNIST data can be written as follows –

def loadMNIST(self, args):
    mnist_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    self.train_loader = torch.utils.data.DataLoader(
        datasets.MNIST(args.data_dir, train=True, download=True,
                       transform=mnist_transform),
        batch_size=args.train_batch_size, shuffle=True, **self.kwargs)
    self.test_loader = torch.utils.data.DataLoader(
        datasets.MNIST(args.data_dir, train=False,
                       transform=mnist_transform),
        batch_size=args.test_batch_size, shuffle=True, **self.kwargs)

Model Training and Testing

The net.py file contains the helper code for training and testing the initialized model. This is where we initialize the required model, load the datasets for training and testing, load the necessary optimizer, and load/save models. One interesting thing to consider is the _build_model(self) method: for training, we can either load a pre-trained model or start training from scratch, and while loading the model, we can choose to run training/inference on the GPU or CPU. If multiple GPUs are available, PyTorch’s nn.DataParallel can help with easy multi-GPU training.

def _build_model(self):
    # Load the model
    _model_loader = ModelLoader(self.args)
    self.model = _model_loader.model

    # If continue_train, load the pre-trained model
    if self.args.phase == 'train':
        if self.args.continue_train:
            self.load_model()

    # If multiple GPUs are available, automatically include DataParallel
    if self.args.multi_gpu and torch.cuda.device_count() > 1:
        self.model = nn.DataParallel(self.model)
    self.model.to(self.device)

Visualization

While training large deep neural networks, it is helpful to visualize the loss, accuracy, and other important metrics that can help us debug our networks. TensorBoard really comes in handy here, and interestingly, we can integrate it into our PyTorch pipeline with the help of TensorBoardX. The code for integrating it is as easy as –

from tensorboardX import SummaryWriter

# Initialize summary writer
self.writer = SummaryWriter('runs/{}'.format(self.args.data_name))

# Add the values to Summary Writer
self.writer.add_scalar('train/loss', loss.item(), self.args.iter_count)

You can start the TensorBoard session and run the training by using the following command –

(oml) $ tensorboard --logdir=./runs/ --host 0.0.0.0 --port 6007 & python main.py --phase train --continue_train 0

# Go to http://localhost:6007 to see the results.

Save/Load Models

It is important to save your models periodically in the middle of training. Long training runs can hit unexpected OS errors or out-of-memory errors, and saving trained models periodically protects the time and resources invested in your training phase. PyTorch model saving and loading is as easy as –

# Save the model
torch.save(self.model.state_dict(), model_filename)

# Load the (state_dict to) model
self.model.load_state_dict(torch.load(model_filename))
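
Here is a minimal sketch of periodic checkpointing – the dummy model, interval, and filename pattern are illustrative choices, not from the repo:

import os
import torch
import torch.nn as nn

# Dummy model standing in for your real network
model = nn.Linear(10, 2)
os.makedirs('checkpoints', exist_ok=True)

save_every, num_epochs = 5, 20
for epoch in range(1, num_epochs + 1):
    # ... one epoch of training goes here ...
    if epoch % save_every == 0:
        torch.save(model.state_dict(),
                   'checkpoints/model_epoch_{}.pth'.format(epoch))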

Source Code – LINK

The source code for this blog is made open-source so that other DL enthusiasts can use this as a primer for their DL related projects involving PyTorch.