Faster Style Transfer – PyTorch & CuDNN

In our previous blog, we showed how to create our own mini Deep Learning pipeline to train models using PyTorch. MNIST is great for rapidly prototyping and testing low-level ideas in Deep Learning, and it needs very little compute to train and test models. In this blog, let us move on from the MNIST dataset and try a more interesting Deep Learning challenge, Style Transfer, and look at some stats on how we can achieve real-time inference on a desktop using only Python.

NOTE: If you would like to dive into the code right away, the code for this project is available at – LINK.

To solve Artistic Style Transfer using PyTorch, let us use a much larger dataset such as MS-COCO. Since there are many interesting blogs and articles online explaining Style Transfer, I would like to focus this blog on the tweaks that give efficient performance in both training and inference. The rest of this blog is organized as follows. We will quickly go through a simple definition of Style Transfer. Then we will take the code provided by the PyTorch examples and convert it into the pipeline we discussed in the ‘Intro To PyTorch’ blog. Next, we will train the model with minimal hyper-parameter tuning and save the trained model. Finally, we will load this saved model in inference mode and use a webcam feed to perform (maybe) real-time style transfer.

What is Style Transfer?

Artistic Style Transfer

As the picture shows, in the style transfer application we train a network to convert the input (content) image into the desired style. If you are interested in learning more about this, please feel free to read the PyTorch tutorial on Neural Style.

Model Training

  • Clone the repo: StyleTransfer-PyTorch
  • Download the dataset from the MS-COCO website, put it in the data/ directory, and add styles of your choice.
  • Your directory structure should look like this before training.
├── (Entry point to the application)
├──  (Net class for init/train/test DL models)
├── models/ (Directory containing different DL arch.)
|   ├──
|   ├── (Style Transfer Network)
|   ├── 
|   └── ...
├── loaders/ (Directory containing data/model loaders)
|   ├──
|   ├── (DataLoader class for loading data)
|   └── (ModelLoader class for loading models)
├── data/ (Directory containing data)
|   ├── coco/
|   |   └──train2014/ (Content images!)
|   └── styles/ (Directory for styles)
├── LICENSE  (License of your choice)
└── (Documentation for Setup, Running & Results)
  • Create the Anaconda environment using the instructions given in – Intro To PyTorch.
  • Run – python --phase train, to train the model.

A few things to consider for faster training –

  • Batch Size: You can play with the train_batch_size argument for faster training.
  • Workers for Data Loading: (NOTE: It doesn’t work on Windows). You can increase the num_workers argument to increase the number of concurrent workers for data pre-processing.
  • (Optional) Image Pre-processing backend: (NOTE: Might involve a hectic setup on Windows. An alternative is to use OpenCV instead of Pillow.) In PyTorch, we use the TorchVision module to ease image pre-processing. By default, TorchVision uses the Pillow backend. You can replace it with Pillow-SIMD to make your image pre-processing faster.
  • CuDNN backend: Make sure to set your backend to CuDNN if you are running your training on an Nvidia GPU. Also, set the CuDNN benchmark flag to True for the best performance in both training and inference. The following code block will do the magic.
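import torch

if torch.cuda.is_available():
    # Sanity check - Empty CUDA Cache
    torch.cuda.empty_cache()
    # Enforce CUDNN backend
    torch.backends.cudnn.enabled = True
    # Let CuDNN benchmark and pick the fastest algorithms for fixed input sizes
    torch.backends.cudnn.benchmark = True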

On an RTX 2060 GPU, training for 2 epochs with a batch size of 6 takes roughly 40 minutes.

Please note that the above numbers are for Windows 10. In my case, the bottleneck is data loading. You can speed up the DataLoader either by raising num_workers or by storing mini-batches of pre-processed intermediate tensor representations of the images in pickle or h5py format. Since the DataLoader is outside the scope of this blog, we can review it in the future.
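As a rough sketch of the first option, the loader below shows where num_workers plugs in. The make_loader helper, the random stand-in dataset and the pin_memory setting are assumptions for illustration; this is not the project's own DataLoader class.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=6, num_workers=4):
    """Build a DataLoader that prepares batches in background worker processes."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,               # 0 = load in the main process
        pin_memory=torch.cuda.is_available(),  # page-locked memory speeds up host-to-GPU copies
        drop_last=True,
    )


if __name__ == "__main__":
    # Stand-in for the content images: 24 random 3x64x64 "images".
    fake_images = TensorDataset(torch.rand(24, 3, 64, 64))
    loader = make_loader(fake_images, batch_size=6, num_workers=2)
    for (batch,) in loader:
        print(batch.shape)  # torch.Size([6, 3, 64, 64])
```

Worker processes are spawned rather than forked on Windows, so the loop that iterates the loader has to sit under the `if __name__ == "__main__":` guard there.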

Model Inference

Here comes my favorite part! Real-time inference with Deep Learning models is probably one of the least covered areas in open-source blogs. Honestly, I have seen very little documentation on this topic online. Since this area deserves some attention, considering the growing adoption of Deep Learning models in production, I thought it would be a good idea to discuss it. In the following sections, we will start with a very simple inference pipeline and tweak each element of it to get the desired real-time experience. Since these topics can be new to a lot of people, I will try to keep things concise and clear. Further, to keep things simple, I am implementing everything using only Python.

Rule #1: If you are using a laptop instead of a desktop or cloud VM, make sure it is connected to a constant power source and running in High Performance mode. Running the laptop on battery can limit the power supplied to the GPU and cause it to throttle.

Simple Inference Pipeline – Webcam

Let us write a simple inference pipeline using the webcam feed as input to the style transfer model. The inference pipeline for this scenario should look as follows:

Initialize Webcam
while True:
    FETCH frame from webcam
    PREPROCESS the frame
    INFERENCE by passing frame into DL Model
    RENDER results on screen

The final code for this section is available at – LINK.

1. Initial Inference Pipeline

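import cv2
import torch
from torchvision import transforms

# Load model in eval mode
# (The loading code is missing here; `model` is assumed to be the trained
# style-transfer network, already moved to the GPU.)
model.eval()

# Setup content transform
content_transform = transforms.Compose([
    transforms.ToTensor(),                   # UInt8 HWC -> Float32 CHW in [0, 1]
    transforms.Lambda(lambda x: x.mul(255))  # rescale to [0, 255]
])

# Initialize the camera
camera = cv2.VideoCapture(0)

with torch.no_grad():
    while True:
        # Fetch
        _, frame = camera.read()
        # Preprocess
        content_image = content_transform(frame)
        content_image = content_image.unsqueeze(0).cuda()
        # Predict
        output = model(content_image)
        # Postprocess
        output = output.cpu().detach()[0].clamp(0, 255).numpy().transpose(1,2,0).astype("uint8")
        # Render results
        cv2.imshow('Frame', output)
        k = cv2.waitKey(1)
        if k==27:  # Esc key
            break

camera.release()
cv2.destroyAllWindows()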

If you run the above block, you will get roughly 15.7 FPS (i.e., roughly 63.5 milliseconds per frame). There are two ways to improve this: either optimize the model architecture by trying different layer combinations, layer fusions, etc., or first check how well optimized your end-to-end pipeline is. In this blog, we will focus on optimizing the end-to-end pipeline. A detailed analysis of all the experiments is provided in the Jupyter Notebook – LINK.
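Before tweaking anything, it helps to know where the milliseconds go. The sketch below is one possible way to time each stage of the loop and to convert latency into FPS. StageTimer and fps_from_ms are hypothetical helper names, not part of the project's code, and the sleep call stands in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


def fps_from_ms(ms_per_frame):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


class StageTimer:
    """Accumulate wall-clock time per named pipeline stage."""

    def __init__(self):
        self.total_ms = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.total_ms[name] += (time.perf_counter() - start) * 1000.0
            self.calls[name] += 1

    def mean_ms(self, name):
        return self.total_ms[name] / self.calls[name]


timer = StageTimer()
for _ in range(5):
    with timer.stage("preprocess"):
        time.sleep(0.002)  # stand-in for real work
print(f"preprocess: {timer.mean_ms('preprocess'):.1f} ms")
print(f"{fps_from_ms(63.5):.1f} FPS")  # 1000 / 63.5 -> 15.7 FPS
```

One caveat when timing the model stage: CUDA calls return before the GPU has finished, so call torch.cuda.synchronize() inside the timed block, otherwise the measured time will be too short.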

2. Optimize Preprocessing

In preprocessing, we convert the webcam frame from UInt8 HWC format to a tensor representation (Float32 NCHW format, where N is the number of examples, C is the number of channels, H is the height and W is the width of the image). In the original implementation, we use the TorchVision transforms with the Pillow backend to achieve this. This implementation takes roughly 8.6 milliseconds per frame. We can speed up some of these ops by replacing Pillow with OpenCV + NumPy ops. Upon further investigation, we observed that converting UInt8 to Float32 is the most costly step in the preprocessing phase. But transferring Float32 to GPU is faster than transferring UInt8.

Rule #2: Minimize data transfers between the CPU and GPU. Note also that GPU clock speeds are lower than the CPU's. So, design your computations wisely!

After testing a few ideas, we concluded that transferring the UInt8 data to the GPU and converting it to Float32 there is much faster than converting UInt8 to Float32 on the CPU and then transferring the Float32 data to the GPU. The following code block summarizes this idea.

# Preprocess
frame = frame.swapaxes(1, 2).swapaxes(0, 1) # HWC -> CHW
frame = frame[np.newaxis, :, :, :]          # CHW -> Numpy NCHW
content_image = torch.from_numpy(frame)     # Numpy -> Torch Tensor
content_image = content_image.cuda()        # CPU (UInt8) -> GPU (Byte)
content_image = content_image.type(torch.cuda.FloatTensor)

A detailed explanation of the above ops, with runtimes, can be found in the notebook. In the next section, we will see how we optimized the post-processing phase of the pipeline.
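To check the preprocessing claim on your own machine, a small benchmark along these lines can help. This is a minimal sketch: the function names are made up for illustration, the frame is random data at webcam resolution, and the code falls back to the CPU when no GPU is present (in which case the comparison is not meaningful).

```python
import time

import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def convert_then_transfer(frame):
    """UInt8 -> Float32 on the CPU, then copy the Float32 tensor to the device."""
    chw = np.ascontiguousarray(frame.transpose(2, 0, 1))  # HWC -> CHW
    return torch.from_numpy(chw).float().unsqueeze(0).to(device)


def transfer_then_convert(frame):
    """Copy the UInt8 tensor to the device, then convert to Float32 there."""
    chw = np.ascontiguousarray(frame.transpose(2, 0, 1))  # HWC -> CHW
    return torch.from_numpy(chw).unsqueeze(0).to(device).float()


def mean_ms(fn, frame, repeats=50):
    """Average runtime of fn(frame) in milliseconds."""
    fn(frame)  # warm-up
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn(frame)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / repeats


frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
print(f"convert -> transfer: {mean_ms(convert_then_transfer, frame):.2f} ms")
print(f"transfer -> convert: {mean_ms(transfer_then_convert, frame):.2f} ms")
```

Both functions return the same Float32 NCHW tensor; only the order of the cast and the copy differs.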

3. Optimize Post-processing

Thanks to Python, our entire post-processing can be written in a single line of code: output = output.cpu().detach()[0].clamp(0, 255).numpy().transpose(1,2,0).astype("uint8"). Here is what this line does:

  • cpu() – Copy the Float32 tensor from GPU to CPU.
  • detach()[0] – Detach the tensor from the autograd graph so its gradients are not tracked, then take the first item of the batch (NCHW to CHW).
  • clamp(0, 255) – Clamp the values to the valid range of image pixel values.
  • numpy() – Convert the Torch tensor to a NumPy array.
  • transpose(1,2,0) – A variant of swapaxes which converts CHW to HWC.
  • astype("uint8") – NumPy’s way of changing the data type of the array.

This initial implementation takes roughly 4 milliseconds. That might seem fast, but if we order the ops properly, we can get the entire post-processing step down to 0.5 milliseconds. Here is how: clamp() is a per-element operation which can be trivially parallelized, so simply push this op to the GPU. From the preprocessing optimization, we have seen that type conversions are faster on the GPU than on the CPU. So, convert the tensor from Float to Byte on the GPU before transferring it to the CPU. Finally, do the rest of the ops on the CPU. Here is all of that in a single line of code – output = output.clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach()[0].numpy().transpose(1,2,0).
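A quick way to convince yourself that the reordering does not change the result is to run both variants on the same tensor. The sketch below uses made-up function names and a random tensor in place of a real model output. It also swaps .type(torch.cuda.ByteTensor) for .to(torch.uint8), which is an assumption made so that the example still runs on a machine without a GPU.

```python
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def postprocess_on_cpu(output):
    """Original order: copy Float32 to the CPU, then clamp and cast there."""
    return output.cpu().detach()[0].clamp(0, 255).numpy().transpose(1, 2, 0).astype("uint8")


def postprocess_on_device(output):
    """Optimized order: clamp and cast on the device, then copy UInt8 to the CPU."""
    return output.detach().clamp(0, 255).to(torch.uint8).cpu()[0].numpy().transpose(1, 2, 0)


# Stand-in for a model output: NCHW Float32 with values outside [0, 255].
fake_output = torch.randn(1, 3, 48, 64, device=device) * 200.0 + 128.0
slow = postprocess_on_cpu(fake_output)
fast = postprocess_on_device(fake_output)
print(slow.shape, fast.dtype)       # (48, 64, 3) uint8
print(np.array_equal(slow, fast))   # both orders should give the same image
```

The optimized order also moves four times fewer bytes across the bus, since the tensor is already UInt8 when it is copied.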

4. Async Webcam Frame Extraction

Know your hardware limits! A typical consumer webcam provides frames at only 30 FPS, and there is no way to speed this up. But by moving webcam frame extraction to a separate thread, you can potentially remove the overhead of waiting for frames from the webcam. That can save us up to 33.3 milliseconds per frame, which can be used for other costly operations such as model inference. See the VideoCaptureAsync class for the implementation. There are many other ways to implement asynchronous webcam frame extraction, but this article, LINK, does a really good job of explaining it.

Rule #3: Try to keep data loading on a separate thread. You can use that time for heavy compute such as model inference!
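For readers who want to see the idea in isolation, here is a minimal threaded reader. It is a sketch and not the project's VideoCaptureAsync: AsyncFrameReader and FakeCamera are hypothetical names, and the fake camera only exists so that the example runs without hardware. Any object with a read() method returning (ok, frame), such as cv2.VideoCapture(0), could be passed in instead.

```python
import threading
import time


class AsyncFrameReader:
    """Keep reading from `source` on a background thread; hand out the latest frame."""

    def __init__(self, source):
        # `source` is anything with a read() -> (ok, frame) method,
        # e.g. cv2.VideoCapture(0).
        self._source = source
        self._lock = threading.Lock()
        self._ok, self._frame = source.read()  # block once so read() always has a frame
        self._running = False
        self._thread = None

    def start(self):
        self._running = True
        self._thread = threading.Thread(target=self._update, daemon=True)
        self._thread.start()
        return self

    def _update(self):
        while self._running:
            ok, frame = self._source.read()  # blocks until the camera delivers a frame
            with self._lock:
                self._ok, self._frame = ok, frame

    def read(self):
        # Returns immediately with the most recent frame (a frame may repeat).
        with self._lock:
            return self._ok, self._frame

    def stop(self):
        self._running = False
        if self._thread is not None:
            self._thread.join()


class FakeCamera:
    """Stand-in for a 30 FPS webcam so the sketch runs without hardware."""

    def __init__(self):
        self.count = 0

    def read(self):
        time.sleep(1 / 30)
        self.count += 1
        return True, self.count


reader = AsyncFrameReader(FakeCamera()).start()
start = time.perf_counter()
ok, frame = reader.read()
elapsed_ms = (time.perf_counter() - start) * 1000.0
reader.stop()
print(ok, f"read() returned in {elapsed_ms:.3f} ms")
```

The main loop never waits for the camera: read() hands back whatever frame arrived last, so the same frame can be processed twice when inference runs faster than the webcam.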

5. Putting it all together

Here is the final reference implementation combining everything we discussed above. This implementation speeds up our inference from 15.7 FPS to 21.3 FPS. That might not look like a big speed-up, but we have optimized almost every phase of the inference pipeline except the model architecture. I consider model optimization a somewhat advanced topic and am leaving it for future blogs.

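import cv2
import numpy as np
import torch

# Load model in eval mode
# (The loading code is missing here; `model` is assumed to be the trained
# style-transfer network, already moved to the GPU.)
model.eval()

# (content_transform is no longer needed; preprocessing is done by hand below.)

# Initialize the camera - Async
# (VideoCaptureAsync is the project's threaded capture class; it is assumed
# to expose the same read() interface as cv2.VideoCapture.)
camera = VideoCaptureAsync(0)

with torch.no_grad():
    while True:
        # Fetch
        _, frame = camera.read()
        # Preprocess (Optimized)
        frame = frame.swapaxes(1, 2).swapaxes(0, 1)  # HWC -> CHW
        frame = frame[np.newaxis, :, :, :]           # CHW -> NCHW
        content_image = torch.from_numpy(frame)
        content_image = content_image.cuda()         # transfer UInt8 to the GPU
        content_image = content_image.type(torch.cuda.FloatTensor)
        # Predict
        output = model(content_image)
        # Postprocess - Optimized
        output = output.clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach()[0].numpy().transpose(1,2,0)
        # Render results
        cv2.imshow('Frame', output)
        k = cv2.waitKey(1)
        if k==27:  # Esc key
            break

cv2.destroyAllWindows()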

You can further optimize this implementation by using either CUDA Streams or asyncio, but I am leaving those topics for the reader to explore. Also, if you design your model architecture to get the most benefit out of CuDNN, you can reach upwards of 47 FPS. I am leaving this as an exercise for readers. The following video shows the style transfer application running in real time. Please select “HD” before playing, as WordPress’s default video encoding degrades the quality.

Real-time Style Transfer (640×480 @ 47 FPS) [NOTE: Play in HD for better experience]

Thank you for reading this post.

The source code for this blog is available at LINK.

