In our previous blog, we showed how to create our own mini Deep Learning pipeline to train models using PyTorch. MNIST is great for rapidly prototyping and testing low-level ideas in Deep Learning, and it needs very little compute to train and test models. In this blog, let us move on from MNIST and try another interesting Deep Learning challenge, Style Transfer, and look at some stats on how we can achieve real-time inference on a desktop using only Python.
NOTE: If you would like to dive into code right away, the code for this project is available at - LINK.
For solving Artistic Style Transfer using PyTorch, let us use a larger dataset such as MS-COCO. Since there are many interesting blogs/articles online explaining Style Transfer, I would like to focus this blog on certain tweaks that improve performance in both training and inference. The rest of this blog is organized as follows: we will quickly go through a basic definition of Style Transfer; then we will take the code provided by the PyTorch examples and convert it into the pipeline we discussed in the ‘Intro To PyTorch’ blog; we will then quickly train the model with minimal hyper-parameter tuning and save the trained model. Finally, we will load this saved model in inference mode and use a webcam feed to perform (maybe) real-time style transfer.
What is Style Transfer?
As the picture shows, in the style transfer application we train a network to convert the input (content) image into the desired style. If you are interested in learning more about this, please feel free to read the PyTorch tutorial on Neural Style.
- Clone the repo: StyleTransfer-PyTorch
- Download the dataset from the MS-COCO website, put it in the data/ directory, and use styles of your choice.
- Your directory structure should look like this before training.
.
├── main.py                 (Entry point to the application)
├── net.py                  (Net class for init/train/test DL models)
├── models/                 (Directory containing different DL arch.)
|   ├── __init__.py
|   ├── transformer_net.py  (Style Transfer Network)
|   ├── vgg.py
|   └── ...
├── loaders/                (Directory containing data/model loaders)
|   ├── __init__.py
|   ├── data_loader.py      (DataLoader class for loading data)
|   └── model_loader.py     (ModelLoader class for loading models)
├── data/                   (Directory containing data)
|   ├── coco/
|   |   └── train2014/      (Content images!)
|   └── styles/             (Directory for styles)
├── LICENSE                 (License of your choice)
└── README.md               (Proper documentation for Setup, Running & Results)
- Create the Anaconda environment using the instructions given in - Intro To PyTorch.
- Run python main.py --phase train to train the model.
A few things to consider for faster training -
- Batch Size: You can play with the train_batch_size argument for faster training.
- Workers for Data Loading: (NOTE: It doesn’t work in Windows). You can increase the num_workers argument to increase the number of concurrent workers for data pre-processing.
- (Optional) Image Pre-processing backend: (NOTE: Might involve a hectic setup in Windows. An alternative is to use OpenCV instead of Pillow.) In PyTorch, we use the TorchVision module to ease image pre-processing. By default, TorchVision uses the Pillow backend. You can replace it with Pillow-SIMD to make your image pre-processing faster.
- CuDNN backend: Make sure to set your backend to CuDNN if you are running your training on an Nvidia GPU. Also, set the CuDNN benchmark flag to True for the best performance in both training and inference. The following code block does this.
if torch.cuda.is_available():
    # Sanity check - Empty CUDA Cache
    torch.cuda.empty_cache()
    # Enforce CUDNN backend
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.enabled = True
On an RTX 2060 GPU, training for 2 epochs with a batch size of 6 takes roughly 40 minutes.
Please note that the above numbers are for Windows 10. In my case, the bottleneck is data loading. You can speed up the DataLoader either by setting num_workers > 1 or by storing mini-batches of pre-processed intermediate tensor representations of images in h5py format. Since the DataLoader topic is outside the scope of this blog, we can review it in the future.
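As a rough illustration of the two data-loading knobs mentioned above, here is a minimal sketch built on `torch.utils.data.DataLoader` with a synthetic dataset. The helper name `build_loader` and the mapping of `train_batch_size` onto `batch_size` are assumptions made for this example; the project's own data_loader.py may wire these arguments differently.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(dataset, train_batch_size=6, num_workers=0):
    """Hypothetical helper: wrap a dataset in a DataLoader.

    num_workers=0 loads data in the main process; larger values start
    that many worker processes for pre-processing.
    """
    return DataLoader(
        dataset,
        batch_size=train_batch_size,
        shuffle=True,
        num_workers=num_workers,
        # Pinned host memory can speed up CPU -> GPU copies.
        pin_memory=torch.cuda.is_available(),
    )


# Synthetic stand-in for content images: 20 RGB images of 64x64 pixels.
images = torch.rand(20, 3, 64, 64)
loader = build_loader(TensorDataset(images), train_batch_size=6)
batch_shapes = [tuple(batch[0].shape) for batch in loader]
```

On platforms that start worker processes by spawning (Windows, for example), code that iterates a loader with num_workers > 0 generally needs to sit under an `if __name__ == "__main__":` guard.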
Here comes my favorite part! Real-time inference with Deep Learning models is probably one of the least covered areas in open-source blogs. Honestly, I have seen very little documentation on this topic online. Since this area deserves some attention, considering the growing adoption of Deep Learning models in production, I thought it would be a good idea to discuss it. In the following sections, we will start with a very simple inference pipeline and tweak each element of it to get the desired real-time experience. Since these topics can be new to a lot of people, I will try to keep things concise and clear. Further, to keep things simple, I am implementing everything using only Python.
Rule #1: If you are using a laptop instead of a desktop or cloud VM, please make sure that it is connected to a constant power source and running in High Performance mode. Running the laptop on battery might limit the power supplied to the GPU, which can lead to GPU throttling.
Simple Inference Pipeline - Webcam
Let us write a simple inference pipeline using the webcam feed as input to the style transfer model. The inference pipeline for this scenario should look as follows:
Initialize Webcam
while True:
    FETCH frame from webcam
    PREPROCESS the frame
    INFERENCE by passing frame into DL Model
    RENDER results on screen
The final code for this section is available at - LINK.
1. Initial Inference Pipeline
import cv2
import torch
from torchvision import transforms

# `model` is assumed to be the trained style transfer network, already on the GPU
# Load model in eval mode
model.eval()

# Setup content transform
content_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.mul(255))
])

# Initialize the camera
camera = cv2.VideoCapture(0)

with torch.no_grad():
    while True:
        # Fetch
        _, frame = camera.read()

        # Preprocess
        content_image = content_transform(frame)
        content_image = content_image.unsqueeze(0).cuda()

        # Predict
        output = model(content_image)

        # Postprocess (output[0] drops the batch dimension: NCHW -> CHW)
        output = output[0].cpu().detach().clamp(0, 255).numpy().transpose(1, 2, 0).astype("uint8")

        # Render results
        cv2.imshow('Frame', output)
        k = cv2.waitKey(1)
        if k == 27:  # Esc key
            break

camera.release()
If you run the above block, you will achieve a performance of roughly 15.7 FPS (i.e., roughly 63.5 milliseconds per frame). There are a few ways to optimize this: either optimize the model architecture by trying different layer combinations, layer fusions, etc., or first check how well optimized your end-to-end pipeline is. In this blog, we will focus on optimizing the end-to-end pipeline. A detailed analysis of all the experiments is provided in the Jupyter Notebook - LINK.
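To make the FPS and per-frame numbers easy to reproduce, here is a small sketch of the arithmetic along with a per-stage timer. The names `ms_to_fps`, `fps_to_ms` and `StageTimer` are hypothetical helpers written for this post, not part of the project code. Because CUDA kernels are launched asynchronously, pass something like `torch.cuda.synchronize` as `sync` when timing GPU stages; otherwise the timer may only measure the kernel launch.

```python
import time
from collections import defaultdict


def ms_to_fps(ms_per_frame):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


def fps_to_ms(fps):
    """Convert frames per second to a per-frame latency in milliseconds."""
    return 1000.0 / fps


class StageTimer:
    """Hypothetical helper that accumulates wall-clock time per named stage."""

    def __init__(self, sync=None):
        # sync: optional callable run before each clock read, e.g.
        # torch.cuda.synchronize, so that pending GPU work is included.
        self.sync = sync
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def time(self, name, fn, *args):
        """Run fn(*args), add its duration to stage `name`, return its result."""
        if self.sync:
            self.sync()
        start = time.perf_counter()
        result = fn(*args)
        if self.sync:
            self.sync()
        self.totals[name] += time.perf_counter() - start
        self.counts[name] += 1
        return result

    def mean_ms(self, name):
        """Average duration of stage `name` in milliseconds."""
        return 1000.0 * self.totals[name] / self.counts[name]


# Toy usage: time a cheap CPU function five times.
timer = StageTimer()
for _ in range(5):
    timer.time("preprocess", sum, range(1000))
```

In the webcam loop, each stage (fetch, preprocess, inference, postprocess, render) could be wrapped in `timer.time(...)` to see where the 63.5 milliseconds per frame go.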
2. Optimize Preprocessing
In preprocessing, we convert the webcam frame from UInt8 HWC format to a tensor representation (Float32 NCHW format, where N is the number of examples, C = Number of Channels in the image, H = Height of the image, W = Width of the image). In the original implementation, we use the TorchVision transforms with the Pillow backend to achieve this. This implementation takes roughly 8.6 milliseconds per frame. We can speed up some of these Ops by replacing Pillow with OpenCV+Numpy Ops. Upon further investigation, we observed that converting UInt8 to Float32 is the most costly step in the preprocessing phase. In our measurements, though, transferring Float32 to the GPU was faster than transferring UInt8.
Rule #2: Minimize the data transfers between CPU and GPU. Note also that GPU clock speeds are lower than CPU clock speeds. So, design your computations wisely!
After testing a few ideas, we concluded that transferring the UInt8 data to the GPU and converting it to Float32 on the GPU is much faster than converting UInt8 to Float32 on the CPU and then transferring the Float32 data to the GPU. The following code block summarizes this idea.
# Preprocess
frame = frame.swapaxes(1, 2).swapaxes(0, 1)  # HWC -> CHW
frame = frame[np.newaxis, :, :, :]           # CHW -> Numpy NCHW
content_image = torch.from_numpy(frame)      # Numpy -> Torch Tensor
content_image = content_image.cuda()         # CPU (UInt8) -> GPU (Byte)
content_image = content_image.type(torch.cuda.FloatTensor)  # Byte -> Float32 on GPU
A detailed explanation of above ops with runtimes can be seen in the following notebook. In the next section, we will see how we optimized the post-processing phase of the pipeline.
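To make the two orderings concrete, here is a minimal sketch that builds the same Float32 NCHW tensor both ways. The function names `preprocess_cast_on_cpu` and `preprocess_cast_on_device` are hypothetical, and the random array stands in for a 640x480 webcam frame. The sketch falls back to the CPU when no GPU is present, in which case the two paths differ only in the order of operations.

```python
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def preprocess_cast_on_cpu(frame, device):
    """Convert UInt8 -> Float32 on the CPU, then transfer the Float32 data."""
    chw = torch.from_numpy(frame.transpose(2, 0, 1))  # HWC -> CHW (a view)
    return chw.unsqueeze(0).float().to(device)        # cast first, then copy


def preprocess_cast_on_device(frame, device):
    """Transfer the UInt8 data, then convert UInt8 -> Float32 on the device."""
    chw = torch.from_numpy(frame.transpose(2, 0, 1))  # HWC -> CHW (a view)
    return chw.unsqueeze(0).to(device).float()        # copy first, then cast


# Random stand-in for a webcam frame in UInt8 HWC format.
frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
a = preprocess_cast_on_cpu(frame, device)
b = preprocess_cast_on_device(frame, device)
```

Both functions return identical tensors; only the place where the type conversion runs changes. Wrapping each call in a timer on your own machine is the easiest way to check which ordering is faster for your hardware.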
3. Optimize Post-processing
Thanks to Python, our entire post-processing can be written in a single line of code -
output = output[0].cpu().detach().clamp(0, 255).numpy().transpose(1, 2, 0).astype("uint8")
Here is what we are doing in this line of code:
- output[0] - Select the single image in the batch (NCHW -> CHW).
- cpu() - Copy the Float32 Tensor from GPU to CPU.
- detach() - Don’t track its gradients (refer this).
- clamp(0, 255) - Clamp the values to the valid range of image pixel values.
- numpy() - Convert it from a Torch Tensor to a Numpy Array.
- transpose(1,2,0) - Reorder the axes from CHW to HWC.
- astype("uint8") - Numpy’s way to change the data-type of the Array.
This initial implementation takes roughly 4 milliseconds. That might seem fast, but if we design the ops properly, we can get the entire post-processing to run in 0.5 milliseconds. Here is how we can do this -
clamp() is a per-element operation which can be trivially parallelized! Simply push this Op to GPU. From preprocessing optimization, we have seen that type-conversions are faster on GPU than CPU. So, simply convert the tensor from Float to Byte on GPU before transferring it to CPU. Finally, do the rest of the ops on CPU. Here is the summary of what we said in a single line of code -
output = output[0].clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach().numpy().transpose(1, 2, 0)
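The same idea can be written as a small device-agnostic function, sketched below. The name `postprocess` is hypothetical, and the tiny hand-made tensor stands in for a real model output. Since UInt8 uses 1 byte per value and Float32 uses 4, casting before the copy also makes the device-to-host transfer four times smaller.

```python
import numpy as np
import torch


def postprocess(output):
    """Turn a (1, C, H, W) float tensor into an (H, W, C) uint8 NumPy image."""
    image = output.detach().squeeze(0)             # drop the batch dimension
    image = image.clamp(0, 255).to(torch.uint8)    # runs on the tensor's device
    return image.cpu().numpy().transpose(1, 2, 0)  # small copy, then CHW -> HWC


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Fake 2x2 "RGB" output containing values outside the valid pixel range.
output = torch.tensor([-10.0, 0.0, 128.0, 300.0], device=device)
output = output.reshape(1, 1, 2, 2).repeat(1, 3, 1, 1)
image = postprocess(output)
```

The resulting array can be passed straight to `cv2.imshow`, which expects a UInt8 image in HWC layout.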
4. Async Webcam Frame Extraction
Know your hardware limits! A typical consumer webcam provides frames at only 30 FPS, and there is no way to speed this up. But by running the webcam frame extraction on a separate thread, you can potentially remove the overhead of fetching frames from the webcam. That can save us up to 33.3 milliseconds per frame (the interval between frames at 30 FPS), which can be used for other costly operations such as model inference. See VideoCaptureAsync for the implementation. There are many other ways to implement asynchronous webcam frame extraction, but this article, LINK, does a really good job of explaining it.
Rule #3: Try to keep data loading on a separate thread. You can use that time for heavy compute such as model inference!
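For readers who want to see the shape of this idea without opening the repository, here is a minimal sketch of a background reader that always holds the newest frame. The class name `LatestFrameReader` is hypothetical and is not the VideoCaptureAsync implementation; it accepts any blocking `read` callable, such as `cv2.VideoCapture(0).read`, and a fake source is used here so the example runs without a webcam.

```python
import threading
import time


class LatestFrameReader:
    """Hypothetical sketch: keep only the newest frame from a blocking source."""

    def __init__(self, read_fn):
        # read_fn: callable returning (ok, frame), e.g. cv2.VideoCapture(0).read
        self._read_fn = read_fn
        self._lock = threading.Lock()
        self._latest = (False, None)
        self._running = False
        self._thread = None

    def start(self):
        self._running = True
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()
        return self

    def _loop(self):
        while self._running:
            result = self._read_fn()  # blocks until the next frame arrives
            with self._lock:
                self._latest = result

    def read(self):
        with self._lock:              # returns immediately with the newest frame
            return self._latest

    def stop(self):
        self._running = False
        self._thread.join()


# Stand-in for a webcam: yields an increasing counter roughly every 5 ms.
counter = iter(range(10**9))


def fake_read():
    time.sleep(0.005)
    return True, next(counter)


reader = LatestFrameReader(fake_read).start()
time.sleep(0.1)             # the main thread is free to do other work here
ok, frame = reader.read()
reader.stop()
```

The main loop never waits on the camera: `read()` hands back whatever frame arrived last, so the time the camera spends producing the next frame overlaps with model inference.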
5. Putting it all together
Here is the final reference implementation combining everything we discussed above. This implementation speeds up our inference from 15.7 FPS to 21.3 FPS, which might not look like a big speed-up, but we have optimized almost every phase of the inference pipeline except the model architecture. I consider model optimization a somewhat advanced topic and am leaving it for future blogs.
import cv2
import numpy as np
import torch

# `model` is assumed to be the trained style transfer network, already on the GPU;
# VideoCaptureAsync is the threaded capture class discussed in the previous section
# Load model in eval mode
model.eval()

# Initialize the camera - Async
camera = VideoCaptureAsync(0)
camera.start()

with torch.no_grad():
    while True:
        # Fetch
        _, frame = camera.read()

        # Preprocess (Optimized)
        frame = frame.swapaxes(1, 2).swapaxes(0, 1)  # HWC -> CHW
        frame = frame[np.newaxis, :, :, :]           # CHW -> Numpy NCHW
        content_image = torch.from_numpy(frame)
        content_image = content_image.cuda()
        content_image = content_image.type(torch.cuda.FloatTensor)

        # Predict
        output = model(content_image)

        # Postprocess - Optimized (output[0] drops the batch dimension)
        output = output[0].clamp(0, 255).type(torch.cuda.ByteTensor).cpu().detach().numpy().transpose(1, 2, 0)

        # Render results
        cv2.imshow('Frame', output)
        k = cv2.waitKey(1)
        if k == 27:  # Esc key
            break

camera.release()
You can further optimize this implementation by using either CUDA
asyncio, but I am leaving those topics for the reader to explore. Also, if you can design your model architecture so that it gets the most benefit out of CuDNN, you can achieve more than 47 FPS. I am leaving this as an exercise for readers to experiment with.