Neural Style Transfer, Classically
For the covers of my works, I happen to use style transfer based on a painting I really like. A free web app has been helping me with this, but sometimes its results are poor. Plus, I worry that at some point it might go paid or start bombarding me with ads. So I started looking for a more reliable alternative.
My first thought was to use ChatGPT and its built-in tools. But however promising the results looked, it sometimes ran out of resources or suspected me of improper use of the photos. I could have tried another AI service or found a specialized solution, but then a second thought hit me.
The second thought - why not build my own? The task seems simple: apply the style from one image to another. The result would be an in-house solution and a better understanding of how it works “under the hood”. Long story short, I launched my IDE and started writing in Python, reminiscing about my university days.
What We’ve Got
Aside from enthusiasm, here’s what we’re working with:
▪Style image - the image from which we extract the style
▪Content image - the image we apply the style onto
▪Output image - since we have reference results from the web app, there’s a clear expectation of what the result should look like
▪Apple M1 Pro - as a workhorse machine

That’s the expected visual outcome
Since this falls under the umbrella of ML (machine learning), Python is a must. But as it turns out, the task doesn’t really require deep Python skills - especially with tools like Cursor AI in your corner.
Main Approaches
In machine learning, there are several ways to tackle the image stylization task. Here are the main ones:
▪Classic Neural Style Transfer (NST) uses deep convolutional neural networks (CNNs, such as VGG-19) to separately extract the content and style of an image. Then, through optimization (using L-BFGS or Adam), it generates a new image that minimizes the difference in style while preserving the content.
▪Fast Style Transfer replaces optimization with a trained CNN that can instantly apply a style to any image. This approach involves building a feed-forward generative network that’s pre-trained on a specific style.
▪Arbitrary Style Transfer uses Adaptive Instance Normalization (AdaIN) to dynamically adapt a style to an image on the fly - no need to train a new model each time.
▪Diffusion Models - generative models like Stable Diffusion can produce stylized images based on references, combining them with text prompts, masks, scene depth, and other conditions. This is where the future lies (think DALL-E, for example).
Lots of big words and acronyms, but this list becomes much simpler when visualized in a comparison table:

For a beginner ML engineer, classic NST is a great place to start - no need for hours of GPU training or deep knowledge of generative adversarial networks. Plus, this approach is perfect for early experimentation.
A Bit of Classic NST Theory
Classic Neural Style Transfer (NST), first introduced in 2015 (here’s the paper), uses a pre-trained VGG-19 network to extract the content and style representations of an image. VGG-19 is a deep convolutional neural network (CNN) developed by the Oxford Visual Geometry Group (VGG). It was trained on the massive ImageNet dataset for object recognition.
Here are the key steps of NST:
1.Load the images (content and style)
2.Pass them through VGG-19 to extract content and style features
3.Initialize the output image (either from the content image or random noise)
4.Compute the total loss function
5.Update the pixels using gradient descent
6.Repeat for 300–1000 iterations until the image becomes stylized
VGG-19 consists of two parts: convolutional layers (which extract image features) and fully connected layers (which perform classification). NST uses only the convolutional layers, since we’re interested in features related to content and style, not classification. And not even all the layers - only selected ones. The model acts purely as a feature extractor; its weights remain unchanged.
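As a rough sketch, the feature-extraction part can look like this. The layer indices refer to torchvision's vgg19.features, and picking conv4_2 as the content layer follows the original paper - that choice is my assumption here, not necessarily what the linked source does:

```python
import torch
import torchvision.models as models

# Load pre-trained VGG-19 and keep only the convolutional part.
# The weights are frozen - the network is used purely as a feature extractor.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# Indices of the selected layers inside vgg19.features
# (conv4_2 for content, conv1_1..conv5_1 for style).
CONTENT_LAYERS = {21: "conv4_2"}
STYLE_LAYERS = {0: "conv1_1", 5: "conv2_1", 10: "conv3_1", 19: "conv4_1", 28: "conv5_1"}

def extract_features(image, layers):
    """Run the image through VGG-19 and collect activations of the selected layers."""
    features = {}
    x = image
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            features[layers[i]] = x
    return features
```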
The loss function in NST combines content from one image with the style of another, and consists of two main components:
▪Content Loss - Goal: preserve the structure of objects
▪Style Loss - Goal: transfer texture, color, and brush strokes
Additionally, Total Variation Loss is sometimes used to smooth the output and reduce small artifacts.
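A minimal sketch of those components, assuming the usual MSE formulation for content and the Gram-matrix formulation for style:

```python
import torch.nn.functional as F

def content_loss(gen_feat, content_feat):
    # Preserve the structure: compare activations of a deep content layer directly.
    return F.mse_loss(gen_feat, content_feat)

def gram_matrix(feat):
    # Correlations between feature maps capture texture, color, and brush strokes.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(gen_feat, style_feat):
    return F.mse_loss(gram_matrix(gen_feat), gram_matrix(style_feat))

def tv_loss(img):
    # Total Variation: penalize differences between neighboring pixels to smooth out artifacts.
    return (img[..., :, 1:] - img[..., :, :-1]).abs().mean() + \
           (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
```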
Gradient descent is the optimization algorithm that minimizes the loss function - it iteratively updates the image pixels. The two most popular optimizers for NST are L-BFGS and Adam. So, in NST, it’s not the model’s weights that are optimized - it’s the pixels!
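Putting it together, the loop is roughly this (a sketch using L-BFGS; alpha and beta are the loss weights discussed below, and content_img / style_img are preprocessed image tensors - loading them is sketched in the next section):

```python
# The only trainable "parameter" is the image itself - the VGG weights stay frozen.
generated = content_img.clone().requires_grad_(True)
optimizer = torch.optim.LBFGS([generated])

content_target = extract_features(content_img, CONTENT_LAYERS)
style_target = extract_features(style_img, STYLE_LAYERS)

def closure():
    optimizer.zero_grad()
    gen_content = extract_features(generated, CONTENT_LAYERS)
    gen_style = extract_features(generated, STYLE_LAYERS)
    loss = alpha * content_loss(gen_content["conv4_2"], content_target["conv4_2"])
    loss = loss + beta * sum(
        style_loss(gen_style[name], style_target[name]) for name in STYLE_LAYERS.values()
    )
    loss.backward()
    return loss

for _ in range(150):   # outer steps; L-BFGS runs several internal evaluations per step
    optimizer.step(closure)
```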
A Bit About the Implementation
To implement this, you'll need Python (version 3.9 or later), torch - the core of the PyTorch framework, and torchvision - an add-on for working with images.
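Loading and preprocessing the inputs takes a few lines of torchvision. A sketch - the 512-pixel size and the file names are my assumptions, while the normalization constants are the standard ImageNet statistics VGG-19 expects:

```python
from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize(512),
    T.CenterCrop(512),
    T.ToTensor(),
    # VGG-19 was trained on ImageNet, so inputs are normalized with ImageNet statistics.
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def load_image(path):
    # Returns a 1x3x512x512 tensor ready to be fed into VGG-19.
    return preprocess(Image.open(path).convert("RGB")).unsqueeze(0)

content_img = load_image("content.jpg")   # hypothetical file names
style_img = load_image("style.jpg")
```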
I won't go into writing the code from scratch - there are plenty of examples and tutorials out there. You can find the source code here, and below are the parameters I used:

With L-BFGS running for 150 iterations and Adam for 1500, I was able to achieve a decent result.

That’s what I’d call a decent result
You could easily fork an existing implementation (this one in Lua), but I just wanted to understand the process by stepping on a few rakes myself. Next up: a summary of my experiments on the Boromirs.
Alpha and Beta
In NST, the parameters α (alpha) and β (beta) control the balance between content and style in the final image. In other words, α is the weight of the content loss, and β is the weight of the style loss.
The absolute values themselves don’t matter much - what matters is the α / β ratio. That’s why it’s common to fix α = 1 and tweak only β. The effect of style transfer becomes noticeable once β is large enough, typically starting around 1e4. If β is too high, the image gradually loses the structure of the original content - but does so beautifully.

How α / β affects the result
A value of 1e6 turned out to strike a pretty good balance.
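In code, the whole tuning boils down to two constants (a tiny sketch; content_term and style_term are the loss components from the sketches above):

```python
# Only the ratio matters, so fix the content weight and tune the style weight.
alpha = 1.0
beta = 1e6   # stylization becomes noticeable around 1e4; 1e6 worked best for me

total_loss = alpha * content_term + beta * style_term
```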
The Contribution of Style Features
In style loss computation, you can control how much each convolutional layer contributes to the final stylization. Each layer handles something different:
▪Early layers (conv1_1, conv2_1) - handle fine textures, colors, and brush strokes
▪Mid layers (conv3_1, conv4_1) - pick up more complex patterns and shapes
▪Deep layers (conv5_1) - focus on the overall composition and global stylistic structure
Of course, I immediately wanted to crank β up to 1e9 and zero out the weights on all layers except one - just to see the effect of each individual style feature in isolation.

If you transfer only one style feature
The following conclusions can be drawn:
▪conv1_1 - Bright textural effects; lots of sharp strokes and color artifacts; shapes are distorted. This layer is very sensitive to local textures and colors but doesn’t preserve object form.
▪conv2_1 - Smoother effect; textures are still there, but softer; outlines are better preserved, though the style still has a strong presence. This layer balances between texture and form.
▪conv3_1 - Large strokes and patterns appear; outlines become clearer; the image looks painterly, but without chaotic texture. This layer captures both texture and structural style, creating a more balanced transfer.
▪conv4_1 - Minimal texture effects; overall shapes are well preserved; the style shows mostly through color shifts and lighting. This layer focuses more on composition and gives a soft stylized feel.
▪conv5_1 - Very weak style transfer, with slight noise. The layer captures global characteristics but doesn’t bring in strong textures.
For painterly artworks, these layer weights work well:
▪conv1_1 = 1.0 - Maximum weight for strong textures and brushstrokes
▪conv2_1 = 0.75 - Slightly lower, to avoid overly harsh details
▪conv3_1, conv4_1, conv5_1 = 0.2 - Lower weights to preserve the structure of the image
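In code, these are just per-layer multipliers inside the style term (a sketch reusing the style_loss helper and the feature dictionaries from above):

```python
# Per-layer contribution to the style loss (the weights from the list above).
STYLE_WEIGHTS = {
    "conv1_1": 1.0,
    "conv2_1": 0.75,
    "conv3_1": 0.2,
    "conv4_1": 0.2,
    "conv5_1": 0.2,
}

style_term = sum(
    STYLE_WEIGHTS[name] * style_loss(gen_style[name], style_target[name])
    for name in STYLE_WEIGHTS
)
```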
Different Optimizers
As mentioned earlier, the core task in classic NST is optimizing the pixels of the generated image. There are two main optimizers used for this:
▪L-BFGS - converges quickly and accurately while consuming less memory; particularly effective when optimizing the image directly.
▪Adam - an adaptive optimizer known for its stability, though it requires more memory and more iterations; it’s well-suited for GPU use and offers more flexibility.
I didn’t really test Adam’s flexibility - just went with GPT's recommended parameters and it worked by the third try. But the convergence speed was interesting to compare. While L-BFGS delivers a decent result in just 100–150 iterations, Adam typically needs 1500 or more. That said, Adam’s iterations are roughly 20x faster, at least on Apple M1 Pro.
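For comparison, switching the earlier sketch from L-BFGS to Adam only changes the optimizer and the loop; the learning rate below is an assumption, not necessarily what I ended up using:

```python
generated = content_img.clone().requires_grad_(True)
optimizer = torch.optim.Adam([generated], lr=0.02)   # lr is an assumption

for step in range(1500):                 # Adam needs roughly 10x more iterations
    optimizer.zero_grad()
    loss = total_loss_fn(generated)      # hypothetical wrapper around the loss sketches above
    loss.backward()
    optimizer.step()
```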

Adam optimizer iterations
Memory usage was easy to measure via psutil, and it turned out that Adam uses about 4x more RAM for this task. In theory, L-BFGS should use more memory on large models, but in this case it’s optimizing just one parameter - a single image.
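Measuring it is just a matter of reading the process RSS before and after a run (a sketch; run_nst is a hypothetical wrapper around the optimization loop above):

```python
import os
import psutil

process = psutil.Process(os.getpid())

def rss_mb():
    # Resident memory of the current Python process, in megabytes.
    return process.memory_info().rss / 1024 ** 2

before = rss_mb()
run_nst(content_img, style_img)   # hypothetical entry point for the optimization
print(f"RAM used: {rss_mb() - before:.1f} MB")
```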

Memory usage comparison of the optimizers
Some say Adam might produce slightly less sharp results, but to me, its subtle blur effects - like soft brush strokes - actually work in favor of the stylization.
Clone, Noise, and Blend
You can run some interesting experiments with the input image passed to the optimizer. For example, the input can be a copy of the original content image, random noise (a tensor of the same shape), or even a blend - say, 90% content and 10% style, combined into a single input tensor.
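A sketch of the three starting points (assuming both images were resized to the same shape):

```python
# 1) Clone of the content image - the classic starting point.
init = content_img.clone()

# 2) Random noise of the same shape - the content has to be rebuilt from scratch.
init = torch.randn_like(content_img)

# 3) Blend - mostly content with a dash of style mixed in.
init = 0.9 * content_img + 0.1 * style_img

generated = init.requires_grad_(True)   # whichever we picked becomes the tensor to optimize
```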

Different optimizer inputs, same number of iterations
When using noise as the starting point, it makes sense to increase α (alpha), since the content needs to be reconstructed from scratch. Naturally, this also means you’ll need more iterations.
If the initial image already contains part of the style, optimization may converge faster. However, the final result might also inherit not just the style characteristics, but some actual elements of the style image - directly transferred into the output.
Different Devices
You can run all of this on various types of devices, and in code, it’s important to explicitly specify which device PyTorch should use. Let’s go over the options and finally measure performance:
▪CPU - available on every computer. Runs very slowly, since NST involves lots of convolutions and gradient steps, which are orders of magnitude slower on CPU, especially with larger images.
▪MPS (Metal Performance Shaders, macOS GPU backend) - available on Apple Silicon (M1/M2/M3). Significantly faster than CPU, though not yet as mature or flexible as CUDA.
▪CUDA (NVIDIA GPU backend) - runs on NVIDIA GPUs. The fastest and most stable option for NST. Particularly effective with the Adam optimizer - or so they say.
▪TPU (Tensor Processing Unit) - provided by Google. Extremely fast, but designed more for inference and training models than optimizing an input tensor; PyTorch support is limited.
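Device selection in PyTorch is a few lines (a sketch with a graceful fallback; TPU is the odd one out, since it needs the separate torch_xla package and isn't covered here):

```python
import torch

# Pick the fastest available backend and fall back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")      # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = torch.device("mps")       # Apple Silicon GPU (M1/M2/M3)
else:
    device = torch.device("cpu")

# The model and every tensor involved must live on the same device.
vgg = vgg.to(device)
content_img = content_img.to(device)
style_img = style_img.to(device)
```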
You can access CUDA and TPU via Google Colab - an online service by Google that lets you run Python code in the browser. The free tier doesn’t give you tons of compute units or top-tier hardware, but it’s more than enough to play around.

Running NST on different devices
A quick summary of how NST performed across devices:
▪Running on CPU isn’t worth it - you might grow old before it finishes.
▪The integrated GPU on the Apple M1 Pro performed well - roughly 1.5 to 2 times slower than Colab’s T4 GPU, which is comparable to an RTX 2060.
▪As expected, TPU (Colab v2-8) had issues: L-BFGS didn’t work at all. But Adam ran - and even faster than on free-tier CUDA.
Recap and What’s Next?
I managed to get a decent grasp of it - at least from a user's perspective, and enough to put it down on paper. The NST algorithm is nearly nine years old, but taking a deep dive into its parameters and stress-testing it on every available device - even cloud GPUs - turned out to be the perfect little rabbit hole.
And this is only the beginning! Because waiting 10 minutes for a single image to render? Yeah, that's not it. So up next is Fast Style Transfer - assuming I have enough compute units - this time not for pixel optimization, but for training a generator network. But that's a whole different story.