Renaissance Portrait Painting to Digital Photo — Coding with StyleGAN and pSp Encoder

Shuvam Ghosal
6 min read · Jun 21, 2021

Have you ever fancied the idea of appreciating the beauty of the real Mona Lisa as Leonardo Da Vinci saw her, more than five centuries ago? Thanks to AI, the translation of portrait paintings to photographs has been made possible, and with it, that idea could become a reality. You can now get a glimpse of how long-gone famous personalities, known to us only through their painted portraits, might have looked in real life.

Rapid developments in the Generative Adversarial Network (GAN) domain have yielded astonishing applications, and high quality images across a wide range of domains can now be generated with ease. One of the state-of-the-art GAN architectures is StyleGAN, developed by researchers at NVIDIA. In addition, there is a robust encoder known as the pixel2style2pixel (pSp) encoder, which works as a good supplement to StyleGAN for generating more realistic images.

Some underlying concepts needed to implement this idea:

  1. The pSp encoder - The encoder embeds the image directly into an extended intermediate latent space (W+) as a real-valued vector. During the encoding process, the features and styles, i.e., the variable aspects of the image, are learnt and encoded at three hierarchical levels based on the level of detail:

the coarse (0 to 2) or lowest-level features, like the edges and outlines of the facial features

the medium-level features (3 to 6)

the fine features (7 to 17), like hair and skin color,

where the numbers in parentheses indicate the indices of the combined style vector containing the styles of the corresponding level.

First, three intermediate feature maps are generated using a standard feature pyramid over a ResNet (a CNN architecture) backbone. The encoding then produces 18 different 512-dimensional style vectors, one per StyleGAN input layer, using small fully convolutional map2style networks (one network per style vector) that extract the styles from the corresponding feature maps. These vectors are fed as input to the layers of a pre-trained StyleGAN model, which uses them to generate a high quality face image. In effect, the encoder translates input pixels to output pixels through an intermediate style representation (a minimal sketch follows the architecture figure below).

pSp Architecture
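To make this concrete, here is a minimal PyTorch-style sketch of the map2style idea. The class names, channel counts, and pyramid spatial sizes are illustrative assumptions, not the authors' implementation:

```python
import math
import torch
import torch.nn as nn

class Map2Style(nn.Module):
    """Small fully convolutional network that reduces one feature map
    to a single 512-dimensional style vector via strided convolutions."""
    def __init__(self, in_ch, spatial):
        super().__init__()
        blocks, ch = [], in_ch
        for _ in range(int(math.log2(spatial))):  # halve H and W down to 1x1
            blocks += [nn.Conv2d(ch, 512, 3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            ch = 512
        self.net = nn.Sequential(*blocks)

    def forward(self, x):
        return self.net(x).flatten(1)  # (B, 512)

class PSPEncoderSketch(nn.Module):
    """18 map2style heads reading from three pyramid levels:
    coarse styles 0-2, medium 3-6, fine 7-17 (shapes are assumptions)."""
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(
            [Map2Style(512, 16) for _ in range(3)] +    # coarse
            [Map2Style(512, 32) for _ in range(4)] +    # medium
            [Map2Style(512, 64) for _ in range(11)])    # fine

    def forward(self, coarse, medium, fine):
        # coarse/medium/fine: pyramid feature maps from a ResNet backbone
        maps = [coarse] * 3 + [medium] * 4 + [fine] * 11
        styles = [head(f) for head, f in zip(self.heads, maps)]
        return torch.stack(styles, dim=1)  # (B, 18, 512): a W+ code
```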

The pSp encoder helps us encode the image without extra per-image optimization and with minimal reconstruction error. Moreover, it captures the styles, which allows multi-modal image synthesis, i.e., different versions of the same basic image obtained by altering these style vectors.

Loss functions used in the pSp encoder:

Identity Loss (LID) - maintains the facial identity across the two domains: LID(x) = 1 − ⟨R(x), R(pSp(x))⟩, where R is the pre-trained ArcFace face recognition network

L2 Pixel Loss (L2) - the L2 norm of the pixel-wise difference between the input and output images: L2(x) = ||x − pSp(x)||2

LPIPS Loss (LPIPS) - learns the perceptual similarities and preserves the output image quality: LPIPS(x) = ||F(x) − F(pSp(x))||2, where F is the perceptual feature extractor

In summary, the total loss function is defined as

L(x) = λ1·L2(x) + λ2·LPIPS(x) + λ3·LID(x) + λ4·Lreg(x),

where λ1, λ2, λ3 and λ4 are constants defining the loss weights, and Lreg is a regularization term that keeps the predicted latent code close to the average latent.
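As a rough illustration, here is a schematic PyTorch version of this combined objective. The arcface, lpips_net, and w_avg handles, as well as the lam weight values, are placeholders rather than the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def psp_total_loss(x, y, w, arcface, lpips_net, w_avg,
                   lam1=1.0, lam2=0.8, lam3=0.1, lam4=0.005):
    """Schematic pSp training objective. x: input image, y: pSp output,
    w: predicted W+ code. All model handles and weights are placeholders."""
    l2 = F.mse_loss(y, x)                         # pixel-wise L2
    l_lpips = lpips_net(y, x).mean()              # perceptual similarity
    # identity: cosine similarity between ArcFace embeddings
    ex = F.normalize(arcface(x), dim=1)
    ey = F.normalize(arcface(y), dim=1)
    l_id = 1.0 - (ex * ey).sum(dim=1).mean()
    l_reg = (w - w_avg).pow(2).mean()             # keep w near the mean latent
    return lam1 * l2 + lam2 * l_lpips + lam3 * l_id + lam4 * l_reg
```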

2. Style Mixing:

The facial features and the styles, i.e., the variable aspects of an image, can be changed using style mixing. The StyleGAN architecture learns the styles of the input image across a number of sequential, hierarchical layers. Consequently, modifying some layers changes the coarse (low), medium, or fine (high) level features and styles of the image, depending on which layers of the latent vector have been altered. This alteration can be accomplished in two main ways:

i) Mixing the latent vector, in some of the required layers, with a real-valued noise vector sampled from a Gaussian distribution

ii) Mixing the latent vector of the original image with that of another image bearing different styles (variations), in order to incorporate the desired styles of that image into the original one.

A mixing factor called alpha, with a value between 0 and 1, controls the amount of style mixing (sketched in code after the figure below).

Style Mixing
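Here is a minimal sketch of option (ii), blending two W+ codes on selected layers; the layer indices and alpha value are illustrative:

```python
import torch

def mix_styles(w_a, w_b, layers, alpha=0.7):
    """Blend two W+ codes of shape (B, 18, 512) on the chosen layers.
    alpha in [0, 1] controls how much of w_b's style is injected."""
    w_mixed = w_a.clone()
    w_mixed[:, layers] = alpha * w_b[:, layers] + (1 - alpha) * w_a[:, layers]
    return w_mixed

# e.g. borrow only the fine styles (hair/skin color) of a reference image:
# w_out = mix_styles(w_source, w_reference, layers=list(range(7, 18)), alpha=0.7)
```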

3. Encoder for Inversion: This project uses the pSp encoder for inversion, which learns the unique latent vector of the input image in the latent domain. Unlike other encoding mechanisms, it does the encoding without any extra per-image optimization. The learned vector is then passed to a StyleGAN generator, which reconstructs the input image with minimal reconstruction error. The error stays small due to the identity loss (LID), which maintains the facial identity, i.e., the recognizable features of the input face, and the LPIPS loss, which preserves the perceptual similarity between the input and its reconstruction.
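The contrast with optimization-based inversion is easy to see in code. Below, generator and psp_encoder are placeholder handles; the optimization loop is a generic sketch of per-image latent fitting, not any specific published method:

```python
import torch
import torch.nn.functional as F

# Optimization-based inversion (what pSp avoids): many gradient
# steps per image to fit a latent code to the generator.
def invert_by_optimization(generator, x, steps=500, lr=0.01):
    w = torch.zeros(1, 18, 512, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(generator(w), x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w

# pSp inversion: a single forward pass through the trained encoder.
def invert_by_encoder(psp_encoder, x):
    with torch.no_grad():
        return psp_encoder(x)  # (1, 18, 512)
```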

4. StyleGAN

The StyleGAN model is a state-of-the-art model for generating very high resolution images. Unlike standard GAN models, where the input is a random noise vector fed straight to the generator, StyleGAN first passes the noise vector through 8 fully connected layers (an MLP, or multi-layer perceptron) to produce an intermediate vector called W. This mapping disentangles the factors of variation, so that values in W correspond more cleanly to individual facial features. The model relies on the following components (a combined sketch of the mapping network and AdaIN follows the list):

StyleGAN Architecture

Progressive Growing - This is based on the ProGAN model, where the network is trained progressively, generating images of increasing resolution in a sequential fashion, doubling from 4x4 up to 1024x1024 (4, 8, 16, …, 1024).

Adaptive Instance Normalization (AdaIN) - Each feature map in the generator is instance-normalized, which strips out its style and keeps the content. The intermediate vector (W) is then passed through a small learned fully connected (affine) layer that produces a per-channel scale (weight) and bias for the styles, which re-modulate the normalized feature maps.

Noise Vector Addition - Noise is added to the feature maps at each layer of the model to introduce stochastic variation (e.g., the exact placement of hair strands or freckles), enhancing the diversity of the generated images.
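Here is a compact, illustrative sketch of the mapping network and an AdaIN layer; the dimensions and activation choices are assumptions for readability, not NVIDIA's exact implementation:

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """The 8 fully connected layers that turn a noise vector z into
    the intermediate latent vector W."""
    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)  # w: (B, 512)

class AdaIN(nn.Module):
    """Normalize each feature map (content), then re-style it with a
    per-channel scale and bias computed from w."""
    def __init__(self, channels, w_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.affine = nn.Linear(w_dim, channels * 2)  # -> (scale, bias)

    def forward(self, x, w):
        style = self.affine(w).view(-1, 2, x.size(1), 1, 1)
        scale, bias = style[:, 0], style[:, 1]
        # per-layer noise (stochastic variation) would be added to x here
        return scale * self.norm(x) + bias
```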

The properties of StyleGAN are:

Style Mixing

Stochastic Variation

Separation of global effects from stochasticity

Proposed Approach

  • First, the dataset is augmented using standard data augmentation, and the images are cropped to the face region using the deep-learning-based MTCNN face detection model
  • A pixel2style2pixel (pSp) encoder for inversion is trained on the CelebA-HQ dataset to learn the latent vectors of human faces along with their inherent styles
  • The learned vector is passed through a pre-trained StyleGAN2 model, trained on the popular FFHQ (Flickr-Faces-HQ) dataset, to generate high resolution (1024x1024) face images
  • The portrait painting is given as input to this combined (pSp + StyleGAN2) model to generate the primary photographic version
  • Style mixing (changing some fine features of the face with the help of random latent vectors) is then applied to enhance the output photograph and make it look more realistic (see the end-to-end sketch below)
Proposed Architecture

Following this approach, one can achieve the desired output photographs.
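Put together, the pipeline might look like the sketch below. MTCNN comes from the facenet-pytorch package; psp_encoder, stylegan2, and stylegan2_mapping are placeholder handles for the trained models, and the file name is hypothetical:

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN  # deep-learning-based face cropper

mtcnn = MTCNN(image_size=256)
face = mtcnn(Image.open('mona_lisa.jpg'))      # cropped face tensor (3, 256, 256)

# psp_encoder and stylegan2 below are placeholder handles for the
# trained pSp encoder and the pre-trained FFHQ StyleGAN2 generator.
with torch.no_grad():
    w_plus = psp_encoder(face.unsqueeze(0))    # (1, 18, 512)
    photo = stylegan2(w_plus)                  # primary photographic version

    # optional style mixing: refresh the fine layers (7-17) with a
    # random latent to make hair/skin texture look more natural
    z = torch.randn(1, 512)
    w_rand = stylegan2_mapping(z).unsqueeze(1).repeat(1, 18, 1)
    w_plus[:, 7:] = 0.7 * w_plus[:, 7:] + 0.3 * w_rand[:, 7:]
    refined = stylegan2(w_plus)
```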

Some sample outputs:

Mona Lisa
Sir Isaac Newton
Frida Kahlo
Carl Friedrich Gauss
Niccolò Machiavelli

Aren’t these output photographs cool?

I hope you enjoyed reading this article.

Happy Blogging! :)
