
Facial Keypoint Detection

Part 1: Nose Tip Detection

To get a feel for keypoint detection, I started off by detecting just the nose tip. Here are a few samples from the training data.

Original image
Original image
Original image

I trained a simple convolutional neural net with four convolutional layers and two fully connected layers to predict the location of the nose keypoint.
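A minimal PyTorch sketch of such a network. The channel widths and the input size (80×60 grayscale) are illustrative assumptions, not the exact values used:

```python
import torch
import torch.nn as nn

class NoseTipNet(nn.Module):
    """Small CNN that regresses a single (x, y) nose-tip coordinate."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # For an assumed 80x60 grayscale input, four halvings leave a 128 x 5 x 3 map.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 5 * 3, 256), nn.ReLU(),
            nn.Linear(256, 2),  # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.regressor(self.features(x))
```

Training would pair this with an MSE loss between predicted and ground-truth coordinates.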

Training and Validation Losses:

Some good predictions of my model:

Original image
Original image
Original image

Some weak predictions of my model:

Original image
Original image
Original image

Notably, the predictions my model gets wrong are on images where the person's face is tilted or shifted; these are the "out of distribution" examples in my set.

Part 2: Full Facial Keypoints Detection

This part aims to detect the full facial keypoint structure. To widen the breadth of the training distribution, random rotations (±15 degrees), translations (±10 pixels), and pixel jittering are applied to the training images. Below are the results and visualizations from the network.
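This augmentation step can be sketched in NumPy. The jitter magnitude and the nearest-neighbor resampling are assumptions; the key point is that the same rotation and translation must be applied to the keypoints as to the image:

```python
import numpy as np

def augment(img, pts, max_deg=15, max_shift=10, rng=None):
    """Randomly rotate, translate, and brightness-jitter an image and its keypoints.
    img: (H, W) float array in [0, 1]; pts: (N, 2) array of (x, y) keypoints."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    tx, ty = rng.integers(-max_shift, max_shift + 1, size=2)
    cx, cy = (w - 1) / 2, (h - 1) / 2
    c, s = np.cos(theta), np.sin(theta)

    # Inverse-map each output pixel back into the source image (nearest neighbor).
    ys, xs = np.mgrid[0:h, 0:w]
    xsrc = c * (xs - cx - tx) + s * (ys - cy - ty) + cx
    ysrc = -s * (xs - cx - tx) + c * (ys - cy - ty) + cy
    xsrc = np.clip(np.round(xsrc).astype(int), 0, w - 1)
    ysrc = np.clip(np.round(ysrc).astype(int), 0, h - 1)
    out = img[ysrc, xsrc]

    # Forward-map the keypoints with the same rotation + translation.
    px = c * (pts[:, 0] - cx) - s * (pts[:, 1] - cy) + cx + tx
    py = s * (pts[:, 0] - cx) + c * (pts[:, 1] - cy) + cy + ty

    out = np.clip(out + rng.uniform(-0.1, 0.1), 0, 1)  # pixel jitter (assumed magnitude)
    return out, np.stack([px, py], axis=1)
```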

Network Architecture

The network is a convolutional neural network (CNN) designed for facial keypoints detection. Below are the details of its architecture:

Convolutional Layers:

Layer   Input Channels   Output Channels   Kernel Size   Stride   Padding   Output Dimensions
Conv1   1                8                 7×7           1        3         h/2 × w/2
Conv2   8                14                5×5           1        2         h/4 × w/4
Conv3   14               20                3×3           1        1         h/8 × w/8
Conv4   20               26                3×3           1        1         h/16 × w/16
Conv5   26               32                3×3           1        1         h/16 × w/16
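A PyTorch sketch of this conv stack. The 2×2 max pooling after each of the first four blocks is an assumption inferred from the halving output dimensions in the table (stride-1 convolutions alone would not shrink the feature maps):

```python
import torch
import torch.nn as nn

# Conv stack mirroring the table above; 2x2 max pooling after the first
# four blocks (assumption) produces the listed output dimensions.
features = nn.Sequential(
    nn.Conv2d(1, 8, 7, stride=1, padding=3), nn.ReLU(), nn.MaxPool2d(2),   # h/2 x w/2
    nn.Conv2d(8, 14, 5, stride=1, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # h/4 x w/4
    nn.Conv2d(14, 20, 3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2), # h/8 x w/8
    nn.Conv2d(20, 26, 3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2), # h/16 x w/16
    nn.Conv2d(26, 32, 3, stride=1, padding=1), nn.ReLU(),                  # h/16 x w/16
)
```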

Fully Connected Layers:

Hyperparameters:

1. Sampled Images with Ground Truth Keypoints

Data Sample 0 Data Sample 1 Data Sample 2

2. Training Loss Curve

Loss Curve

The losses depicted are computed on the last image of each epoch, hence the high variance in the curve. Still, there is a clear downward trajectory and some degree of convergence.

3. Examples of Good and Poor Keypoint Detection

Good Detections

Good Detection 0 Good Detection 1

Poor Detections

Bad Detection 0 Bad Detection 1

In the images where detection was very poor, there are salient features, certain creases, that the model has learned to associate with the keypoints on which it was trained. For example, in Sample 13, the lower portion of the woman's smile is conflated with her chin. A larger dataset should remedy this.

4. Visualized Filters

Filter C1 Filter C2 Filter C3 Filter C4 Filter C5

Part 3: Training with a Larger Dataset

Finally, we try full facial keypoint detection with a larger dataset, using the same augmentation techniques as in the previous part.

Modified ResNet18 Architecture

We adapt a pre-trained ResNet18 for facial keypoint detection. Below are the modifications:

Hyperparameters:

Training Results

The following plot illustrates the training and validation loss across iterations:

Training and Validation Loss Plot

Visualization of Keypoint Predictions

Below are examples of keypoint predictions on the testing set:

Good Predictions:

Good Prediction 1 Good Prediction 2

Bad Predictions:

Bad Prediction 1 Bad Prediction 2

Sample 9: The left side of the image is significantly underexposed, causing the keypoint detection to deteriorate there.

Sample 10: Much of the face is out of the camera's frame. The incorrectly detected keypoints correspond precisely to the portion of the face that isn't in the image.

Testing on Personal Images

Here are the results of running the trained model on personal images:

Personal Image 1 Personal Image 2 Personal Image 3 Personal Image 4

Observations:

Testing on Given Test Images

Here are the results of running the model on provided test images:

Test Image 1 Test Image 2 Test Image 3

Observations:

High Dynamic Range

Solving for the Response Function (g)

The response function $g(Z)$ describes the logarithmic relationship between pixel values ($Z$) and exposure ($X = E \cdot \Delta t$). From Equation (2) in the paper, we derive:

\[g(Z_{ij}) = \ln(E_i) + \ln(\Delta t_j)\]

Where:

To solve for $g$, we minimize a quadratic objective function:

\[O = \sum_{i=1}^N \sum_{j=1}^P w(Z_{ij}) [g(Z_{ij}) - \ln(E_i) - \ln(\Delta t_j)]^2 + \lambda \sum_{z=Z_{min}+1}^{Z_{max}-1} w(z) [g(z-1) - 2g(z) + g(z+1)]^2\]

Here, $\lambda$ is a regularization parameter controlling smoothness. The weighting function $w(Z)$ emphasizes values in the middle of the intensity range:

\[w(Z) = \begin{cases} Z - Z_{min}, & \text{if } Z \leq 0.5(Z_{min} + Z_{max}) \\ Z_{max} - Z, & \text{otherwise} \end{cases}\]
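This least-squares system can be sketched in NumPy, following Debevec & Malik's formulation. The variable names and the row anchoring g at the middle pixel value (needed to pin down the arbitrary offset) are mine:

```python
import numpy as np

def gsolve(Z, log_t, lam=100, n=256):
    """Recover the log response curve g from the Debevec-Malik linear system.
    Z: (N, P) int pixel values (N pixels, P exposures); log_t: (P,) log exposure times."""
    w = lambda z: np.minimum(z, n - 1 - z).astype(float)  # hat weighting from above
    N, P = Z.shape
    A = np.zeros((N * P + n - 1, n + N))
    b = np.zeros(A.shape[0])
    k = 0
    for i in range(N):
        for j in range(P):
            wij = w(Z[i, j])
            A[k, Z[i, j]] = wij      # w * g(Z_ij)
            A[k, n + i] = -wij       # -w * ln(E_i)
            b[k] = wij * log_t[j]    # w * ln(dt_j)
            k += 1
    A[k, n // 2] = 1                 # anchor g at the curve's middle value
    k += 1
    for z in range(1, n - 1):        # second-difference smoothness term
        A[k, z - 1:z + 2] = lam * w(z) * np.array([1, -2, 1])
        k += 1
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    return x[:n]                     # g(0..255); x[n:] holds the ln(E_i)
```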

The recovered g curves are shown below:

Recovered g curves

Constructing the HDR Radiance Map

Once $g$ is recovered, we compute the logarithmic irradiance $\ln(E_i)$ for each pixel:

\[\ln(E_i) = \frac{\sum_{j=1}^P w(Z_{ij})(g(Z_{ij}) - \ln(\Delta t_j))}{\sum_{j=1}^P w(Z_{ij})}\]

This combines information across exposures, reducing noise and artifacts.
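This per-pixel weighted average across exposures can be sketched as:

```python
import numpy as np

def radiance_map(Z, g, log_t):
    """Weighted average of g(Z) - ln(dt) across P exposures.
    Z: (P, H, W) uint8 exposure stack; g: (256,) recovered response; log_t: (P,)."""
    w = np.minimum(Z, 255 - Z).astype(float)  # same hat weighting as before
    num = (w * (g[Z] - np.asarray(log_t)[:, None, None])).sum(axis=0)
    den = w.sum(axis=0)
    den[den == 0] = 1e-6  # guard pixels saturated in every exposure
    return num / den      # ln(E) per pixel
```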

Bilateral Filter Decomposition

The bilateral filtering process decomposes the HDR image into base and detail layers using the following steps:

1. Logarithmic Domain Processing

First, we convert the HDR radiance map to the logarithmic domain:

\[L = \ln(I) \text{ where } I = \text{mean}(R, G, B)\]

2. Bilateral Filtering Parameters

The bilateral filter is applied with the following parameters:

The bilateral filter preserves edges while smoothing the image by combining domain and range filtering:

\[B(x) = \frac{1}{W_p}\sum_{x_i \in \Omega} G_{\sigma_s}(||x - x_i||)G_{\sigma_r}(|L(x) - L(x_i)|)L(x_i)\]

Where:

3. Layer Decomposition

The image is decomposed into:
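A brute-force sketch of the base/detail split using the filter defined above. A real implementation would use a fast bilateral approximation; the window radius and default sigmas here are assumptions:

```python
import numpy as np

def bilateral(L, sigma_s=2.0, sigma_r=0.4, radius=4):
    """Brute-force bilateral filter on a log-luminance image L of shape (H, W)."""
    H, W = L.shape
    out = np.zeros_like(L)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    Gs = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))  # domain (spatial) kernel
    Lp = np.pad(L, radius, mode='edge')
    for i in range(H):
        for j in range(W):
            patch = Lp[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            Gr = np.exp(-(patch - L[i, j])**2 / (2 * sigma_r**2))  # range kernel
            wgt = Gs * Gr
            out[i, j] = (wgt * patch).sum() / wgt.sum()
    return out

def decompose(L):
    """Split log luminance into a smooth base layer and an edge-preserving detail layer."""
    base = bilateral(L)
    detail = L - base
    return base, detail
```

In the Durand-style pipeline, contrast compression is then applied to the base layer only, and the detail layer is added back before exponentiating.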

Tone Mapping Comparison

We compare three different tone mapping approaches:

1. Global Scale (Baseline)

2. Global Operator (Reinhard)

Global tone mapping with automatic exposure adjustment:

\[E_{\text{display}} = \frac{E_{\text{world}}}{1 + E_{\text{world}}}\]
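A sketch of this operator with the usual key-value exposure adjustment (the 0.18 middle-grey key is a conventional default, not a value from this project):

```python
import numpy as np

def reinhard_global(E, key=0.18):
    """Reinhard global operator: scale by key / log-average luminance,
    then compress with E / (1 + E), mapping [0, inf) into [0, 1)."""
    eps = 1e-6
    log_avg = np.exp(np.mean(np.log(E + eps)))  # log-average scene luminance
    Em = key * E / log_avg                      # automatic exposure adjustment
    return Em / (1 + Em)
```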

3. Local Operator (Bilateral)

Comparative Analysis

In summary, both projects were super fun! Facial keypoint detection gave me the opportunity to mess around with neural networks, while HDR taught me some nuances of image exposure.