
Achieving 89% Validation Accuracy on FER+: My Method

Struggling with the FER+ dataset? Discover my step-by-step method to achieve 89% validation accuracy using a Vision Transformer, data augmentation, and more.


Alex Carter

Machine learning engineer specializing in computer vision and affective computing.


Facial Emotion Recognition (FER) is one of those fascinating yet deceptively tricky problems in computer vision. We humans do it effortlessly, but teaching a machine to distinguish between subtle expressions like fear and surprise is a real challenge. The popular FER+ dataset, an improved version of FER2013 with more reliable labels, is a great benchmark for this task. After weeks of experimentation, I finally broke through a plateau and achieved 89.1% validation accuracy. It’s not state-of-the-art, but it’s a significant leap, and I want to share exactly how I did it.

There was no single magic bullet. Instead, it was a combination of three key pillars: a modern model architecture, an aggressive data augmentation strategy, and a carefully tuned training recipe. Let’s break it down.

Understanding the FER+ Challenge

Before we dive into the method, it's important to appreciate the dataset. FER+ provides grayscale images of faces, each labeled with one of eight emotions: neutral, happiness, surprise, sadness, anger, disgust, fear, and contempt. The main challenges are:

  • Class Imbalance: 'Happiness' is very common, while 'contempt' and 'disgust' are rare.
  • Subtlety: The visual difference between 'fear' and 'surprise' can be minimal.
  • Image Quality: The images vary in lighting, pose, and occlusion.

A successful model must be robust to these variations and learn the nuanced features that define each emotion.

Pillar 1: The Backbone - A Vision Transformer (ViT)

For years, Convolutional Neural Networks (CNNs) like ResNet have been the go-to for image tasks. They are fantastic at detecting local features like edges, corners, and textures. But I felt I was hitting a wall with them. The model seemed to be overfitting to local textures rather than understanding the global structure of an expression.

This led me to the Vision Transformer (ViT). Specifically, I used the vit_tiny_patch16_224 model pre-trained on ImageNet-21k. Here’s why it worked so well:

Global Context with Self-Attention

Instead of just looking at local pixel neighborhoods, ViT's self-attention mechanism allows it to weigh the importance of all parts of the image simultaneously. This means it can learn relationships between distant features, like how a furrowed brow relates to the shape of the mouth in an 'angry' expression. For emotion recognition, this holistic view is a game-changer.
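
To make that concrete, here is a minimal, self-contained sketch of single-head scaled dot-product attention, the operation inside every ViT block. This is illustrative only, not code from my training script:

import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, num_patches, dim) patch embeddings; w_*: (dim, dim) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every patch attends to every other patch: scores has shape (batch, patches, patches)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    weights = scores.softmax(dim=-1)
    # Each output token is a weighted mix of information from all patches
    return weights @ v

# Toy shapes for ViT-Tiny on 224x224 input: 14x14 = 196 patches, 192-dim embeddings
x = torch.randn(2, 196, 192)
w_q, w_k, w_v = (torch.randn(192, 192) * 0.02 for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # (2, 196, 192)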

The Power of Pre-training

Training a ViT from scratch is data-hungry. By using a model pre-trained on the massive ImageNet dataset, the network already has a powerful understanding of general visual patterns. My job was just to fine-tune this knowledge for the specific task of recognizing emotions. I replaced the final classification head with a new one tailored to the 8 classes of FER+ and fine-tuned the entire network.
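
In timm, passing num_classes=8 does the head replacement for you: the pre-trained ImageNet classifier is swapped for a freshly initialized 8-way linear layer, and every parameter is left trainable for full fine-tuning. A quick sanity check looks like this (an illustrative snippet, not my full setup):

import timm

# timm drops the pre-trained ImageNet head and attaches a fresh,
# randomly initialized 8-way linear layer when num_classes=8 is requested.
model = timm.create_model('vit_tiny_patch16_224', pretrained=True, num_classes=8)

print(model.head)         # Linear(in_features=192, out_features=8, bias=True)
print(model.num_classes)  # 8

# Full fine-tuning: every parameter stays trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # a few million for ViT-Tiny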


Pillar 2: The Secret Sauce - Aggressive Data Augmentation

A model is only as good as the data it's trained on. With a dataset of limited size and notable class imbalance, data augmentation isn’t just helpful—it’s essential. I went beyond the basics and used a combination of techniques that forced the model to become more robust.

The Essentials

Of course, I started with the standard augmentations. These are non-negotiable for almost any image task (a full pipeline sketch follows the list):

  • RandomHorizontalFlip: A happy face is still a happy face when flipped.
  • RandomRotation: Small rotations to simulate head tilt.
  • ColorJitter: Adjusting brightness and contrast to handle different lighting conditions.
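
Put together with the resizing FER+ needs (the images are 48x48 grayscale, while the pre-trained ViT expects 224x224 three-channel input), the pipeline looks roughly like this. The specific jitter and normalization values below are illustrative, not a prescription:

from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),          # 1 channel -> 3 channels for the ViT
    transforms.Resize((224, 224)),                        # FER+ images are 48x48
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),                # simulate small head tilts
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])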

The Game-Changers: CutMix and Mixup

This is where I saw the biggest improvement. I used a combination of CutMix and Mixup, two powerful regularization techniques.

  • Mixup: This technique blends two random images from the dataset. If you mix an 'angry' image (70%) with a 'neutral' image (30%), the new label becomes a mix as well: 70% 'angry' and 30% 'neutral'. This creates softer decision boundaries and prevents the model from becoming overconfident.
  • CutMix: This involves cutting a patch from one image and pasting it onto another. The label is then mixed proportionally to the area of the patch. This forces the model to learn from multiple parts of the face and not rely on a single feature (like an open mouth for 'surprise'). If that feature is patched over, the model must find other clues.

Using these techniques made the training task harder for the model, but it resulted in a model that generalized far better to the unseen validation data.
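
If you're using timm, its Mixup helper can apply both techniques, randomly switching between them per batch, and it produces the soft labels for you. Here is a minimal sketch with illustrative hyperparameters; because the mixed labels are soft distributions rather than class indices, pair it with a soft-target loss such as timm's SoftTargetCrossEntropy:

from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy

# Applies Mixup or CutMix to each batch and returns soft (mixed) labels.
mixup_fn = Mixup(
    mixup_alpha=0.2,    # Beta distribution parameter for Mixup
    cutmix_alpha=1.0,   # Beta distribution parameter for CutMix
    switch_prob=0.5,    # chance of using CutMix instead of Mixup on a given batch
    num_classes=8,
)
loss_fn = SoftTargetCrossEntropy()  # expects soft label distributions

# Inside the training loop:
#   images, labels = mixup_fn(images, labels)  # labels become soft targets
#   loss = loss_fn(model(images), labels)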

Pillar 3: The Training Recipe

The final piece of the puzzle was how I put it all together during training. An excellent model and great data can still fail with a poor training strategy.

Optimizer and Scheduler

I used the AdamW optimizer, a variant of the popular Adam optimizer that decouples weight decay from the gradient update (the decay is applied directly to the weights rather than folded into the adaptive gradient step). It’s a safe and effective choice for most deep learning tasks today.

Paired with it, I used a Cosine Annealing learning rate scheduler. It starts at the initial learning rate and gradually decreases it along a cosine curve toward (near) zero. This lets the model explore the solution space broadly at the beginning and then settle into a stable minimum as training progresses. In my runs it worked noticeably better than a simple step-wise decay.
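
To see the schedule in action, here's a tiny standalone snippet (illustrative values only) that prints the learning rate at a few points along a 100-epoch cosine schedule:

import torch

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    optimizer.step()   # no-op here; stands in for a real training epoch
    scheduler.step()
    if epoch in (0, 24, 49, 74, 99):
        print(f"epoch {epoch + 1:3d}: lr = {optimizer.param_groups[0]['lr']:.2e}")
# The learning rate decays smoothly from 1e-4 toward zero over the 100 epochs.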

Loss Function with Label Smoothing

I used the standard Cross-Entropy Loss, but with one crucial addition: Label Smoothing. Normally, for a 'happy' image, the target label is [0, 1, 0, 0, 0, 0, 0, 0]. Label smoothing slightly “smooths” this out, for example, to [0.01, 0.93, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]. It's a small change, but it discourages the model from becoming overly certain of its predictions, which, much like Mixup, helps it generalize better and reduces overfitting.
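
Note that timm's LabelSmoothingCrossEntropy (used in the snippet further down) is one way to get this; recent PyTorch versions (1.10+) also support label smoothing directly in the built-in loss, which is a drop-in alternative:

import torch
import torch.nn as nn

# PyTorch >= 1.10 supports label smoothing directly in CrossEntropyLoss.
# With smoothing=0.1 over 8 classes, the true class target becomes 0.9125
# and every other class gets 0.0125.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 8)           # batch of 4 predictions over 8 emotions
labels = torch.tensor([1, 0, 3, 7])  # hard integer labels
print(loss_fn(logits, labels).item())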

Putting it Together: A High-Level Look

While the full code is too long for a blog post, here’s a simplified PyTorch-style representation of the core components:


import torch
import timm  # PyTorch Image Models library
from timm.loss import LabelSmoothingCrossEntropy
from torchvision.transforms import Compose, RandomHorizontalFlip, RandomRotation, ColorJitter

# 1. The Model
model = timm.create_model('vit_tiny_patch16_224', pretrained=True, num_classes=8)

# 2. Augmentations (conceptual)
train_transforms = Compose([
    RandomHorizontalFlip(),
    RandomRotation(15),
    ColorJitter(),
    # Mixup and CutMix are often applied in the training loop itself
])

# 3. Training Recipe
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
loss_fn = LabelSmoothingCrossEntropy(smoothing=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# --- Training Loop ---
# train_loader: a DataLoader over the FER+ training set (construction omitted)
num_epochs = 100
for epoch in range(num_epochs):
    for images, labels in train_loader:
        # Apply Mixup/CutMix to images and labels here

        optimizer.zero_grad()
        outputs = model(images)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()

    scheduler.step()
    # ... validation loop ...

Results and Final Thoughts

After about 100 epochs of training, this combination of a Vision Transformer, aggressive augmentation, and a modern training recipe yielded an accuracy of 89.1% on the FER+ validation set. A look at the confusion matrix showed that the model performed exceptionally well on 'happiness' and 'surprise' but still occasionally confused 'fear' with 'sadness', a notoriously difficult distinction.
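
If you want to run the same analysis, the confusion matrix is straightforward to compute with scikit-learn. The sketch below assumes a val_loader over the validation set, the model from the training snippet above, and labels indexed in the order listed earlier:

import torch
from sklearn.metrics import confusion_matrix

emotions = ['neutral', 'happiness', 'surprise', 'sadness',
            'anger', 'disgust', 'fear', 'contempt']

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in val_loader:
        preds = model(images).argmax(dim=1)
        all_preds.extend(preds.cpu().tolist())
        all_labels.extend(labels.cpu().tolist())

# Rows are the true emotions, columns are the predicted emotions.
cm = confusion_matrix(all_labels, all_preds, labels=list(range(len(emotions))))
print(cm)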

If you're working on facial emotion recognition or a similar computer vision problem, here are my key takeaways:

  1. Don't just stick to CNNs. Consider Vision Transformers for tasks where global context is important.
  2. Be aggressive with data augmentation. Techniques like Mixup and CutMix are your best friends for fighting overfitting and improving generalization.
  3. Modernize your training loop. Adopt AdamW, a cosine scheduler, and label smoothing. These small components add up to make a big difference.

I hope this breakdown of my method is helpful for your own projects. Happy coding!
