Fixing Your IDT: Stop Double Faults After Enabling 'sti'

Stuck in a reboot loop after enabling interrupts? This guide diagnoses common causes of double faults, from PIC remapping to faulty ISRs, to get your OS back on track.

Alex Petrov

OS developer and low-level enthusiast dedicated to demystifying x86 architecture.

September 12, 2025

You’ve been grinding away at your hobby operating system. The Global Descriptor Table (GDT) is loaded, you’ve switched to protected mode, and you can even print colorful characters to the VGA buffer. It’s time for the next giant leap: enabling interrupts. You meticulously craft your Interrupt Descriptor Table (IDT), write some basic assembly stubs for your Interrupt Service Routines (ISRs), and with a mix of excitement and trepidation, you add the final two instructions: lidt to load your IDT and sti to enable interrupts.

And then... nothing. Or rather, something terrible. Your virtual machine enters a frantic reboot loop. If you’re lucky, your double fault handler catches it, screaming messages across the screen before the inevitable triple fault and reset. The excitement evaporates, replaced by a familiar frustration. Welcome to one of the most common, and most character-building, rites of passage in OS development. You’ve just met the dreaded double fault.

The good news is you're not alone, and the problem is almost always one of a few usual suspects. This isn't some arcane magic; it's a series of logical checks. In this guide, we'll walk through a debugging checklist to diagnose why your system is faulting and get your interrupt handling on the right track.

What Exactly Is a Double Fault?

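Before we start fixing things, let's quickly define the problem. A Double Fault (exception #8) is an abort the CPU raises when a second exception occurs while it is still trying to deliver the first one. For example, if a page fault occurs, the CPU tries to transfer control to your page fault handler (vector 14). If the IDT entry for #14 is invalid, or the stack it needs for the return frame is unusable, that delivery itself faults. The CPU can't hand over the original fault, so it raises a double fault instead. Not every pair of exceptions escalates this way: only certain combinations do (for instance, a general protection fault during delivery of a page fault), and a fault raised by a handler that has already started running is simply delivered as a new, separate exception.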

If the CPU fails to invoke the double fault handler (e.g., your IDT entry for #8 is also bad), it triggers a Triple Fault. A triple fault is unrecoverable and causes the processor to reset. This is why your machine reboots. Our goal is to fix the initial problem so we never even get to the double fault stage.
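The escalation rule can be written down as a small lookup. The sketch below assumes the classic exception classes for 32-bit protected mode (benign, contributory, page fault); the helper names `classify_exception` and `escalates_to_double_fault` are made up for illustration, and the code does not model the further step from double fault to triple fault.

```c
#include <stdbool.h>
#include <stdint.h>

/* Classes the CPU uses when a second exception arrives while the
 * first one is still being delivered. */
typedef enum { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PAGE_FAULT } exc_class_t;

/* Hypothetical helper: classify a vector number. */
static exc_class_t classify_exception(uint8_t vector) {
    switch (vector) {
    case 0:   /* #DE divide error        */
    case 10:  /* #TS invalid TSS         */
    case 11:  /* #NP segment not present */
    case 12:  /* #SS stack-segment fault */
    case 13:  /* #GP general protection  */
        return EXC_CONTRIBUTORY;
    case 14:  /* #PF page fault          */
        return EXC_PAGE_FAULT;
    default:  /* everything else, including hardware IRQs */
        return EXC_BENIGN;
    }
}

/* True when `second`, raised during delivery of `first`, becomes #DF. */
static bool escalates_to_double_fault(uint8_t first, uint8_t second) {
    exc_class_t a = classify_exception(first);
    exc_class_t b = classify_exception(second);
    if (a == EXC_CONTRIBUTORY) return b == EXC_CONTRIBUTORY;
    if (a == EXC_PAGE_FAULT)   return b != EXC_BENIGN;
    return false;  /* a benign first event never escalates */
}
```

Under these rules, a #GP raised while delivering a timer interrupt is still just a #GP; it only turns into a double fault if delivering that #GP fails as well.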

The Debugging Checklist: Your Path to Stability

Let's systematically go through the most common reasons you’re double faulting right after calling sti.

1. Is Your PIC Remapped Correctly?

This is arguably the #1 culprit. In the configuration the BIOS leaves behind, the master 8259 Programmable Interrupt Controller (PIC) delivers IRQs 0-7 on interrupt vectors 8-15 (the slave's IRQs 8-15 arrive on vectors 0x70-0x77). The problem? In protected mode, vectors 0-31 are reserved for the CPU's own exceptions, including Double Fault (#8) and General Protection Fault (#13). This creates a collision.

The moment you enable interrupts with sti, the timer (IRQ 0) is likely to fire. The PIC delivers it as interrupt vector 8, and the CPU has no way to tell it apart from a Double Fault: it simply runs whatever IDT entry 8 points to. A real double fault pushes an error code and a hardware interrupt does not, so even a correct #8 handler will misread the stack here, and a missing or broken entry faults immediately. Either way, the failures cascade into a genuine Double Fault and then a Triple Fault reset. You must remap the PIC to send IRQs to a safe range, typically starting at vector 32.
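To make the collision concrete, here is a minimal sketch of the IRQ-to-vector arithmetic. The helper names are hypothetical; the only facts assumed are the BIOS-default offsets (0x08 for the master, 0x70 for the slave) and that vectors 0-31 are reserved for CPU exceptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPU_RESERVED_VECTORS 32  /* vectors 0-31 belong to CPU exceptions */

/* Vector delivered for an IRQ line (0-15), given each PIC's base offset. */
static uint8_t irq_to_vector(uint8_t irq, uint8_t master_offset,
                             uint8_t slave_offset) {
    return (irq < 8) ? (uint8_t)(master_offset + irq)
                     : (uint8_t)(slave_offset + (irq - 8));
}

/* True when a hardware interrupt would land on a CPU exception vector. */
static bool collides_with_cpu_exception(uint8_t vector) {
    return vector < CPU_RESERVED_VECTORS;
}
```

With the default offsets, `irq_to_vector(0, 0x08, 0x70)` is 8, the Double Fault vector. With offsets 32 and 40, the same IRQ lands on vector 32 and the collision is gone.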

Here’s a typical PIC remapping sequence. It involves sending a series of Initialization Command Words (ICWs) to the master and slave PICs.

// main.c
#define PIC1_COMMAND 0x20
#define PIC1_DATA    0x21
#define PIC2_COMMAND 0xA0
#define PIC2_DATA    0xA1

// Reinitialize the PIC controllers, giving them specified vector offsets
void PIC_remap(int offset1, int offset2) {
    // Start the initialization sequence in cascade mode
    outb(PIC1_COMMAND, 0x11);
    outb(PIC2_COMMAND, 0x11);

    // ICW2: Master PIC vector offset
    outb(PIC1_DATA, offset1); 
    // ICW2: Slave PIC vector offset
    outb(PIC2_DATA, offset2);

    // ICW3: tell Master PIC that there is a slave PIC at IRQ2 (0000 0100)
    outb(PIC1_DATA, 4);
    // ICW3: tell Slave PIC its cascade identity (0000 0010)
    outb(PIC2_DATA, 2);

    // ICW4: have the PICs use 8086/88 mode
    outb(PIC1_DATA, 0x01);
    outb(PIC2_DATA, 0x01);

    // Unmask every IRQ line for now. Each IRQ vector then needs a valid
    // IDT entry; write 0xFF instead to keep a PIC fully masked.
    outb(PIC1_DATA, 0x0);
    outb(PIC2_DATA, 0x0);
}

// In your kernel_main(), before enabling interrupts:
PIC_remap(32, 40); // Remap IRQs to 32-47

Action: Double-check that you are remapping the PICs before loading your IDT and enabling interrupts. Ensure the offsets are clear of the first 32 CPU-reserved vectors.
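Unmasking every line is convenient but means all sixteen IRQ vectors need working handlers. One alternative, sketched here with hypothetical names (`pic_masks_t`, `pic_masks_for`), is to compute the mask bytes from the set of IRQs you actually handle and write those to the data ports instead of 0x0.

```c
#include <stdint.h>

typedef struct {
    uint8_t master;  /* value for the master data port (0x21) */
    uint8_t slave;   /* value for the slave data port (0xA1)  */
} pic_masks_t;

/* Build PIC mask bytes from a bitmap of IRQ lines (bit n = IRQ n) that
 * have a handler installed. In the mask registers a 1 bit DISABLES a line. */
static pic_masks_t pic_masks_for(uint16_t handled_irqs) {
    /* Slave IRQs reach the CPU through master line 2, so keep the
     * cascade line open whenever any of IRQ 8-15 is in use. */
    if (handled_irqs & 0xFF00) {
        handled_irqs = (uint16_t)(handled_irqs | (1u << 2));
    }
    pic_masks_t m;
    m.master = (uint8_t)~(handled_irqs & 0xFF);
    m.slave  = (uint8_t)~(handled_irqs >> 8);
    return m;
}
```

For a kernel that only handles the timer and keyboard (IRQs 0 and 1), this yields 0xFC for the master and 0xFF for the slave.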

2. Are Your ISR Stubs Correctly Implemented?

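You cannot simply point an IDT entry to a standard C function. When an interrupt occurs, the CPU saves only EFLAGS, CS, and EIP (plus an error code for some exceptions); it saves none of the general-purpose registers, so the interrupted code's registers are clobbered unless the handler preserves them. Furthermore, the handler must return with an iret (Interrupt Return) instruction, not a standard ret. A compiled C function ends in ret, which pops only the return address, leaves CS and EFLAGS on the stack, and leads to chaos.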

You need a small assembly "stub" for each ISR. This stub's job is to:

Save all registers (pushad is your friend).
Call your C handler function.
Restore all registers (popad).
Return from the interrupt (iret).

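; isr_stubs.asm (NASM syntax)
[GLOBAL isr0]                 ; Make it visible to C
[EXTERN C_interrupt_handler]  ; C handler, defined elsewhere

isr0:
    push dword 0 ; Dummy error code (vector 0 does not push one itself)
    push dword 0 ; Interrupt number
    pushad       ; Save all general-purpose registers
    cld          ; Compiled C code expects the direction flag to be clear

    call C_interrupt_handler

    popad        ; Restore all general-purpose registers
    add esp, 8   ; Discard the interrupt number and error code pushed above
    iret         ; Return; restores EIP, CS, and EFLAGS (including IF)
                 ; No cli/sti needed: an interrupt gate clears IF on entry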

Action: Verify that every active IDT entry points to a valid assembly stub that saves/restores state and uses iret. Make sure each stub is declared `GLOBAL` in assembly and `extern` in C; forgetting either normally shows up as a compile or link error. If an entry still holds a null handler address at run time, check that your setup code actually filled it in.
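It helps to pin down exactly what sits on the stack when the C handler runs. The layout below is a sketch under one assumption: the stub pushes an error code, then an interrupt number, then executes pushad, and no privilege-level change occurred. The type name `interrupt_frame_t` is hypothetical, and a stub that also pushes segment registers would need extra fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stack contents as seen right after pushad, lowest address first. */
typedef struct {
    /* Saved by pushad (pushed in the order eax, ecx, edx, ebx, esp,
     * ebp, esi, edi, so edi ends up at the lowest address). */
    uint32_t edi, esi, ebp, esp_before_pushad, ebx, edx, ecx, eax;
    /* Pushed by the stub (assumed): error code first, then the number. */
    uint32_t int_no, err_code;
    /* Pushed by the CPU on entry when the privilege level is unchanged. */
    uint32_t eip, cs, eflags;
} interrupt_frame_t;

/* Compile-time checks that the struct matches the 13 pushed dwords. */
static_assert(sizeof(interrupt_frame_t) == 13 * sizeof(uint32_t),
              "frame must contain no padding");
static_assert(offsetof(interrupt_frame_t, int_no) == 32,
              "pushad occupies the first 32 bytes");
static_assert(offsetof(interrupt_frame_t, eip) == 40,
              "CPU-pushed frame starts after the two stub dwords");
```

If the stub pushes esp just before the call, the C handler can take a pointer to this struct as its argument; the `add esp, 8` after popad then corresponds exactly to the `int_no` and `err_code` fields.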

3. Is Your IDT Entry Structure Flawless?

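The 8-byte (64-bit) IDT gate descriptor used in 32-bit protected mode is notoriously picky. One bit out of place and delivering the interrupt raises a General Protection Fault (#13) or a Segment Not Present fault (#11). If the gate for that exception is also broken, the second failure becomes a Double Fault.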

Here's a typical C struct for an IDT entry:

struct idt_entry {
    uint16_t base_low;    // Lower 16 bits of handler function address
    uint16_t sel;         // Kernel segment selector
    uint8_t  always0;     // This must be zero
    uint8_t  flags;       // Type and attributes
    uint16_t base_high;   // Upper 16 bits of handler function address
} __attribute__((packed));

Common mistakes in the flags byte (often called `type_attr`):

Present Bit (P): Is it set to 1? If it's 0, using the entry raises a Segment Not Present fault (#11).
Descriptor Privilege Level (DPL): For kernel handlers, this should be 0.
Gate Type: It should be set to a 32-bit interrupt gate (0xE) or trap gate (0xF). With P=1 and DPL=0, the complete flags byte is 0x8E for an interrupt gate or 0x8F for a trap gate.

The `sel` field must be your kernel's code segment selector from the GDT (e.g., 0x08). If it's 0 or points to a data segment, you'll fault.

Action: Write a helper function to create your IDT entries and triple-check the logic. Print out the values of a few entries to be sure. Is `sel` correct? Is the present bit set? Is `always0` actually zero?
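One possible shape for that helper is sketched below. The struct repeats the layout shown above; the macro names, `idt_make_entry`, and `idt_entry_looks_sane` are hypothetical, and the sanity check only catches the mistakes listed in this section (it cannot tell whether the selector really refers to a code segment).

```c
#include <stdbool.h>
#include <stdint.h>

struct idt_entry {
    uint16_t base_low;    /* Lower 16 bits of handler address */
    uint16_t sel;         /* Kernel code segment selector     */
    uint8_t  always0;     /* Must be zero                     */
    uint8_t  flags;       /* P, DPL, and gate type            */
    uint16_t base_high;   /* Upper 16 bits of handler address */
} __attribute__((packed));

#define IDT_FLAG_PRESENT 0x80
#define IDT_DPL(n)       (((n) & 0x3) << 5)
#define IDT_TYPE_INT32   0x0E
#define IDT_TYPE_TRAP32  0x0F

/* Split a 32-bit handler address across the two base fields. */
static struct idt_entry idt_make_entry(uint32_t handler, uint16_t sel,
                                       uint8_t flags) {
    struct idt_entry e;
    e.base_low  = (uint16_t)(handler & 0xFFFF);
    e.sel       = sel;
    e.always0   = 0;
    e.flags     = flags;
    e.base_high = (uint16_t)(handler >> 16);
    return e;
}

/* Debug check for the common mistakes: P bit, gate type, null selector. */
static bool idt_entry_looks_sane(const struct idt_entry *e) {
    uint8_t type     = e->flags & 0x0F;
    bool present     = (e->flags & IDT_FLAG_PRESENT) != 0;
    bool system_gate = (e->flags & 0x10) == 0;       /* bit 4 must be 0 */
    bool gate_ok     = type == IDT_TYPE_INT32 || type == IDT_TYPE_TRAP32;
    bool sel_ok      = (e->sel & 0xFFF8) != 0;       /* non-null index  */
    bool handler_ok  = e->base_low != 0 || e->base_high != 0;
    return present && system_gate && gate_ok && sel_ok && handler_ok &&
           e->always0 == 0;
}
```

Running the check over every entry you expect to be live, just before lidt, is a cheap way to catch a forgotten present bit or a zero selector before the CPU does.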

4. Are You Acknowledging the Interrupt (EOI)?

Once the PIC issues an interrupt, it waits for you to send it an "End-of-Interrupt" (EOI) signal. If you don't, it won't generate any more interrupts of the same or lower priority. This is especially critical for the timer (IRQ 0).

If your timer handler runs but doesn't send an EOI, the PIC never delivers the next timer tick. Because IRQ 0 has the highest priority, every other IRQ on the PIC stays blocked as well. This might not cause an immediate double fault, but your system will appear to go deaf after the first interrupt. The usual place to send the EOI is at the end of your C handler, after the real work is done and before the stub executes iret.

// In your C interrupt handler for a PIC interrupt:
void handle_irq(registers_t regs) {
    // Do your work here...

    // If the interrupt came from the slave PIC, we must send an EOI to it.
    if (regs.int_no >= 40) {
        outb(PIC2_COMMAND, 0x20);
    }

    // Always send an EOI to the master PIC.
    outb(PIC1_COMMAND, 0x20);
}

Action: Ensure that for any interrupt from the PIC (vectors 32-47 by convention), your handler sends an EOI command (0x20) to the appropriate PIC controller(s) before returning.

5. Stack Sanity Check: Is It Valid?

When an interrupt fires, the CPU automatically pushes the EFLAGS register, CS, and the EIP onto the current stack. If an error code is involved, that gets pushed too. What happens if your stack pointer (ESP) is invalid, or if you've run out of stack space?

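With paging enabled, the CPU's attempt to push data onto an unmapped stack causes a Page Fault; a stack pointer outside the stack segment's limit causes a Stack-Segment Fault (#12). Delivering that new fault needs the same broken stack, so it fails again and escalates to a Double Fault, and then to a Triple Fault when the double fault handler can't be entered either. Without paging and with flat segments, a stray ESP may not fault at all and will silently overwrite whatever it points at. Ensure you have allocated a sufficiently large and correctly aligned stack for your kernel before you enable interrupts.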

Action: In your bootloader or kernel entry, make sure you explicitly set `esp` to the top of a known, valid memory region you've reserved for the stack. A 16-byte-aligned stack of 4 KiB or 16 KiB is a good starting point.
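A minimal sketch of such a reservation follows. The names (`kernel_stack`, `kernel_stack_top`, `stack_has_room`) and the 16 KiB size are assumptions for illustration; loading the returned address into `esp` still has to happen in your assembly entry code.

```c
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_STACK_SIZE (16 * 1024)

/* Reserved in .bss; the attribute keeps both ends 16-byte aligned
 * because the size is a multiple of 16. */
static uint8_t kernel_stack[KERNEL_STACK_SIZE] __attribute__((aligned(16)));

/* The stack grows downward, so the initial ESP is one past the end. */
static uintptr_t kernel_stack_top(void) {
    return (uintptr_t)kernel_stack + KERNEL_STACK_SIZE;
}

/* Debug check: is `sp` inside the region with at least `bytes` left?
 * An interrupt with an error code pushes 16 bytes before any stub runs. */
static bool stack_has_room(uintptr_t sp, uintptr_t bytes) {
    uintptr_t base = (uintptr_t)kernel_stack;
    return sp <= base + KERNEL_STACK_SIZE && sp >= base + bytes;
}
```

Calling `stack_has_room(current_esp, 16)` right before sti is a quick way to confirm the CPU will be able to push its frame when the first interrupt arrives.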

Conclusion: From Faults to Functionality

Staring down a double fault is a classic OS developer challenge. It's frustrating, but it forces you to understand the intricate dance between your software, the CPU, and legacy hardware like the PIC. By methodically working through the checklist, you can turn that reboot loop into a blinking cursor.

To recap, the five horsemen of the interrupt apocalypse are almost always:

PIC not remapped, causing collisions with CPU exceptions.
Broken ISR stubs, using `ret` instead of `iret` or corrupting the stack.
Flawed IDT entries, with incorrect flags or segment selectors.
Missing EOI signals, leaving your PIC deaf to future interrupts.
An invalid stack, causing a fault within a fault.

Fix these, and you'll have a stable foundation for a responsive, event-driven operating system. Happy hacking!