Automation

How I Automated YouTube Lecture Transcripts with Python & Whisper

Tired of scrubbing through long YouTube lectures? Learn how I built a simple Python script using OpenAI's Whisper to automatically transcribe videos into searchable text.

Alex Carter

A Python developer and automation enthusiast passionate about building practical AI tools.

7 min read · 19 views

Ever been there? You're studying for an exam, and you vaguely remember a professor explaining a key concept in a two-hour-long YouTube lecture. You spend the next 45 minutes frantically scrubbing through the video timeline, trying to pinpoint that one crucial 30-second explanation. It's like looking for a needle in a digital haystack.

I was there. A lot. As a lifelong learner who consumes a ton of educational content on YouTube, this was a constant source of friction. I knew there had to be a better way. That's when I decided to combine the power of Python with OpenAI's incredible Whisper model to build a tool that would change my study habits forever.

The Problem: Video is a Black Box

Video is a fantastic medium for learning, but it has one major flaw: it’s not easily searchable. Unlike a textbook or a blog post, you can’t just hit Ctrl+F (or Cmd+F) and find the exact term you're looking for. Your only options are:

  • Rely on the video creator adding timestamps (many don't).
  • Use YouTube's often-inaccurate auto-generated captions.
  • Manually scrub through the video, wasting precious time.
  • Transcribe it yourself, which is tedious and soul-crushing.

I needed a way to turn this opaque, linear format into a transparent, searchable document. I needed a transcript. Not just any transcript, but an accurate one, generated automatically.

The Game-Changer: Python & OpenAI's Whisper

The solution came from two of my favorite tools: Python, the versatile programming language, and Whisper, a state-of-the-art speech-to-text model from OpenAI.

Python is the glue that holds this whole project together. It's perfect for automation because of its simple syntax and a massive ecosystem of libraries that can do almost anything—including downloading YouTube videos and interacting with AI models.

OpenAI's Whisper is the real star of the show. It's an Automatic Speech Recognition (ASR) model trained on a huge dataset of diverse audio. What makes it so special? It's incredibly accurate, even with background noise, different accents, and technical jargon. It's also open-source, meaning you can run it on your own machine for free.

By combining a simple Python script with the power of Whisper, I could create a personal pipeline: YouTube URL in, clean text transcript out.

My Automated Transcription Workflow (Step-by-Step)

Ready to build your own? It's easier than you think. Here's exactly how I did it, broken down into simple steps. You'll need Python 3 installed on your system.

Step 1: Setting Up Your Python Environment

First, we need to install the necessary libraries. We'll need yt-dlp to download the audio from YouTube, and openai-whisper for the transcription. You'll also need ffmpeg, a command-line tool for handling audio and video. If you don't have it, you can find installation instructions on the official ffmpeg website.

Open your terminal or command prompt and run the following commands:

# Install the YouTube downloader
pip install yt-dlp

# Install Whisper and its dependencies (the package is published as openai-whisper;
# you can also install the latest code with pip install git+https://github.com/openai/whisper.git)
pip install -U openai-whisper

# On some systems, you might need to install ffmpeg separately
# On macOS with Homebrew: brew install ffmpeg
# On Windows with Chocolatey: choco install ffmpeg
# On Debian/Ubuntu: sudo apt update && sudo apt install ffmpeg
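Before moving on, it's worth confirming that everything actually installed, since a missing ffmpeg is the most common reason this pipeline fails later with a cryptic error. Here's a small sanity-check helper (the function name is my own, not part of any library):

```python
import shutil

def check_environment():
    """Report whether ffmpeg and the required Python packages are available."""
    ok = True
    # ffmpeg must be on the PATH for both yt-dlp's audio extraction and Whisper
    if shutil.which("ffmpeg") is None:
        print("ffmpeg not found on PATH -- install it before continuing.")
        ok = False
    for module in ("yt_dlp", "whisper"):
        try:
            __import__(module)
        except ImportError:
            print(f"Python package '{module}' is not installed.")
            ok = False
    return ok

# Example usage:
# if check_environment():
#     print("All set!")
```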

Step 2: Downloading YouTube Audio with yt-dlp

We don't need the whole video file, just the audio. yt-dlp makes this incredibly easy. We can write a small Python function to handle this. The following code will download the best quality audio from a given YouTube URL and save it as "audio.mp3".

import yt_dlp

def download_audio(url):
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'outtmpl': 'audio.%(ext)s', # yt-dlp fills in the final extension -> audio.mp3
        'quiet': False
    }
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            ydl.download([url])
        print("Audio downloaded successfully as audio.mp3")
        return True
    except Exception as e:
        print(f"Error downloading audio: {e}")
        return False

# Example usage:
# download_audio("https://www.youtube.com/watch?v=your_video_id_here")

Step 3: The Magic of Transcription with Whisper

Now that we have our audio file, we can feed it to Whisper. Whisper comes in several model sizes, each with a trade-off between speed, VRAM usage (if you have a GPU), and accuracy. For most lectures with clear audio, the `base` or `small` model is a fantastic starting point.

Here's a comparison to help you choose:

Model  | VRAM Required | Relative Speed | Accuracy        | Best For
tiny   | ~1 GB         | ~32x           | Good            | Quick tests, low-resource devices
base   | ~1 GB         | ~16x           | Great           | The best balance for most users
small  | ~2 GB         | ~6x            | Excellent       | Higher accuracy with moderate resource use
medium | ~5 GB         | ~2x            | Highly Accurate | Transcribing audio with heavy accents or noise
large  | ~10 GB        | 1x             | Most Accurate   | Maximum accuracy when time is not an issue
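To make that trade-off concrete, here's a tiny helper of my own (not part of Whisper) that picks the most accurate model fitting a given VRAM budget, using the approximate figures from the table:

```python
# Rough VRAM requirements in GB, taken from the comparison table above.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_model(available_vram_gb):
    """Return the most accurate Whisper model that fits the VRAM budget."""
    for name in ("large", "medium", "small", "base", "tiny"):
        if VRAM_GB[name] <= available_vram_gb:
            return name
    # Nothing fits comfortably; fall back to the smallest model anyway.
    return "tiny"

# Example usage:
# print(pick_model(4))  # a 4 GB GPU comfortably fits the "small" model
```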

Here's the Python code to load a model and transcribe our `audio.mp3` file.

import whisper

def transcribe_audio(model_name="base", audio_path="audio.mp3"):
    try:
        print(f"Loading Whisper model: {model_name}...")
        model = whisper.load_model(model_name)
        print("Model loaded. Starting transcription...")
        
        result = model.transcribe(audio_path)
        
        transcript = result["text"]
        print("Transcription complete.")
        
        # Save the transcript to a file
        with open("transcript.txt", "w", encoding="utf-8") as f:
            f.write(transcript)
        
        print("Transcript saved to transcript.txt")
        return transcript
    except Exception as e:
        print(f"Error during transcription: {e}")
        return None

# Example usage:
# transcribe_audio("base")

Step 4: The Complete Automation Script

Let's put it all together into one script. This script will take a YouTube URL as input, download the audio, transcribe it, save the result, and then clean up the audio file.

import whisper
import yt_dlp
import os
import time

def main():
    # --- Configuration ---
    YOUTUBE_URL = input("Please enter the YouTube video URL: ")
    MODEL_SIZE = "base"  # Options: tiny, base, small, medium, large

    # --- 1. Download Audio ---
    print("\n--- Step 1: Downloading Audio ---")
    audio_filename = "downloaded_audio.mp3"
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3'}],
        # Use %(ext)s so the converted file ends up as downloaded_audio.mp3
        # (a literal .mp3 here would produce downloaded_audio.mp3.mp3)
        'outtmpl': 'downloaded_audio.%(ext)s',
        'quiet': True
    }
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            print(f"Downloading audio from {YOUTUBE_URL}...")
            ydl.download([YOUTUBE_URL])
        print(f"Audio downloaded successfully as {audio_filename}")
    except Exception as e:
        print(f"Error downloading audio: {e}")
        return

    # --- 2. Transcribe Audio ---
    print("\n--- Step 2: Transcribing Audio ---")
    try:
        start_time = time.time()
        print(f"Loading Whisper model '{MODEL_SIZE}'...")
        model = whisper.load_model(MODEL_SIZE)
        print("Model loaded. Starting transcription (this may take a while)...")
        
        result = model.transcribe(audio_filename, fp16=False) # Set fp16=False if you don't have a GPU
        transcript = result["text"]
        
        end_time = time.time()
        print(f"Transcription finished in {end_time - start_time:.2f} seconds.")
        
        # --- 3. Save Transcript ---
        transcript_filename = "transcript.txt"
        with open(transcript_filename, "w", encoding="utf-8") as f:
            f.write(transcript)
        print(f"Transcript saved to {transcript_filename}")

    except Exception as e:
        print(f"Error during transcription: {e}")
    finally:
        # --- 4. Cleanup ---
        if os.path.exists(audio_filename):
            os.remove(audio_filename)
            print(f"Cleaned up audio file: {audio_filename}")

if __name__ == '__main__':
    main()

The Result: A Fully Searchable Study Guide

After running the script, you get a simple `transcript.txt` file. This plain text file is pure gold. That two-hour lecture is now a document I can open in any text editor. I can:

  • Search Instantly: Use Ctrl+F to find any keyword, name, or concept mentioned in the lecture. No more scrubbing!
  • Copy & Paste: Easily grab quotes, definitions, or code snippets for my notes.
  • Read & Skim: Quickly read through the lecture's content to get the gist or refresh my memory, which is much faster than watching it at 2x speed.
  • Summarize: I can even paste the entire transcript into a tool like ChatGPT and ask it to summarize the key points or create a study guide.
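The "search instantly" part doesn't even require a text editor. A few lines of Python can pull out every sentence that mentions a keyword; this is a rough sketch using naive sentence splitting, applied to the `transcript.txt` file the script writes:

```python
def search_transcript(text, keyword):
    """Return the sentences in `text` that mention `keyword` (case-insensitive)."""
    # Naive sentence splitting on ". " -- good enough for skimming a transcript.
    sentences = text.replace("\n", " ").split(". ")
    return [s.strip() for s in sentences if keyword.lower() in s.lower()]

# Example usage:
# with open("transcript.txt", encoding="utf-8") as f:
#     for hit in search_transcript(f.read(), "entropy"):
#         print("-", hit)
```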

The 45 minutes I used to spend searching for one clip now takes me less than 10 seconds. It's a total productivity revolution for video-based learning.

Key Takeaways & What's Next?

This little project was a powerful reminder that with a few lines of Python, you can solve real-world frustrations.

  • Automation is Accessible: You don't need to be a software engineer to build powerful tools. Python's libraries do most of the heavy lifting.
  • Open-Source AI is a Game-Changer: Tools like Whisper, which would have been science fiction a few years ago, are now free and available for anyone to use.
  • Turn Passive Consumption into Active Learning: This script transforms video watching from a passive activity into an active one where you can engage with, search, and repurpose the material.

What's next? This script is a great foundation. You could extend it by:

  • Adding timestamp data to the transcript for even easier navigation.
  • Building a simple web interface with Streamlit or Flask.
  • Automatically processing an entire YouTube playlist.
  • Integrating a language model to auto-generate summaries or flashcards.
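The first extension is closer than it looks: Whisper's `transcribe()` already returns per-segment timing in `result["segments"]`, where each segment carries `start`, `end`, and `text` keys. A small formatting helper (the function name is my own) can turn that into a timestamped transcript:

```python
def format_timestamped(segments):
    """Render Whisper segments as [MM:SS] lines for easier navigation."""
    lines = []
    for seg in segments:
        # Each segment's "start" is an offset in seconds from the beginning.
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")
    return "\n".join(lines)

# Example usage with a transcription result:
# result = model.transcribe("audio.mp3")
# print(format_timestamped(result["segments"]))
```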

So next time you're stuck scrubbing through a video, remember this. You have the power to automate the solution. Happy coding!
