My Workflow for Tagging 100k+ Plots with YOLOv12 & Gemini
Discover a powerful, real-world workflow for automatically tagging over 100,000 plots using a hybrid of YOLOv12 for detection and Gemini for analysis.
Leo Martinez
Principal ML Engineer specializing in computer vision and large-scale data processing pipelines.
We had a problem. A big one. Over 100,000 scientific papers, reports, and presentations, each containing a treasure trove of data locked away in plots and charts. How do you make that data searchable? Tagging it all by hand would take a team months, if not years. So, we automated it. Here’s the story of how we combined the raw speed of YOLOv12 with the incredible reasoning of Gemini to build an automated tagging pipeline.
The Mountain of Untagged Visuals
Imagine a digital library filled with PDFs. You can search the text, but what about the images? Specifically, the hundreds of thousands of bar charts, line graphs, scatter plots, and heatmaps that hold the most critical insights. They were effectively invisible to our search tools. We needed more than just knowing an image existed; we needed to know what it was about.
The core challenges were:
- Scale: 100,000+ documents and growing. Manual effort was a non-starter.
- Variety: The plots ranged from simple bar charts to complex, multi-panel figures with dense annotations.
- Specificity: A simple tag like "bar chart" wasn't enough. We needed to capture the title, axes, and a summary of the trend shown.
The Two-Part Strategy: Detection and Description
No single model could efficiently solve this entire problem. A large vision model like Gemini could analyze a full page, but it would be slow and expensive at this scale. A fast object detector could find the plots but couldn't understand them. The solution was a hybrid approach, breaking the problem into two distinct steps:
- Detection (The "Where"): Use a highly optimized object detection model to quickly scan each page and draw a bounding box around every plot.
- Description (The "What"): Crop the image to that bounding box and send the smaller, focused image to a powerful multimodal model for detailed analysis and tagging.
For this, our dream team was YOLOv12 and Gemini Pro Vision.
Step 1: Finding the Plots with YOLOv12 (The "Where")
YOLO (You Only Look Once) is legendary for its speed and accuracy in object detection. We chose the latest (hypothetical, as of this writing) YOLOv12 for its state-of-the-art performance on complex layouts. Our goal wasn't to classify the plot type here, just to find it. A single class, `plot`, was all we needed.
The Fine-Tuning Process
Even with a powerful pre-trained model, you need to fine-tune it on your specific domain. Our process looked like this:
- Manual Labeling (The Small Batch): We manually labeled about 1,000 plots across a diverse set of documents using a tool like Roboflow. This took a day but was crucial for accuracy.
- Training: We fine-tuned the pre-trained YOLOv12 model on our small, custom dataset. With modern frameworks, this only took a few hours on a single GPU (a minimal training sketch follows this list).
- Inference: Once trained, the model could process pages at an incredible rate. We built a simple Python script to iterate through our documents, convert pages to images, and run detection.
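Here's a minimal sketch of what that fine-tuning call can look like with the Ultralytics-style API. The base weights filename, dataset YAML path, and hyperparameters are placeholders rather than our exact configuration:

```python
# Minimal fine-tuning sketch (weights filename, dataset YAML, and hyperparameters are placeholders)
from ultralytics import YOLO

# Start from pre-trained weights (placeholder name for the YOLOv12 checkpoint)
model = YOLO('yolov12n.pt')

# Fine-tune on our single-class "plot" dataset described by a standard YOLO data YAML
model.train(
    data='plots_dataset.yaml',  # points to train/val images and the single 'plot' class
    epochs=50,
    imgsz=1280,                 # document pages benefit from higher input resolution
    batch=8,
    device=0,                   # single GPU
)

# The best checkpoint is saved automatically (e.g., runs/detect/train/weights/best.pt)
```

The simplified inference script comes next.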
```python
# Simplified example of YOLOv12 inference
from ultralytics import YOLO

# Load our fine-tuned model
model = YOLO('path/to/our/best.pt')

# Run detection on a document page
results = model.predict('document_page_1.png', conf=0.6)

for result in results:
    for box in result.boxes:
        # Get coordinates [x1, y1, x2, y2]
        coords = box.xyxy[0].tolist()
        # Now we have the location of a plot!
        crop_and_pass_to_gemini(coords, 'document_page_1.png')
```
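The `crop_and_pass_to_gemini` helper above is our own glue code, not a library function. A minimal sketch of the cropping half, using Pillow, might look like this (the Gemini hand-off is covered in the next step):

```python
# Hypothetical cropping helper: cut the detected region out of the page image
from PIL import Image

def crop_plot(coords, page_image_path):
    """Crop the bounding box [x1, y1, x2, y2] from the page and return a PIL image."""
    x1, y1, x2, y2 = [int(c) for c in coords]
    page = Image.open(page_image_path)
    return page.crop((x1, y1, x2, y2))
```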
This first step turned our massive, unstructured problem into a clean, ordered list of image crops, each containing exactly one plot.
Step 2: Understanding the Plots with Gemini (The "What")
With our plots isolated, it was time for Gemini to work its magic. Gemini's multimodal capabilities mean it can "see" an image and reason about its content just like it would with text. This is where we get the rich, descriptive metadata we need.
Prompt Engineering is Everything
The key to getting consistent, useful output from a Large Language Model is the prompt. We didn't just ask, "What is this?" We gave it a role, a clear task, and a required output format (JSON). This is a simplified version of our final prompt:
"You are an expert data analyst. Analyze the following plot image. Your task is to extract its key information and provide a summary. Respond ONLY with a valid JSON object with the following structure:
{"type": "...", "title": "...", "x_axis_label": "...", "y_axis_label": "...", "summary": "..."}
- type: Classify the plot (e.g., 'bar_chart', 'line_graph', 'scatter_plot', 'heatmap').
- title: Extract the exact title of the plot. If none, say 'N/A'.
- x_axis_label: Extract the label for the x-axis.
- y_axis_label: Extract the label for the y-axis.
- summary: Provide a 1-2 sentence summary of what the plot shows, including the main trend or key finding."
This structured prompting was a game-changer. It forced Gemini to return data we could directly parse and load into a database, eliminating any cleanup steps.
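For reference, the call itself is short. Below is a minimal sketch using the `google-generativeai` Python SDK; the prompt text is abbreviated here, and the fence-stripping is a small safeguard worth adding because models sometimes wrap JSON in markdown fences:

```python
# Minimal sketch of the Gemini analysis step (prompt abbreviated; API key is a placeholder)
import json

import google.generativeai as genai
from PIL import Image

genai.configure(api_key='YOUR_API_KEY')
model = genai.GenerativeModel('gemini-pro-vision')

PROMPT = (
    "You are an expert data analyst. Analyze the following plot image... "
    "Respond ONLY with a valid JSON object with the following structure: "
    '{"type": "...", "title": "...", "x_axis_label": "...", '
    '"y_axis_label": "...", "summary": "..."}'
)

def describe_plot(plot_image: Image.Image) -> dict:
    """Send one cropped plot to Gemini and parse the JSON it returns."""
    response = model.generate_content([PROMPT, plot_image])
    text = response.text.strip()
    # Strip markdown code fences if the model wraps the JSON in them
    if text.startswith("```"):
        text = text.strip("`")
        text = text.removeprefix("json").strip()
    return json.loads(text)
```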
From Image to Structured Data
We fed the cropped plot image and our prompt to the Gemini Pro Vision API. For a standard bar chart showing quarterly sales, the output was beautifully structured:
```json
{
  "type": "bar_chart",
  "title": "Quarterly Sales Performance Q3 2024",
  "x_axis_label": "Product Category",
  "y_axis_label": "Revenue (in USD millions)",
  "summary": "The chart shows that the 'Software' category generated the highest revenue in Q3 2024, followed by 'Hardware', with 'Services' having the lowest revenue."
}
```
This is infinitely more valuable than a simple "bar_chart" tag. We now have searchable, structured metadata for every single plot.
The Full Pipeline: From Chaos to Catalog
Here’s the entire workflow from start to finish (a condensed end-to-end sketch follows the list):
- Document Ingestion: A script watches a directory for new PDFs.
- Page Conversion: Each PDF page is converted into a high-resolution PNG image.
- YOLOv12 Detection: Our fine-tuned model scans the PNG and outputs bounding box coordinates for all detected plots.
- Image Cropping: For each bounding box, we crop the original PNG.
- Gemini Analysis: The cropped image is sent to the Gemini API with our structured prompt.
- Data Storage: The resulting JSON is parsed and stored in our database (we used PostgreSQL), linked to the original document and page number.
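Tying those pieces together, an orchestration sketch might look like the following. `pdf2image` and `psycopg2` are real libraries, but the table schema, helper names, and paths are illustrative assumptions rather than our production code (the `describe_plot` helper is the one sketched earlier):

```python
# Condensed end-to-end sketch (helper names, paths, and DB schema are illustrative)
import json
from pathlib import Path

import psycopg2
from pdf2image import convert_from_path
from ultralytics import YOLO

model = YOLO('path/to/our/best.pt')
conn = psycopg2.connect('dbname=plots user=pipeline')

def process_pdf(pdf_path: Path):
    # 1-2. Ingest the PDF and convert each page to a high-resolution image
    pages = convert_from_path(str(pdf_path), dpi=300)
    for page_number, page in enumerate(pages, start=1):
        # 3. Detect plots on the page
        results = model.predict(page, conf=0.6)
        for box in results[0].boxes:
            # 4. Crop the detected plot
            x1, y1, x2, y2 = [int(c) for c in box.xyxy[0].tolist()]
            plot_image = page.crop((x1, y1, x2, y2))
            # 5. Ask Gemini for structured metadata
            metadata = describe_plot(plot_image)
            # 6. Store the JSON alongside document and page references
            with conn, conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO plot_tags (document, page, metadata) VALUES (%s, %s, %s)",
                    (pdf_path.name, page_number, json.dumps(metadata)),
                )
```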
To see why this hybrid approach is so effective, here’s a quick comparison:
| Approach | Speed | Accuracy | Cost | Metadata Richness |
|---|---|---|---|---|
| Manual Tagging | Extremely Slow | High (but prone to fatigue) | Very High (labor) | High |
| YOLO-only (Classification) | Very Fast | Medium (can't read text) | Low (local compute) | Low (e.g., "bar_chart") |
| Gemini-only (Full Page) | Slow | High | High (large image tokens) | Very High |
| Hybrid (YOLO + Gemini) | Fast | Very High | Moderate | Very High |
The hybrid model gives us the best of all worlds: the speed of a specialized detector and the intelligence of a large vision model, all while keeping API costs manageable by only sending small, relevant image crops.
Challenges & Lessons Learned
It wasn’t all smooth sailing. Here are a few key takeaways:
- Edge Cases are Real: Some charts are weird. We had to add a human-in-the-loop verification step for low-confidence scores from either model to catch oddly formatted or extremely dense plots.
- Cost Management is Key: Running Gemini on 100k+ images isn't free. Using YOLO to pre-process and filter saved us a fortune. We also implemented batching and rate-limiting to control the flow of API calls (a simple throttling sketch follows this list).
- The Prompt is a Living Document: We continuously refined our prompt as we encountered new plot types to improve the consistency and accuracy of the JSON output.
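As a rough illustration of the throttling idea (not our exact implementation), a simple pattern is to cap the steady-state request rate and retry with exponential backoff on transient errors:

```python
# Simple throttle-and-retry pattern (rate limit and retry counts are illustrative)
import time

REQUESTS_PER_MINUTE = 60
MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE

def call_with_backoff(fn, *args, max_retries=5, **kwargs):
    """Call fn, pacing successful calls and backing off exponentially on failure."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            result = fn(*args, **kwargs)
            time.sleep(MIN_INTERVAL)  # keep the steady-state rate under the cap
            return result
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff before the next attempt
```

Usage is a one-line wrap around the API call, e.g. `metadata = call_with_backoff(describe_plot, plot_image)`.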
Final Takeaways
Combining specialized AI models is a superpower. YOLOv12 is a scalpel, perfect for cutting through the noise with incredible speed. Gemini is a brain, capable of nuanced understanding and reasoning. Together, they allowed us to build a system that accomplished a task that was previously unthinkable at this scale.
This workflow has fundamentally changed how we interact with our data. What was once a graveyard of static images is now a rich, searchable, and intelligent catalog of insights. If you're facing a massive data-tagging challenge, I highly recommend exploring a similar hybrid approach.