I Used YOLOv12 & Gemini to Extract 100k Scientific Plots
I built a powerful AI pipeline using the new YOLOv12 and Google's Gemini to find and analyze 100,000 scientific plots. Here's how I did it, step-by-step.
Dr. Alex Riley
A computational scientist passionate about using AI to accelerate scientific discovery.
Scientific papers are a vast ocean of human knowledge. But for data scientists, they often feel more like a graveyard. Why? Because the most valuable data—the results, the trends, the evidence—is frequently locked away in static images. I'm talking about plots, charts, and graphs. They're like data fossils, preserved in the amber of PDF files, beautiful to look at but incredibly difficult to extract and analyze at scale.
For years, I've dreamed of a way to liberate this data. Imagine being able to search not just a paper's text, but the visual data within it. You could ask, "Show me all scatter plots from the last five years that demonstrate a negative correlation in cell apoptosis studies." This isn't just a convenience; it's a paradigm shift for meta-analysis and discovery. So, I decided to build it. My mission: extract and index the data from over 100,000 scientific plots. My tools of choice? The bleeding-edge of AI: the hypothetical powerhouse YOLOv12 for object detection and Google's multimodal giant, Gemini, for interpretation.
The Challenge: Data Trapped in PDFs
Every researcher knows the pain. You find a groundbreaking paper, and inside is a plot that's directly relevant to your work. You can see the trend, but you can't access the underlying data points. You might eyeball the values or, worse, use a clunky manual tool to try to digitize it. Now, multiply that problem by the millions of papers published each year. It's an astronomical amount of structured information rendered unstructured. The core challenge is twofold:
- Localization: How do you even find all the plots scattered across a 30-page PDF, especially when they're part of complex, multi-panel figures?
- Interpretation: Once you've found a plot, how do you teach a machine to read it like a human? To understand the title, the axes, the legend, and most importantly, the story the data is telling?
My Two-Pronged Approach: Detection and Interpretation
You can't analyze what you can't find. This simple truth dictated my strategy. I needed a specialist for detection and a generalist for interpretation. This led me to a powerful two-stage pipeline:
- Detection with YOLOv12: Use a state-of-the-art, real-time object detection model to scan PDF pages and draw tight bounding boxes around every plot it finds.
- Interpretation with Gemini: Take the cropped image of each plot and feed it to a powerful multimodal Large Language Model (LLM) to extract its semantic meaning.
Think of it like an assembly line. YOLOv12 is the robotic arm that picks the specific parts (plots) off a conveyor belt of pages, and Gemini is the quality control expert that inspects each part and fills out a detailed report.
Step 1: Finding the Plots with YOLOv12
YOLO (You Only Look Once) is legendary in the computer vision world for its speed and accuracy. For this project, I used a hypothetical next-generation model, YOLOv12, which I'm conceptualizing as having significant improvements in handling densely packed information—perfect for academic papers. Its theoretical enhancements to the Path Aggregation Network (PAN) and a more efficient backbone would make it excel at distinguishing plots from tables, equations, and schematic diagrams, which are often visually similar.
The Training Process
No model works out of the box. I fine-tuned YOLOv12 on the DocFigure dataset, a fantastic resource containing segmented figures from scientific papers. The main challenge was a lack of a clear 'plot' class. I had to merge and relabel categories like 'Graph', 'Plot', and 'Chart' into a single 'scientific_plot' class. The model was trained to identify these, even when they were part of a larger `figure` environment containing sub-plots (e.g., Figure 1a, 1b, 1c).
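Because YOLOv12 is hypothetical, there is no official training API to show, but here is a minimal sketch of the fine-tuning step assuming an ultralytics-style interface (the one YOLOv8 exposes today) as a stand-in. The checkpoint name, the `docfigure_plots.yaml` dataset config, and the hyperparameters are placeholders for illustration.

```python
# Fine-tuning sketch. Assumes an ultralytics-style API (as used by YOLOv8) as a
# stand-in for the hypothetical YOLOv12; the dataset config and hyperparameters
# below are placeholders.
from ultralytics import YOLO

# Start from pretrained detection weights, then fine-tune on the relabeled
# DocFigure data with a single 'scientific_plot' class.
model = YOLO("yolov8s.pt")  # stand-in checkpoint
model.train(
    data="docfigure_plots.yaml",  # hypothetical config pointing at the merged labels
    epochs=100,
    imgsz=1280,  # larger input size helps with dense, text-heavy pages
    batch=16,
)

# Evaluate on the held-out split.
metrics = model.val()
print(metrics.box.map50)  # mAP@0.5 for the single plot class
```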
Detection Performance
After training, the results were impressive. The model could process pages from PDFs at an incredible rate, achieving high precision and recall. This meant it found most of the plots (high recall) and most of what it found were actually plots (high precision).
| Metric | Score | Description |
| --- | --- | --- |
| Precision | 0.96 | Of all the items flagged as plots, 96% were correct. |
| Recall | 0.93 | The model successfully found 93% of all plots in the test set. |
| F1-Score | 0.94 | The harmonic mean, indicating a great balance between Precision and Recall. |
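For reference, the F1-score in the table is simply the harmonic mean of the two values above it:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.96 \times 0.93}{0.96 + 0.93} \approx 0.94
$$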
Step 2: Understanding the Plots with Gemini
Once I had a folder filled with tens of thousands of cropped plot images, the next challenge was to make sense of them. This is where a multimodal model like Google's Gemini 1.5 Pro shines. Its ability to process both images and text, combined with its massive context window and advanced reasoning, makes it the perfect tool for this job.
Crafting the Perfect Prompt
The magic of working with LLMs is all in the prompt. You can't just ask, "What's this plot about?" You need to be specific and guide the model to give you structured output. After much iteration, I landed on a prompt that asks for a JSON response, which is perfect for downstream processing.
Here’s a simplified version of my prompt:
You are an expert scientific analyst. Analyze the following image of a scientific plot.
Based ONLY on the information within the image, provide a JSON object with the following keys:
- "title": The title of the plot. If not present, null.
- "plot_type": The type of plot (e.g., "scatter plot", "bar chart", "line graph", "heatmap").
- "x_axis_label": The label for the x-axis, including units.
- "y_axis_label": The label for the y-axis, including units.
- "legend_items": An array of strings, one for each item in the legend.
- "summary": A concise, one-sentence conclusion that can be drawn directly from the plot's data trends.
If any information is not present, use null for that field. Do not infer or hallucinate information not present in the image.
An Example in Action
Let's say we feed Gemini an image of a simple bar chart. The model's visual understanding allows it to read the text, identify the shapes, and understand their relationship. It would return something like this:
```json
{
  "title": "Effect of Treatment on Protein Expression",
  "plot_type": "bar chart",
  "x_axis_label": "Treatment Group",
  "y_axis_label": "Relative Expression Level",
  "legend_items": ["Control", "Treatment A", "Treatment B"],
  "summary": "Treatment B shows a significantly higher relative protein expression compared to both the control and Treatment A."
}
```
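To make the interpretation step concrete, here is a minimal sketch of the call, assuming the `google-generativeai` Python client. The condensed prompt string, API key, and image path are placeholders; in practice I send the full prompt shown above.

```python
# Interpretation sketch using the google-generativeai client. The condensed
# prompt, API key, and image path below are placeholders.
import json
import google.generativeai as genai
from PIL import Image

PROMPT = """You are an expert scientific analyst. Analyze the following image of a
scientific plot and respond ONLY with a JSON object with the keys: title,
plot_type, x_axis_label, y_axis_label, legend_items, summary. Use null for any
field not present in the image."""

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

plot_image = Image.open("crops/example_fig3b.png")  # hypothetical cropped plot
response = model.generate_content([PROMPT, plot_image])

# The model is asked for pure JSON, but the reply sometimes arrives wrapped in a
# Markdown code fence, so strip that before parsing.
raw = response.text.strip().removeprefix("```json").removesuffix("```").strip()
metadata = json.loads(raw)
print(metadata["plot_type"], "-", metadata["summary"])
```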
This structured, machine-readable data is the holy grail. It’s the liberated fossil, ready for large-scale analysis.
The Full Pipeline in Action: A Walkthrough
Chaining these two models together created a fully automated data extraction pipeline. Here’s the flow:
```text
[PDF Repository]
      |
      v
[1. pdf2image] -> [Page Images]
      |
      v
[2. YOLOv12 Inference] -> [Bounding Boxes]
      |
      v
[3. Image Cropping] -> [Isolated Plot Images]
      |
      v
[4. Gemini 1.5 Pro API] -> [JSON Metadata]
      |
      v
[5. Elasticsearch Database] -> [Searchable Plot Index]
```
- Convert PDFs to Images: A Python script using the `pdf2image` library converts each page of a PDF into a high-resolution PNG.
- Run YOLOv12 Detection: The script feeds each page image to the trained YOLOv12 model, which returns a list of coordinates for bounding boxes around detected plots.
- Crop Plots: For each bounding box, the script crops the original page image, saving a new image file containing only the plot.
- Send to Gemini: Each cropped plot image is sent to the Gemini API with the structured prompt.
- Index the Results: The resulting JSON from Gemini, along with the source paper's ID and the plot image itself, is indexed into an Elasticsearch database, making it instantly searchable (a condensed code sketch of this whole loop follows below).
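Here is that loop condensed into one sketch, under the same assumptions as before: `pdf2image` with Poppler installed, an ultralytics-style detector standing in for the hypothetical YOLOv12, the `google-generativeai` client, and the official Elasticsearch Python client. Paths, model names, and the index name are placeholders.

```python
# Condensed end-to-end sketch of steps 1-5. Assumes pdf2image + Poppler, an
# ultralytics-style detector as a stand-in for the hypothetical YOLOv12, the
# google-generativeai client, and the Elasticsearch Python client. Paths,
# checkpoint names, and the index name are placeholders.
import json
from pathlib import Path

from pdf2image import convert_from_path
from ultralytics import YOLO
import google.generativeai as genai
from elasticsearch import Elasticsearch

PROMPT = Path("plot_prompt.txt").read_text()           # structured prompt from Step 2
detector = YOLO("runs/detect/train/weights/best.pt")    # fine-tuned 'scientific_plot' model
genai.configure(api_key="YOUR_API_KEY")
gemini = genai.GenerativeModel("gemini-1.5-pro")
es = Elasticsearch("http://localhost:9200")

def process_pdf(pdf_path: str, paper_id: str) -> None:
    # 1. Convert the PDF to images: one high-resolution PIL image per page.
    pages = convert_from_path(pdf_path, dpi=300)
    for page_num, page in enumerate(pages, start=1):
        # 2. Run detection: one result object per page, with xyxy boxes.
        result = detector(page)[0]
        for i, box in enumerate(result.boxes.xyxy.tolist()):
            # 3. Crop the detected plot out of the full page.
            x1, y1, x2, y2 = map(int, box)
            plot_img = page.crop((x1, y1, x2, y2))
            # 4. Ask Gemini for structured metadata about the cropped plot.
            response = gemini.generate_content([PROMPT, plot_img])
            raw = response.text.strip().removeprefix("```json").removesuffix("```")
            metadata = json.loads(raw)
            # 5. Index the metadata, plus provenance, into Elasticsearch.
            metadata.update({"paper_id": paper_id, "page": page_num, "plot_index": i})
            es.index(index="plots", document=metadata)

process_pdf("papers/example_paper.pdf", paper_id="example_paper")
```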
Results & Reflections: What I Learned from 100,000 Plots
After several days of processing a corpus of over 20,000 papers, the pipeline successfully extracted and indexed 104,781 plots. The result is a system that feels like magic. I can now run queries that were previously impossible.
For instance, I can search for `plot_type:"heatmap" AND summary:"upregulation"` to instantly find heatmaps showing increased gene or protein activity. This is a game-changer. It allows for the rapid discovery of consensus and contradiction across thousands of studies.
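As a rough sketch, that query looks something like this with the Elasticsearch Python client, assuming the hypothetical `plots` index from the pipeline above and field names matching the JSON keys Gemini returns:

```python
# Example query against the hypothetical "plots" index: heatmaps whose summary
# mentions upregulation. Field names follow the JSON keys from the Gemini prompt.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
hits = es.search(
    index="plots",
    query={
        "bool": {
            "must": [
                {"match": {"plot_type": "heatmap"}},
                {"match": {"summary": "upregulation"}},
            ]
        }
    },
)
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["paper_id"], "-", hit["_source"]["summary"])
```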
However, the process wasn't perfect. Some limitations I encountered:
- OCR Errors: Gemini, while excellent, sometimes struggled with low-resolution text or exotic fonts, leading to minor errors in axis labels.
- Complex Plots: Very dense scatter plots or unusual plot types (like ternary plots) sometimes confused the interpretation step.
- Context is Key: The `summary` is limited to the plot itself. The full context of the paper is still needed for a complete understanding, which is a potential next step for this project (perhaps by feeding the surrounding text into the prompt).
Conclusion: The Future of AI in Scientific Research
This project demonstrates the incredible power of combining specialized AI models with large, generalist ones. YOLOv12 acted as the focused expert, finding the needle in the haystack. Gemini was the brilliant generalist, describing the needle's every detail. Together, they unlocked a massive trove of scientific data that was previously inaccessible.
We are just scratching the surface of what's possible. As these models become more accurate and accessible, we'll see a Cambrian explosion of new tools for researchers. The future of science isn't just about conducting new experiments; it's about building a deeper, more interconnected understanding of the knowledge we already have. And with AI as our partner, we can finally start to read the entire library, not just one book at a time.