A Dev’s Guide to Flawless Text Extraction w/ Kreuzberg v3.11

Tired of messy text extraction? Discover how Kreuzberg v3.11’s new context-aware parsing and advanced OCR can streamline your data workflows.

Alex Ivanov

Lead Engineer on the Kreuzberg project, passionate about clean data and developer experience.

Let’s be honest. If you’ve ever had to extract text from a mountain of PDFs or a folder of scanned images, you’ve felt the pain. You start with a glimmer of hope, maybe a slick Python script and a dash of regex. But soon, you’re knee-deep in edge cases, battling inconsistent formatting, and writing parsers so fragile they break if someone sneezes near the scanner.

You know the drill: the invoice that puts the total in the header, the report with multi-column text that reads like gibberish, or the scanned document that’s just a little too blurry for your old OCR library. It’s a frustrating, time-consuming cycle that pulls you away from what you actually want to do: build great software.

Well, it’s time to take a deep breath. We’ve been working on something special, and today, we’re thrilled to introduce Kreuzberg v3.11. This isn’t just an incremental update; it’s a fundamental leap forward in making text extraction simple, accurate, and, dare we say, enjoyable.

So, What’s New in Kreuzberg v3.11?

We listened to your feedback, your late-night bug reports, and your feature requests. Version 3.11 is packed with enhancements, but three core pillars stand out, each designed to tackle the biggest headaches in data extraction.

A Brain for Your Documents: The Context-Aware Parsing Engine

Most extraction tools treat a document as a flat wall of text. They don’t understand that a line at the top is a header, or that a grid of numbers is a table. Our new Context-Aware Parsing Engine changes that. Powered by a layout analysis model, Kreuzberg now identifies the structural elements of your document, such as headers, footers, headings, tables, and columns.

What does this mean for you? It means you can ask for a document’s tables, and get them back as structured data. You can isolate the footer, grab all the H1 headings, or extract content from a specific column. No more guessing line numbers or writing brittle regex to find a total.

Here’s how simple it is:

from kreuzberg import Document

# Load a complex PDF invoice
doc = Document.from_file("chaotic_invoice.pdf")

# Kreuzberg understands the document's structure
# Let's grab the first table and convert it to JSON
line_items_table = doc.tables[0]
print(line_items_table.to_json())

# Need the invoice number from the header?
# The engine identifies semantic regions.
invoice_number = doc.get_region_by_label("invoice_id").text
print(f"Invoice Number: {invoice_number}")

This engine transforms unstructured documents into a predictable, queryable object model. It’s like having a DOM for your PDFs and images.
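To make the “DOM for your PDFs” idea concrete, here is a minimal, self-contained sketch of what a queryable document model could look like. It is an illustration only, not Kreuzberg’s actual implementation: `ParsedDocument`, `Region`, `Table`, and `to_records` are hypothetical names chosen for this example, and the data is hard-coded rather than produced by a layout model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Region:
    """A labeled area of a page, such as a header field or a footer."""
    label: str
    text: str


@dataclass
class Table:
    """A table stored as a header row plus data rows."""
    columns: List[str]
    rows: List[List[str]]

    def to_records(self) -> List[Dict[str, str]]:
        # Pair each cell with its column name so rows become dictionaries.
        return [dict(zip(self.columns, row)) for row in self.rows]


@dataclass
class ParsedDocument:
    """Hypothetical container for the output of a layout-analysis step."""
    regions: List[Region] = field(default_factory=list)
    tables: List[Table] = field(default_factory=list)

    def get_region_by_label(self, label: str) -> Region:
        # Return the first region carrying the requested label.
        for region in self.regions:
            if region.label == label:
                return region
        raise KeyError(f"no region labeled {label!r}")


# Hard-coded stand-in for what a parser might return for one invoice.
doc = ParsedDocument(
    regions=[Region("invoice_id", "INV-2024-8817")],
    tables=[Table(["description", "quantity"], [["Industrial Widgets", "15"]])],
)

print(doc.get_region_by_label("invoice_id").text)
print(doc.tables[0].to_records())
```

The point of the sketch is the shape of the result: once regions and tables are first-class objects, lookups by label or by index replace line counting and regex.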

Crystal-Clear OCR for Messy Realities

Let’s face it, not all documents are pristine, digitally-born PDFs. More often, we’re dealing with a photo of a crumpled receipt, a fax from 1998, or a scan with a weird coffee stain. The OCR engine in v3.11 has been completely overhauled with a new deep learning model that thrives in these imperfect conditions.

Key improvements include:

  • Superior Accuracy: Drastically reduced character error rates, especially on low-resolution (under 150 DPI) and noisy images.
  • Skew and Orientation Correction: The model automatically detects and corrects for skewed pages and rotated text before processing, a common failure point for other tools.
  • Expanded Language Support: Enhanced recognition for over 120 languages, including better support for languages with complex scripts, all in the same pipeline.
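If you want to check accuracy claims like these against your own documents, character error rate (CER) is the usual yardstick: the edit distance between the OCR output and a hand-verified reference, divided by the length of the reference. The sketch below is a plain-Python version that assumes you already have both strings; the function names are ours and are not part of any Kreuzberg API.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Two typical OCR confusions: "l" read as "1" and "0" read as "O".
reference = "Total: 450.75"
ocr_output = "Tota1: 45O.75"
print(f"CER: {character_error_rate(reference, ocr_output):.3f}")
```

Running this over a small, representative sample of your own scans gives you a baseline for comparing OCR engines or versions.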

“We tested v3.11 on a batch of 10,000 archived field service reports that our old system couldn’t touch. It pulled the data with over 98% accuracy on the first pass. It saved us months of manual data entry.”

Webhooks & API Enhancements: It Just... Works

A powerful tool is useless if it’s a pain to integrate. We’ve focused heavily on the developer experience, making Kreuzberg v3.11 a seamless part of your stack.

The biggest news is the introduction of asynchronous processing with webhooks. For large, multi-page documents that might take a few moments to process, you no longer need to poll for a result. Simply submit the job with a callback URL, and Kreuzberg will send you a neat JSON payload when it’s done. This is perfect for building scalable, event-driven architectures.
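On the receiving side, a webhook consumer can be very small. The following is a rough sketch using only the Python standard library. It assumes the callback is an HTTP POST with a JSON body containing at least `job_id` and `status` (the fields shown in the sample payload later in this post); the handler name, port, and response codes are our own choices, and a production receiver would also need to authenticate the sender.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_job_notification(raw_body: bytes) -> dict:
    """Decode a webhook body and check the fields this sketch relies on."""
    payload = json.loads(raw_body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    for key in ("job_id", "status"):
        if key not in payload:
            raise ValueError(f"missing field: {key}")
    return payload


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = parse_job_notification(self.rfile.read(length))
        except ValueError:
            # Malformed JSON, bad encoding, or missing fields.
            self.send_response(400)
            self.end_headers()
            return
        # Acknowledge quickly; a real service would enqueue heavy work here.
        print(f"Job {payload['job_id']} finished with status {payload['status']}")
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Block and handle webhook calls until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. Keeping the parsing in its own function means the validation logic can be unit-tested without opening a socket.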

We’ve also added more granular API endpoints, giving you finer control over the extraction process and the ability to retrieve specific parts of a document without processing the entire file again.

Putting It to the Test: From Scanned Invoice to Structured JSON

Talk is cheap. Let’s walk through a common use case: processing a directory of scanned vendor invoices to populate a database.

First, we set up a simple script to iterate through our files and send them to the Kreuzberg API. We’ll use the new asynchronous endpoint.

import os

import requests

KREUZBERG_API_KEY = "YOUR_API_KEY"
API_ENDPOINT = "https://api.kreuzberg.ai/v3.11/process/async"
INVOICE_DIR = "./invoices"

headers = {"Authorization": f"Bearer {KREUZBERG_API_KEY}"}
params = {"webhook_url": "https://yourapi.com/webhook-receiver"}

for filename in sorted(os.listdir(INVOICE_DIR)):
    # Only submit the file types we expect to contain invoices
    if not filename.lower().endswith((".pdf", ".png")):
        continue
    filepath = os.path.join(INVOICE_DIR, filename)
    with open(filepath, "rb") as f:
        response = requests.post(
            API_ENDPOINT,
            headers=headers,
            params=params,
            files={"file": (filename, f)},
            timeout=30,
        )
    # Fail loudly on HTTP errors instead of hitting a KeyError on 'job_id'
    response.raise_for_status()
    print(f"Submitted {filename}, Job ID: {response.json()['job_id']}")

Moments later, our webhook endpoint (`https://yourapi.com/webhook-receiver`) starts receiving POST requests. The body of each request contains beautifully structured data, thanks to the Context-Aware Engine:

{
  "job_id": "job_abc123",
  "status": "completed",
  "source_file": "invoice-acme-corp.pdf",
  "data": {
    "invoice_id": "INV-2024-8817",
    "vendor_name": "Global Supplies Inc.",
    "due_date": "2025-02-15",
    "total_amount": 450.75,
    "tables": [
      {
        "id": "table_0",
        "rows": [
          {
            "description": "Industrial Widgets",
            "quantity": 15,
            "unit_price": 25.00,
            "line_total": 375.00
          },
          {
            "description": "Shipping & Handling",
            "quantity": 1,
            "unit_price": 75.75,
            "line_total": 75.75
          }
        ]
      }
    ]
  }
}

No parsing, no regex, no problem. Just clean, usable JSON ready to be inserted into your database or passed to the next service.
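As an illustration of that last step, here is one way the payload above could be loaded into a database. This is a sketch under our own assumptions: it uses an in-memory SQLite database, and the `invoices` and `line_items` table names and columns are invented for the example. Only the payload fields come from the sample response.

```python
import sqlite3

# Trimmed copy of the sample webhook payload shown above.
payload = {
    "job_id": "job_abc123",
    "data": {
        "invoice_id": "INV-2024-8817",
        "vendor_name": "Global Supplies Inc.",
        "due_date": "2025-02-15",
        "total_amount": 450.75,
        "tables": [{"id": "table_0", "rows": [
            {"description": "Industrial Widgets", "quantity": 15,
             "unit_price": 25.00, "line_total": 375.00},
            {"description": "Shipping & Handling", "quantity": 1,
             "unit_price": 75.75, "line_total": 75.75},
        ]}],
    },
}


def store_invoice(conn: sqlite3.Connection, payload: dict) -> None:
    """Write one invoice and its line items in a single transaction."""
    data = payload["data"]
    line_items = [
        (data["invoice_id"], row["description"], row["quantity"],
         row["unit_price"], row["line_total"])
        for table in data["tables"]
        for row in table["rows"]
    ]
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO invoices VALUES (?, ?, ?, ?)",
            (data["invoice_id"], data["vendor_name"],
             data["due_date"], data["total_amount"]),
        )
        conn.executemany("INSERT INTO line_items VALUES (?, ?, ?, ?, ?)", line_items)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (invoice_id TEXT PRIMARY KEY, vendor_name TEXT,
                           due_date TEXT, total_amount REAL);
    CREATE TABLE line_items (invoice_id TEXT, description TEXT, quantity INTEGER,
                             unit_price REAL, line_total REAL);
""")
store_invoice(conn, payload)

# Sanity check: the line items should add up to the invoice total.
line_sum = conn.execute("SELECT SUM(line_total) FROM line_items").fetchone()[0]
print(line_sum, payload["data"]["total_amount"])
```

Comparing the sum of the line items with `total_amount` is a cheap way to catch extraction mistakes before they reach downstream systems.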

Kreuzberg v3.11 vs. The Old Ways

How does this new approach stack up? Let’s break it down.

| Method | Accuracy (Complex Layouts) | Setup Time | Maintenance |
| --- | --- | --- | --- |
| Kreuzberg v3.11 | Very High | Minutes | Minimal |
| Custom Regex/Parsers | Low to Medium | Days or Weeks | High (Very Brittle) |
| Basic OCR Libraries | Low | Hours | Medium |

Stop Fighting Your Data

Your time as a developer is valuable. It’s better spent building features, solving business problems, and creating value—not debugging a parser because a vendor changed their invoice template.

Kreuzberg v3.11 is more than just a tool; it’s a new philosophy for document intelligence. It’s about abstracting away the messy, unpredictable world of unstructured data and giving you back a clean, reliable, and structured starting point. It’s about letting you focus on what you do best.

Upgrade to v3.11 today and see for yourself. We can’t wait to see what you build.
