Building Germany's First ELTIF Comparison Platform: How We Extract 50+ Data Points from Unstructured PDFs Using AI

An inside look at the technical architecture powering myELTIF.de

By Sebastian · December 2025 · 10 min read
AI · Document Extraction · ELTIF · Fintech · Technical Architecture

The Problem: A Billion EUR Market Hidden in PDFs

When we started building myELTIF.de, we faced a paradox: European Long-Term Investment Funds (ELTIFs) are designed for retail investors, yet comparing them is nearly impossible. Each fund publishes a Key Information Document (KID) — a standardized regulatory document that should make comparison easy. But here's the catch: these documents come as PDFs, each formatted differently by the respective asset manager that publishes them.

For investors trying to compare liquidity terms, fee structures, or performance scenarios across 200+ funds, this means manually reading through thousands of pages of dense financial documents. A single fund might have 3-5 share classes, each with slightly different terms. Multiply that across the growing ELTIF market, and you have a transparency nightmare.

At myELTIF.de, we're solving this by automatically extracting structured data from these unstructured PDFs using AI. Here's how we built it.

Why This Matters

ELTIFs represent one of the most significant shifts in European capital markets since their introduction in 2015. For the first time, retail investors can access private equity, infrastructure, and private debt — asset classes previously reserved for institutions. The market is projected to grow significantly as more asset managers launch ELTIF products and the regulatory framework continues to evolve.

But there's a critical gap: transparency tooling hasn't kept pace with market growth. Without proper comparison tools, retail investors are essentially flying blind when choosing between products. A 0.5% difference in management fees compounds to tens of thousands of euros over a 7-year holding period. Understanding liquidity terms — when you can actually access your money — is critical for financial planning.

Why ELTIF KID Extraction Is Technically Challenging

Before diving into our solution, it's worth understanding what makes this problem difficult:

1. No Standard Format Despite Regulation

While KIDs follow PRIIPs regulatory guidelines, each asset manager interprets these differently. BlackRock uses clean tables and English-style naming. Goldman Sachs has nested SICAV structures. German asset managers use verbose legal language. Partners Group has multiple vintages with similar names.

2. Complex Financial Data Structures

We're not just extracting simple text fields. Performance scenarios come in multi-dimensional tables (5 scenarios × 2 time horizons × bilingual). Cost structures show projections over different time periods. Fee structures can have hurdle rates, high-water marks, and complex calculation methods.

3. The Deduplication Problem

Asset managers often issue multiple share classes of the same fund. "BlackRock Private Equity ELTIF - Class A" and "BlackRock Private Equity ELTIF - Class B" should be merged as one fund with two share classes. But "BlackRock Private Equity ELTIF" and "BlackRock Private Equity Fund" are completely different regulatory products that must remain separate.

Our Solution: A Two-Stage AI Processing Pipeline

After testing various approaches — pure OCR, single-stage LLM extraction, rules-based parsing — we landed on a hybrid architecture that combines Google Document AI for structural understanding with Google Gemini 2.5 Pro for semantic extraction.

Architecture Overview

PDF Upload
    ↓
┌─────────────────────────────────────────────┐
│ Stage 1: Document AI (EU)                   │
│ • Native PDF parsing                        │
│ • Table detection & extraction              │
│ • Produces structured text + tables         │
└─────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────┐
│ Stage 2: Gemini 2.5 Pro                     │
│ • Schema-driven extraction (50+ fields)     │
│ • Bilingual field generation (DE + EN)      │
│ • JSON-mode for structured output           │
└─────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────┐
│ Stage 3: Smart Deduplication Engine         │
│ • Confidence-based fund matching            │
│ • ISIN validation + name similarity         │
│ • Human confirmation for ambiguous cases    │
└─────────────────────────────────────────────┘
    ↓
MongoDB (Standardized fund database)

Tech Stack: FastAPI (Python) · Google Document AI · Google Gemini API · MongoDB · Alpine.js · Celery + Redis for async processing
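
Each stage runs as an async job. Here is a minimal sketch of how such a pipeline can be wired with Celery and Redis; the task bodies, names, and broker URLs are illustrative placeholders, not our production code:

    from celery import Celery, chain

    app = Celery(
        "eltif_extraction",
        broker="redis://localhost:6379/0",   # placeholder Redis instance
        backend="redis://localhost:6379/1",
    )

    @app.task
    def parse_document(pdf_path: str) -> dict:
        """Stage 1: Document AI structural parsing -> text blocks + tables."""
        ...

    @app.task
    def extract_fields(docai_output: dict) -> dict:
        """Stage 2: Gemini schema-driven extraction -> 50+ field JSON."""
        ...

    @app.task
    def deduplicate(extraction: dict) -> str:
        """Stage 3: match against existing funds; merge, create, or queue for review."""
        ...

    def process_kid(pdf_path: str):
        # Chain the stages so each result feeds the next; Celery handles queuing.
        return chain(parse_document.s(pdf_path), extract_fields.s(), deduplicate.s())()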

Stage 1: Document AI for Structural Parsing

Google Document AI is our foundation. Unlike basic OCR, it understands document structure — it knows what a table is, can detect merged cells, and preserves spatial relationships between elements.

What Document AI gives us:

  • Native PDF text extraction (cleaner than OCR when possible)
  • Table detection and extraction with preserved structure
  • Performance scenario identification using keyword matching
  • Page-level confidence scores to flag low-quality scans

Document AI is specifically trained on millions of documents, so it handles layout variations that would break traditional parsers. When we tested pure text extraction (using PyMuPDF), complex tables — especially performance scenarios with merged cells — had ~40% extraction errors. Document AI reduced this to under 5%.

The output is structured JSON with separated text blocks, identified tables, and spatial metadata. This becomes the input for our semantic extraction layer.
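
For reference, a minimal sketch of the Stage 1 call using the google-cloud-documentai Python client, pinned to the EU endpoint; the processor path is a placeholder:

    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    def parse_kid(pdf_bytes: bytes, processor_name: str) -> documentai.Document:
        """Run a KID PDF through a Document AI processor in the EU region."""
        client = documentai.DocumentProcessorServiceClient(
            client_options=ClientOptions(api_endpoint="eu-documentai.googleapis.com")
        )
        result = client.process_document(
            request=documentai.ProcessRequest(
                name=processor_name,  # projects/<project>/locations/eu/processors/<id>
                raw_document=documentai.RawDocument(
                    content=pdf_bytes, mime_type="application/pdf"
                ),
            )
        )
        # document.pages[*].tables preserves header/body rows and cell spans,
        # which is what makes merged-cell performance scenarios recoverable.
        return result.document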

Stage 2: Gemini for Semantic Understanding

Document AI gives us structure, but Gemini gives us understanding. This is where we map the raw document elements to our 50+ field schema.

The Data Schema

Our extraction schema has two levels:

Fund-Level Data (shared across all share classes):

  • Basic info: Fund name, manager, depositary
  • Investment details: Asset classes, substrategy, geographic focus
  • Risk: Risk indicator (1-7 scale), recommended holding period
  • Liquidity: Fund type, purchase/redemption frequency, lockup periods
  • Descriptive fields: Investment objective, fund term (all bilingual)

Share-Class-Level Data (specific to each ISIN):

  • Identity: Share class name, ISIN, currency
  • Fees: Entry, exit, management, performance fees, hurdle rates
  • Distribution policy: Accumulating vs distributing
  • Performance scenarios: 5 scenarios × 2 time horizons × bilingual
  • Cost projections: Total costs for 1 year and final year
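
As a condensed illustration, the two levels map naturally onto nested Pydantic models. The field names below are abbreviated stand-ins for our full 50+ field schema:

    from enum import Enum
    from pydantic import BaseModel

    class DistributionPolicy(str, Enum):
        ACCUMULATING = "accumulating"
        DISTRIBUTING = "distributing"

    class LocalizedText(BaseModel):
        de: str | None = None  # None means "missing", not "not applicable"
        en: str | None = None

    class ShareClass(BaseModel):
        name: str
        isin: str
        currency: str
        entry_fee_pct: float | None = None
        management_fee_pct: float | None = None
        performance_fee_pct: float | None = None
        distribution_policy: DistributionPolicy | None = None

    class Fund(BaseModel):
        name: str
        manager: str
        risk_indicator: int  # PRIIPs SRI, 1-7 scale
        recommended_holding_period_years: float | None = None
        investment_objective: LocalizedText  # bilingual by construction
        share_classes: list[ShareClass] = []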

Prompt Engineering for Accuracy

With LLMs, structured extraction depends entirely on prompt quality. Our prompt is 235 lines of carefully crafted instructions that guide Gemini through the extraction process. We provide explicit examples for handling German legal naming conventions, specify how to extract numeric values from formatted text, and define rules for distinguishing fund names from share class descriptors.

For example, German KIDs often have extremely verbose titles like "Voll eingezahlte Anteile der Klasse A ohne Nennwert am Carlyle European Tactical Private Credit ELTIF" (fully paid-up no-par-value Class A shares of the Carlyle European Tactical Private Credit ELTIF). Our prompt teaches Gemini to extract just the core fund name ("Carlyle European Tactical Private Credit ELTIF") and the share class separately ("Klasse A" / "Class A").

We use Gemini's native JSON mode (response_mime_type: "application/json") to guarantee valid JSON output, and set temperature to 0.1 for consistency across extractions.
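
In code, that configuration looks roughly like the following (shown with the google-genai SDK; prompt handling is simplified):

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment

    def extract_fund_data(docai_text: str, extraction_prompt: str) -> str:
        response = client.models.generate_content(
            model="gemini-2.5-pro",
            contents=[extraction_prompt, docai_text],
            config=types.GenerateContentConfig(
                response_mime_type="application/json",  # native JSON mode
                temperature=0.1,  # low temperature for consistent extractions
            ),
        )
        return response.text  # a JSON string matching the schema in the prompt

The returned string can then be parsed with json.loads and validated against the schema models before anything is written to MongoDB.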

Stage 3: The Smart Deduplication Engine

Here's a problem we didn't anticipate early enough: How do you know if "BlackRock Private Equity ELTIF - Class A" and "BlackRock Private Equity ELTIF - Class B" are share classes of the same fund versus completely different products?

Our first naive approach was "just match by ISIN!" We quickly discovered the edge cases:

  • Same ISIN, different fund names → Likely an extraction error
  • Different ISINs, same fund name → Probably different share classes (should merge)
  • Similar names, same manager → Could be different vintages or subfunds

Confidence-Based Matching Algorithm

We implemented a scoring system that evaluates multiple dimensions:

Critical Difference Detection (Highest Priority)

First, we check for differences that must mean different funds:

  • One is "ELTIF" and the other isn't → Different regulatory products
  • Different subfund indicators → Separate funds
  • If either check triggers, confidence = 0% (no match)

Similarity Scoring

For candidates that pass the critical difference check:

  • Manager match: +0.3 points
  • Fund type match (open vs closed): +0.2 points
  • Name similarity (Jaccard index): up to 0.4 points

Confidence Thresholds:

  • ≥0.9 (EXACT): Auto-merge the share classes
  • 0.75-0.9 (HIGH): Auto-merge
  • 0.6-0.75 (MEDIUM): Requires user confirmation
  • 0.4-0.6 (LOW): Requires user confirmation
  • <0.4 (NONE): Auto-create new fund
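
Putting these pieces together, here is a simplified sketch of the matcher. The weights mirror the points above, while the helper and field names (e.g. subfund_indicator) are illustrative:

    def jaccard(a: str, b: str) -> float:
        """Token-level Jaccard index between two fund names."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def match_confidence(new: dict, existing: dict) -> float:
        # Critical difference check: these always mean different funds.
        if ("eltif" in new["name"].lower()) != ("eltif" in existing["name"].lower()):
            return 0.0
        if new.get("subfund_indicator") != existing.get("subfund_indicator"):
            return 0.0

        score = 0.0
        if new["manager"] == existing["manager"]:
            score += 0.3  # manager match
        if new.get("fund_type") == existing.get("fund_type"):
            score += 0.2  # open vs closed
        score += 0.4 * jaccard(new["name"], existing["name"])  # name similarity
        return score

    def decide(confidence: float) -> str:
        if confidence >= 0.75:
            return "auto_merge"       # EXACT / HIGH
        if confidence >= 0.4:
            return "ask_user"         # MEDIUM / LOW -> confirmation queue
        return "create_new_fund"      # NONE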

Human-in-the-Loop Confirmation

When confidence is MEDIUM or LOW, we don't guess — we ask. The system stores pending confirmations and presents a web-based UI showing:

  • New fund name vs existing fund name
  • Manager comparison
  • Existing share classes
  • Confidence score and reasoning

Users can choose "Merge" or "Create New Fund." This 2-day development investment saved us weeks of manual database cleanup.
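
Under the hood, each pending confirmation can be stored as a simple MongoDB document along these lines (field names and values are hypothetical):

    pending_confirmation = {
        "new_fund_name": "BlackRock Private Equity ELTIF II",
        "existing_fund_name": "BlackRock Private Equity ELTIF",
        "new_manager": "BlackRock",
        "existing_manager": "BlackRock",
        "existing_share_classes": ["Class A", "Class B"],
        "confidence": 0.68,
        "reasoning": "Manager and fund type match; name suggests a different vintage.",
        "status": "pending",  # becomes "merged" or "created_new" after the decision
    }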

Results: Reading KIDs at Machine Speed

The impact of automation is dramatic:

Manual Extraction (Before):

  • Time: 30-45 minutes per fund (with 2-3 share classes)
  • Error rate: ~10-15% (typos, misread numbers)
  • Bottleneck: Human data entry

AI Extraction (After):

  • Time: ~2 minutes per fund (fully automated)
  • Accuracy: ~95% for structured fields
  • Processing speed: 90-120 seconds per document
  • Bottleneck: Only ambiguous fund matching (handled via confirmation UI)

Lessons Learned

1. Start with Deduplication Logic Early

What happened: We built the extraction pipeline first, then realized we had duplicate funds in our database.

Impact: Spent a week refactoring and adding the confirmation UI.

Lesson: Design your deduplication strategy during schema design, not after.

2. The 80/20 Rule Applies to AI Extraction

Reality: We hit 80% accuracy in 2 weeks. Getting from 80% → 95% took 6 more weeks.

What worked:

  • Adding specific examples for edge cases
  • Post-processing cleanup for common errors
  • Constrained value enforcement for critical fields

What didn't work:

  • Trying to handle every edge case in the prompt (it got too long)
  • Over-engineering validation rules (false positives were worse than false negatives)

3. User Confirmation UI Was Worth Building

Initial plan: Auto-create new funds for all low-confidence matches.

Reality: This created duplicates for legitimate subfunds and variants.

Lesson: For ambiguous cases in data pipelines, involving a human decision-maker early beats trying to automate everything. The confirmation UI (2 days of work) saved weeks of cleanup.

4. Prompt Engineering Is Software Engineering

Our 235-line prompt has comments, examples, and explicit formatting rules. It's as important as any Python module in our codebase. We treat it like production code: version controlled, tested, documented.

Next evolution: Building a prompt testing framework with regression tests and A/B testing for changes.
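
As a flavor of what those regression tests could look like, here is a hypothetical pytest sketch reusing the extract_fund_data helper from earlier; the fixture paths, load_prompt, and module layout are placeholders:

    import json
    import pytest
    from extraction import extract_fund_data, load_prompt  # hypothetical module

    # Golden cases pair a known KID's Document AI output with expected key fields.
    GOLDEN_CASES = [
        ("fixtures/carlyle_class_a.json", {
            "fund_name": "Carlyle European Tactical Private Credit ELTIF",
            "share_class_name_en": "Class A",
        }),
    ]

    @pytest.mark.parametrize("fixture_path,expected", GOLDEN_CASES)
    def test_prompt_extracts_key_fields(fixture_path, expected):
        with open(fixture_path) as f:
            docai_text = json.load(f)["text"]
        result = json.loads(extract_fund_data(docai_text, load_prompt()))
        for field, value in expected.items():
            assert result[field] == value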

Key Takeaways

If you're building a similar document extraction pipeline:

1. Hybrid Approaches Win

Don't rely on a single LLM. Combine specialized document processors (Document AI, Textract) for layout understanding, LLMs for semantic extraction, and post-processing rules for common patterns.

2. Schema Design Is Critical

A well-designed schema makes the difference between 80% and 95% accuracy:

  • Design for internationalization from day one
  • Store both human-readable text and machine-readable values
  • Use constrained enums for critical fields
  • Handle nulls properly (distinguish "missing" from "not applicable")

3. Build for Human-in-the-Loop

Even at 95% accuracy, you need ways to handle the 5%:

  • Confirmation UIs for ambiguous cases
  • Manual review queues for low-confidence extractions
  • Easy correction interfaces

Trying to achieve 100% automation is often more expensive than building good human-in-the-loop tooling.

4. Start Narrow, Then Expand

We started with 5 fields: fund name, manager, ISIN, management fee, and asset class. Once that worked reliably, we added liquidity terms, then performance scenarios, then bilingual fields. Each iteration taught us something new.

Don't try to extract 50 fields perfectly on day one.

Conclusion

Building myELTIF.de's extraction pipeline taught us that document AI in 2025 is incredibly powerful — but success requires thoughtful architecture, careful prompt engineering, and pragmatic human-in-the-loop design.

The combination of Document AI for structure, Gemini for understanding, and smart deduplication logic gets us to ~95% accuracy with 90-120 second processing times. We've now processed 250+ ELTIF KIDs and are scaling to hundreds more.

The result: Germany's first comprehensive ELTIF comparison platform, powered by AI that reads regulatory documents faster and more accurately than any human analyst.

What's Next for myELTIF

We're currently processing 250+ ELTIF KIDs. If you're an asset manager interested in getting your ELTIFs listed with verified partner status, reach out at contact@myeltif.de.

For technical founders building similar document extraction pipelines, I'm happy to discuss our learnings — contact@myeltif.de or connect with me on LinkedIn.

About myELTIF.de: We're building Germany's first comprehensive ELTIF comparison platform, making alternative investments accessible and transparent for retail investors.

Tech Stack: Python · FastAPI · Google Document AI · Google Gemini API · MongoDB · Next.js · Tailwind CSS