Kreuzberg

Added March 10, 2026 Source: Pawel Huryn

Kreuzberg helps you extract data from a wide array of document formats. It can pull out text, tables, metadata, and images, even performing OCR on scanned files. Use this when you need programmatic access to document content.

Installation

This skill has dependencies (scripts or reference files). Install using the method below to make sure everything is in place.

npx skills add kreuzberg-dev/kreuzberg --skill kreuzberg

Requires Node.js 18+. The skills CLI auto-detects your editor and installs to the right directory.

Or install manually from the source repository.

SKILL.md (reference - install via npx or source for all dependencies)

---
name: kreuzberg
description: >-
  Extract text, tables, metadata, and images from 75+ document formats
  (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg.
  Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript,
  Rust, or CLI. Covers installation, extraction (sync/async), configuration
  (OCR, chunking, output format), batch processing, error handling, and plugins.
license: MIT
metadata:
  author: kreuzberg-dev
  version: "1.0"
  repository: https://github.com/kreuzberg-dev/kreuzberg
---

# Kreuzberg Document Extraction

Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 75+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.

Use this skill when writing code that:
- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)

## Installation

### Python
```bash
pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr]    # EasyOCR
pip install kreuzberg[paddleocr]  # PaddleOCR
```

### Node.js
```bash
npm install @kreuzberg/node
```

### Rust
```toml
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
#           embeddings, language-detection, keywords-yake, keywords-rake
```

### CLI
```bash
# Download from GitHub releases, or:
cargo install kreuzberg-cli
```

## Quick Start

### Python (Async)
```python
from kreuzberg import extract_file

result = await extract_file("document.pdf")
print(result.content)       # extracted text
print(result.metadata)      # document metadata
print(result.tables)        # extracted tables
```

### Python (Sync)
```python
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)
```

### Node.js
```typescript
import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
```

### Node.js (Sync)
```typescript
import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf');
```

### Rust (Async)
```rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
```

### Rust (Sync) — requires `tokio-runtime` feature
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

### CLI
```bash
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
```

## Configuration

All languages use the same configuration structure with language-appropriate naming conventions.

### Python (snake_case)
```python
from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)

result = await extract_file("document.pdf", config=config)
```

### Node.js (camelCase)
```typescript
import { extractFile, type ExtractionConfig } from '@kreuzberg/node';

const config: ExtractionConfig = {
    ocr: { backend: 'tesseract', language: 'eng' },
    pdfOptions: { passwords: ['secret123'] },
    chunking: { maxChars: 1000, maxOverlap: 200 },
    outputFormat: 'markdown',
};

const result = await extractFile('document.pdf', null, config);
```

### Rust (snake_case)
```rust
use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

let result = extract_file("document.pdf", None, &config).await?;
```

### Config File (TOML)
```toml
output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]
```

```bash
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
```

## Batch Processing

### Python
```python
from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:
    print(f"{len(result.content)} chars extracted")
```

### Node.js
```typescript
import { batchExtractFiles } from '@kreuzberg/node';

const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);
```

### Rust — requires `tokio-runtime` feature
```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
```

### CLI
```bash
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
```

## OCR

OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).

### Backends
- **Tesseract** (default): Built-in native binding. All Tesseract languages supported.
- **EasyOCR** (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
- **PaddleOCR** (Python only): `pip install kreuzberg[paddleocr]`. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
- **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`.

### Language Codes
```python
config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all"))       # All installed
```

### Force OCR
```python
config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable
```

## ExtractionResult Fields

| Field | Python | Node.js | Rust | Description |
|-------|--------|---------|------|-------------|
| Text content | `result.content` | `result.content` | `result.content` | Extracted text (str/String) |
| MIME type | `result.mime_type` | `result.mimeType` | `result.mime_type` | Input document MIME type |
| Metadata | `result.metadata` | `result.metadata` | `result.metadata` | Document metadata (dict/object/HashMap) |
| Tables | `result.tables` | `result.tables` | `result.tables` | Extracted tables with cells + markdown |
| Languages | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled) |
| Chunks | `result.chunks` | `result.chunks` | `result.chunks` | Text chunks (if chunking enabled) |
| Images | `result.images` | `result.images` | `result.images` | Extracted images (if enabled) |
| Elements | `result.elements` | `result.elements` | `result.elements` | Semantic elements (if element_based format) |
| Pages | `result.pages` | `result.pages` | `result.pages` | Per-page content (if page extraction enabled) |
| Keywords | `result.keywords` | `result.keywords` | `result.keywords` | Extracted keywords (if enabled) |

## Error Handling

### Python
```python
from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)

try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
```

### Node.js
```typescript
import {
    extractFile, KreuzbergError, ParsingError,
    OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';

try {
    const result = await extractFile('file.pdf');
} catch (e) {
    if (e instanceof ParsingError) { /* ... */ }
    else if (e instanceof OcrError) { /* ... */ }
    else if (e instanceof ValidationError) { /* ... */ }
    else if (e instanceof KreuzbergError) { /* ... */ }
}
```

### Rust
```rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}
```

## Common Pitfalls

1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
2. **Rust extract_file signature**: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
5. **CLI --format vs --output-format**: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
6. **Node.js extractFile signature**: `extractFile(path, mimeType?, config?)` — mimeType is the second arg (pass `null` to skip).
7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).

## Supported Formats (Summary)

| Category | Extensions |
|----------|-----------|
| **PDF** | `.pdf` |
| **Word** | `.docx`, `.odt` |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` |
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` |
| **eBooks** | `.epub`, `.fb2` |
| **Images** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` |
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml` |
| **Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` |
| **Text** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` |
| **Email** | `.eml`, `.msg` |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` |
| **Academic** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff` |

See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.

## Additional Resources

Detailed reference files for specific topics:

- **[Python API Reference](references/python-api.md)** — All functions, config classes, plugin protocols, exact signatures
- **[Node.js API Reference](references/nodejs-api.md)** — All functions, TypeScript interfaces, worker pool APIs
- **[Rust API Reference](references/rust-api.md)** — All functions with feature gates, structs, Cargo.toml examples
- **[CLI Reference](references/cli-reference.md)** — All commands, flags, config precedence, exit codes
- **[Configuration Reference](references/configuration.md)** — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
- **[Supported Formats](references/supported-formats.md)** — All 75+ formats with file extensions and MIME types
- **[Advanced Features](references/advanced-features.md)** — Plugins, embeddings, MCP server, API server, security limits
- **[Other Language Bindings](references/other-bindings.md)** — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker

Full documentation: https://docs.kreuzberg.dev
GitHub: https://github.com/kreuzberg-dev/kreuzberg


---

## Companion Files

The following reference files are included for convenience:

### references/other-bindings.md

# Language Bindings Reference

Kreuzberg provides native bindings for multiple programming languages, each with precompiled binaries for x86_64 and aarch64 on Linux and macOS. This reference covers installation and basic usage for each binding.

## Go

**Installation:**
```bash
go get github.com/kreuzberg-dev/kreuzberg/packages/go/v4
```

**Basic Extraction:**
```go
package main

import (
    "context"
    "fmt"
    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4/kreuzberg"
)

func main() {
    ctx := context.Background()
    result, err := kreuzberg.ExtractFile(ctx, "document.pdf", nil)
    if err != nil {
        panic(err)
    }
    fmt.Println(result.Content)
}
```

See the [Go binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go) for complete API reference.

## Ruby

**Installation:**
```bash
gem install kreuzberg
```

Or in your Gemfile:
```ruby
gem 'kreuzberg'
```

**Basic Extraction:**
```ruby
require 'kreuzberg'

result = Kreuzberg.extract_file_sync('document.pdf')
puts result.content
```

See the [Ruby binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/ruby) for complete API reference.

## Java

**Installation:**
Add to your Maven `pom.xml`:
```xml
<dependency>
    <groupId>dev.kreuzberg</groupId>
    <artifactId>kreuzberg</artifactId>
    <version>4.2.x</version>
</dependency>
```

**Basic Extraction:**
```java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;

public class Example {
    public static void main(String[] args) throws Exception {
        ExtractionResult result = Kreuzberg.extractFile("document.pdf");
        System.out.println(result.getContent());
    }
}
```

See the [Java binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/java) for complete API reference.

## C#

**Installation:**
```bash
dotnet add package Kreuzberg
```

**Basic Extraction:**
```csharp
using Kreuzberg;

var result = KreuzbergClient.ExtractFileSync("document.pdf");
Console.WriteLine(result.Content);
```

See the [C# binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/csharp) for complete API reference.

## PHP

**Installation:**
```bash
composer require kreuzberg/kreuzberg
```

**Basic Extraction:**
```php
<?php
require 'vendor/autoload.php';

use Kreuzberg\Kreuzberg;

$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');
echo $result->content;
```

See the [PHP binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/php) for complete API reference.

## Elixir

**Installation:**
Add to your `mix.exs` dependencies:
```elixir
def deps do
  [
    kreuzberg: "~> 4.2"
  ]
end
```

**Basic Extraction:**
```elixir
{:ok, result} = Kreuzberg.extract_file("document.pdf")
IO.puts(result.content)
```

See the [Elixir binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/elixir) for complete API reference.

## WebAssembly (WASM)

**Installation:**
```bash
npm install @kreuzberg/wasm
```

**Basic Extraction:**
```typescript
import { extractBytes } from '@kreuzberg/wasm';

const fileData = await fs.promises.readFile('document.pdf');
const result = await extractBytes(fileData, 'application/pdf');
console.log(result.content);
```

Supports browsers, Deno, and Cloudflare Workers.

See the [WASM binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/typescript) for complete API reference.

## Docker

**Installation:**
Pull the official image from GitHub Container Registry:
```bash
docker pull ghcr.io/kreuzberg-dev/kreuzberg
```

**API Server Mode:**
```bash
docker run -p 8000:8000 ghcr.io/kreuzberg-dev/kreuzberg serve --host 0.0.0.0
```

**CLI Mode:**
```bash
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg extract /data/document.pdf
```

**MCP Server Mode:**
```bash
docker run -i ghcr.io/kreuzberg-dev/kreuzberg mcp
```

Image sizes:
- Core image: 1.0-1.3GB
- Full image: ~1.0-1.3GB

See the [Docker guide](https://docs.kreuzberg.dev/guides/docker/) for deployment details.

## Platform Support

All language bindings include precompiled binaries for x86_64 and aarch64 on Linux and macOS. Windows support varies by binding. Refer to the main [README](https://github.com/kreuzberg-dev/kreuzberg) for platform compatibility matrix.

Originally by Pawel Huryn, adapted here as an Agent Skills compatible SKILL.md.

Works with

Claude Code Cursor Codex CLI Gemini CLI VS Code Windsurf Amp Roo Code Goose Trae OpenCode Spring AI

Agent Skills format — supported by 20+ editors. Learn more