Kreuzberg
Added March 10, 2026 Source: Pawel Huryn
Kreuzberg helps you extract data from a wide array of document formats. It can pull out text, tables, metadata, and images, even performing OCR on scanned files. Use this when you need programmatic access to document content.
Installation
This skill has dependencies (scripts or reference files). Install using the method below to make sure everything is in place.
npx skills add kreuzberg-dev/kreuzberg --skill kreuzbergRequires Node.js 18+. The skills CLI auto-detects your editor and installs to the right directory.
Or install manually from the source repository.
SKILL.md (reference - install via npx or source for all dependencies)
---
name: kreuzberg
description: >-
Extract text, tables, metadata, and images from 75+ document formats
(PDF, Office, images, HTML, email, archives, academic) using Kreuzberg.
Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript,
Rust, or CLI. Covers installation, extraction (sync/async), configuration
(OCR, chunking, output format), batch processing, error handling, and plugins.
license: MIT
metadata:
author: kreuzberg-dev
version: "1.0"
repository: https://github.com/kreuzberg-dev/kreuzberg
---
# Kreuzberg Document Extraction
Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 75+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
Use this skill when writing code that:
- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)
## Installation
### Python
```bash
pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr] # EasyOCR
pip install kreuzberg[paddleocr] # PaddleOCR
```
### Node.js
```bash
npm install @kreuzberg/node
```
### Rust
```toml
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
# embeddings, language-detection, keywords-yake, keywords-rake
```
### CLI
```bash
# Download from GitHub releases, or:
cargo install kreuzberg-cli
```
## Quick Start
### Python (Async)
```python
from kreuzberg import extract_file
result = await extract_file("document.pdf")
print(result.content) # extracted text
print(result.metadata) # document metadata
print(result.tables) # extracted tables
```
### Python (Sync)
```python
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
print(result.content)
```
### Node.js
```typescript
import { extractFile } from '@kreuzberg/node';
const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
```
### Node.js (Sync)
```typescript
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf');
```
### Rust (Async)
```rust
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);
Ok(())
}
```
### Rust (Sync) — requires `tokio-runtime` feature
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let result = extract_file_sync("document.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
```
### CLI
```bash
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
```
## Configuration
All languages use the same configuration structure with language-appropriate naming conventions.
### Python (snake_case)
```python
from kreuzberg import (
ExtractionConfig, OcrConfig, TesseractConfig,
PdfConfig, ChunkingConfig,
)
config = ExtractionConfig(
ocr=OcrConfig(
backend="tesseract",
language="eng",
tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
),
pdf_options=PdfConfig(passwords=["secret123"]),
chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
output_format="markdown",
)
result = await extract_file("document.pdf", config=config)
```
### Node.js (camelCase)
```typescript
import { extractFile, type ExtractionConfig } from '@kreuzberg/node';
const config: ExtractionConfig = {
ocr: { backend: 'tesseract', language: 'eng' },
pdfOptions: { passwords: ['secret123'] },
chunking: { maxChars: 1000, maxOverlap: 200 },
outputFormat: 'markdown',
};
const result = await extractFile('document.pdf', null, config);
```
### Rust (snake_case)
```rust
use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".into(),
language: "eng".into(),
..Default::default()
}),
chunking: Some(ChunkingConfig {
max_characters: 1000,
overlap: 200,
..Default::default()
}),
output_format: OutputFormat::Markdown,
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
```
### Config File (TOML)
```toml
output_format = "markdown"
[ocr]
backend = "tesseract"
language = "eng"
[chunking]
max_chars = 1000
max_overlap = 200
[pdf_options]
passwords = ["secret123"]
```
```bash
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
```
## Batch Processing
### Python
```python
from kreuzberg import batch_extract_files, batch_extract_files_sync
# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])
# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
for result in results:
print(f"{len(result.content)} chars extracted")
```
### Node.js
```typescript
import { batchExtractFiles } from '@kreuzberg/node';
const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);
```
### Rust — requires `tokio-runtime` feature
```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};
let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
```
### CLI
```bash
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
```
## OCR
OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).
### Backends
- **Tesseract** (default): Built-in native binding. All Tesseract languages supported.
- **EasyOCR** (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
- **PaddleOCR** (Python only): `pip install kreuzberg[paddleocr]`. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
- **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`.
### Language Codes
```python
config = ExtractionConfig(ocr=OcrConfig(language="eng")) # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu")) # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all")) # All installed
```
### Force OCR
```python
config = ExtractionConfig(force_ocr=True) # OCR even if text is extractable
```
## ExtractionResult Fields
| Field | Python | Node.js | Rust | Description |
|-------|--------|---------|------|-------------|
| Text content | `result.content` | `result.content` | `result.content` | Extracted text (str/String) |
| MIME type | `result.mime_type` | `result.mimeType` | `result.mime_type` | Input document MIME type |
| Metadata | `result.metadata` | `result.metadata` | `result.metadata` | Document metadata (dict/object/HashMap) |
| Tables | `result.tables` | `result.tables` | `result.tables` | Extracted tables with cells + markdown |
| Languages | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled) |
| Chunks | `result.chunks` | `result.chunks` | `result.chunks` | Text chunks (if chunking enabled) |
| Images | `result.images` | `result.images` | `result.images` | Extracted images (if enabled) |
| Elements | `result.elements` | `result.elements` | `result.elements` | Semantic elements (if element_based format) |
| Pages | `result.pages` | `result.pages` | `result.pages` | Per-page content (if page extraction enabled) |
| Keywords | `result.keywords` | `result.keywords` | `result.keywords` | Extracted keywords (if enabled) |
## Error Handling
### Python
```python
from kreuzberg import (
extract_file_sync, KreuzbergError, ParsingError,
OCRError, ValidationError, MissingDependencyError,
)
try:
result = extract_file_sync("file.pdf")
except ParsingError as e:
print(f"Failed to parse: {e}")
except OCRError as e:
print(f"OCR failed: {e}")
except ValidationError as e:
print(f"Invalid input: {e}")
except MissingDependencyError as e:
print(f"Missing dependency: {e}")
except KreuzbergError as e:
print(f"Extraction failed: {e}")
```
### Node.js
```typescript
import {
extractFile, KreuzbergError, ParsingError,
OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';
try {
const result = await extractFile('file.pdf');
} catch (e) {
if (e instanceof ParsingError) { /* ... */ }
else if (e instanceof OcrError) { /* ... */ }
else if (e instanceof ValidationError) { /* ... */ }
else if (e instanceof KreuzbergError) { /* ... */ }
}
```
### Rust
```rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};
let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
Ok(result) => println!("{}", result.content),
Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
Err(e) => eprintln!("Error: {e}"),
}
```
## Common Pitfalls
1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
2. **Rust extract_file signature**: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
5. **CLI --format vs --output-format**: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
6. **Node.js extractFile signature**: `extractFile(path, mimeType?, config?)` — mimeType is the second arg (pass `null` to skip).
7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).
## Supported Formats (Summary)
| Category | Extensions |
|----------|-----------|
| **PDF** | `.pdf` |
| **Word** | `.docx`, `.odt` |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` |
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` |
| **eBooks** | `.epub`, `.fb2` |
| **Images** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` |
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml` |
| **Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` |
| **Text** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` |
| **Email** | `.eml`, `.msg` |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` |
| **Academic** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff` |
See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.
## Additional Resources
Detailed reference files for specific topics:
- **[Python API Reference](references/python-api.md)** — All functions, config classes, plugin protocols, exact signatures
- **[Node.js API Reference](references/nodejs-api.md)** — All functions, TypeScript interfaces, worker pool APIs
- **[Rust API Reference](references/rust-api.md)** — All functions with feature gates, structs, Cargo.toml examples
- **[CLI Reference](references/cli-reference.md)** — All commands, flags, config precedence, exit codes
- **[Configuration Reference](references/configuration.md)** — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
- **[Supported Formats](references/supported-formats.md)** — All 75+ formats with file extensions and MIME types
- **[Advanced Features](references/advanced-features.md)** — Plugins, embeddings, MCP server, API server, security limits
- **[Other Language Bindings](references/other-bindings.md)** — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker
Full documentation: https://docs.kreuzberg.dev
GitHub: https://github.com/kreuzberg-dev/kreuzberg
---
## Companion Files
The following reference files are included for convenience:
### references/other-bindings.md
# Language Bindings Reference
Kreuzberg provides native bindings for multiple programming languages, each with precompiled binaries for x86_64 and aarch64 on Linux and macOS. This reference covers installation and basic usage for each binding.
## Go
**Installation:**
```bash
go get github.com/kreuzberg-dev/kreuzberg/packages/go/v4
```
**Basic Extraction:**
```go
package main
import (
"context"
"fmt"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4/kreuzberg"
)
func main() {
ctx := context.Background()
result, err := kreuzberg.ExtractFile(ctx, "document.pdf", nil)
if err != nil {
panic(err)
}
fmt.Println(result.Content)
}
```
See the [Go binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go) for complete API reference.
## Ruby
**Installation:**
```bash
gem install kreuzberg
```
Or in your Gemfile:
```ruby
gem 'kreuzberg'
```
**Basic Extraction:**
```ruby
require 'kreuzberg'
result = Kreuzberg.extract_file_sync('document.pdf')
puts result.content
```
See the [Ruby binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/ruby) for complete API reference.
## Java
**Installation:**
Add to your Maven `pom.xml`:
```xml
<dependency>
<groupId>dev.kreuzberg</groupId>
<artifactId>kreuzberg</artifactId>
<version>4.2.x</version>
</dependency>
```
**Basic Extraction:**
```java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
public class Example {
public static void main(String[] args) throws Exception {
ExtractionResult result = Kreuzberg.extractFile("document.pdf");
System.out.println(result.getContent());
}
}
```
See the [Java binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/java) for complete API reference.
## C#
**Installation:**
```bash
dotnet add package Kreuzberg
```
**Basic Extraction:**
```csharp
using Kreuzberg;
var result = KreuzbergClient.ExtractFileSync("document.pdf");
Console.WriteLine(result.Content);
```
See the [C# binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/csharp) for complete API reference.
## PHP
**Installation:**
```bash
composer require kreuzberg/kreuzberg
```
**Basic Extraction:**
```php
<?php
require 'vendor/autoload.php';
use Kreuzberg\Kreuzberg;
$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');
echo $result->content;
```
See the [PHP binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/php) for complete API reference.
## Elixir
**Installation:**
Add to your `mix.exs` dependencies:
```elixir
def deps do
[
kreuzberg: "~> 4.2"
]
end
```
**Basic Extraction:**
```elixir
{:ok, result} = Kreuzberg.extract_file("document.pdf")
IO.puts(result.content)
```
See the [Elixir binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/elixir) for complete API reference.
## WebAssembly (WASM)
**Installation:**
```bash
npm install @kreuzberg/wasm
```
**Basic Extraction:**
```typescript
import { extractBytes } from '@kreuzberg/wasm';
const fileData = await fs.promises.readFile('document.pdf');
const result = await extractBytes(fileData, 'application/pdf');
console.log(result.content);
```
Supports browsers, Deno, and Cloudflare Workers.
See the [WASM binding documentation](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/typescript) for complete API reference.
## Docker
**Installation:**
Pull the official image from GitHub Container Registry:
```bash
docker pull ghcr.io/kreuzberg-dev/kreuzberg
```
**API Server Mode:**
```bash
docker run -p 8000:8000 ghcr.io/kreuzberg-dev/kreuzberg serve --host 0.0.0.0
```
**CLI Mode:**
```bash
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg extract /data/document.pdf
```
**MCP Server Mode:**
```bash
docker run -i ghcr.io/kreuzberg-dev/kreuzberg mcp
```
Image sizes:
- Core image: 1.0-1.3GB
- Full image: ~1.0-1.3GB
See the [Docker guide](https://docs.kreuzberg.dev/guides/docker/) for deployment details.
## Platform Support
All language bindings include precompiled binaries for x86_64 and aarch64 on Linux and macOS. Windows support varies by binding. Refer to the main [README](https://github.com/kreuzberg-dev/kreuzberg) for platform compatibility matrix.
Originally by Pawel Huryn, adapted here as an Agent Skills compatible SKILL.md.
Works with
Agent Skills format — supported by 20+ editors. Learn more