Files
fil/docs/reference/file-size-limits.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

566 lines
15 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# File Size Limits
Kreuzberg enforces size limits on file uploads and API requests to manage server resources effectively. This page documents the default limits, how to configure them, and recommendations for optimal performance.
## Overview
File size limits protect your server from resource exhaustion and unexpected memory spikes. The Kreuzberg API implements two complementary limit types:
| Limit Type | Purpose | Default |
| ------------------------- | ------------------------------------------- | ------- |
| **Request Body Limit** | Total size of all files in a single request | 100 MB |
| **Multipart Field Limit** | Maximum size of an individual file | 100 MB |
Both limits are configurable via environment variables (`KREUZBERG_MAX_REQUEST_BODY_BYTES`, `KREUZBERG_MAX_MULTIPART_FIELD_BYTES`) or programmatically via the `ApiSizeLimits` type.
## Default Configuration
### Default Limits: 100 MB
The default configuration allows:
- **Total request body:** 100 MB (104,857,600 bytes)
- **Individual file:** 100 MB (104,857,600 bytes)
These defaults are suitable for typical document processing workloads including:
- Standard PDF documents and scanned pages
- Office documents (Word, Excel, PowerPoint)
- High-resolution images
- Single document uploads and small batches
### When to Increase
Increase limits to process:
- **Large scanned document archives** (200+ MB)
- **High-resolution images** (50+ MB each)
- **Video presentations** (500+ MB)
- **Bulk batch uploads** (multiple 50 MB documents)
### When to Decrease
Decrease limits if:
- You want to enforce strict file size policies
- Your server has limited memory
- You're processing only small structured documents
- You need to rate-limit aggressive clients
## Configuration Methods
### 1. Environment Variable (Simplest)
Set the `KREUZBERG_MAX_MULTIPART_FIELD_BYTES` environment variable to specify the max multipart field size in bytes:
```bash title="Terminal"
# Set to 200 MB
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=209715200
kreuzberg serve -H 0.0.0.0 -p 8000
# Set to 500 MB for large documents
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=524288000
kreuzberg serve -H 0.0.0.0 -p 8000
```
### 2. Docker Compose
Configure limits in your Docker Compose setup:
```yaml title="docker-compose.yaml"
version: "3.8"
services:
kreuzberg-api:
image: ghcr.io/kreuzberg-dev/kreuzberg:latest
ports:
- "8000:8000"
environment:
# Set maximum multipart field size to 500 MB
KREUZBERG_MAX_MULTIPART_FIELD_BYTES: "524288000"
# Configure CORS for production
KREUZBERG_CORS_ORIGINS: "https://myapp.com,https://api.myapp.com"
volumes:
- ./cache:/app/.kreuzberg
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
```
### 3. Kubernetes Deployment
Configure size limits in your Kubernetes deployment:
```yaml title="kubernetes-deployment.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
name: kreuzberg-api
spec:
replicas: 3
template:
spec:
containers:
- name: kreuzberg
image: ghcr.io/kreuzberg-dev/kreuzberg:latest
env:
- name: KREUZBERG_MAX_MULTIPART_FIELD_BYTES
value: "524288000"
- name: KREUZBERG_CORS_ORIGINS
value: "https://myapp.com"
resources:
limits:
memory: "2Gi"
cpu: "2000m"
```
### 4. Programmatic Configuration
=== "C#"
```csharp
using Kreuzberg;
// Create limits: 50 MB for both request body and individual files
var limits = ApiSizeLimits.FromMB(50, 50);
// Or create with custom byte values
var customLimits = new ApiSizeLimits
{
MaxRequestBodyBytes = 100 * 1024 * 1024, // 100 MB
MaxMultipartFieldBytes = 100 * 1024 * 1024 // 100 MB
};
```
=== "Go"
```go
import "kreuzberg"
// Create limits: 200 MB for both request body and individual files
limits := kreuzberg.NewApiSizeLimits(
200 * 1024 * 1024, // max_request_body_bytes
200 * 1024 * 1024, // max_multipart_field_bytes
)
// Or use convenience method with MB values
limits := kreuzberg.ApiSizeLimitsFromMB(200, 200)
```
=== "Java"
```java
import com.kreuzberg.api.ApiSizeLimits;
// Create limits: 200 MB for both request body and individual files
ApiSizeLimits limits = new ApiSizeLimits(
200 * 1024 * 1024, // maxRequestBodyBytes
200 * 1024 * 1024 // maxMultipartFieldBytes
);
// Or use convenience method with MB values
ApiSizeLimits limits = ApiSizeLimits.fromMB(200, 200);
```
=== "Python"
```python
from kreuzberg.api import ApiSizeLimits, create_router_with_limits
from kreuzberg import ExtractionConfig
# Create limits: 200 MB for both request body and individual files
limits = ApiSizeLimits.from_mb(200, 200)
# Or create with custom byte values
limits = ApiSizeLimits(
max_request_body_bytes=200 * 1024 * 1024,
max_multipart_field_bytes=200 * 1024 * 1024
)
# Create router with custom limits
config = ExtractionConfig()
router = create_router_with_limits(config, limits)
```
=== "Ruby"
```ruby
require 'kreuzberg'
# Create limits: 200 MB for both request body and individual files
limits = Kreuzberg::Api::ApiSizeLimits.from_mb(200, 200)
# Or create with custom byte values
limits = Kreuzberg::Api::ApiSizeLimits.new(
max_request_body_bytes: 200 * 1024 * 1024,
max_multipart_field_bytes: 200 * 1024 * 1024
)
```
=== "Rust"
```rust
use kreuzberg::{ExtractionConfig, api::{create_router_with_limits, ApiSizeLimits}};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create limits: 200 MB for both request body and individual files
let limits = ApiSizeLimits::from_mb(200, 200);
// Or create with custom byte values
let limits = ApiSizeLimits::new(
200 * 1024 * 1024, // max_request_body_bytes
200 * 1024 * 1024, // max_multipart_field_bytes
);
let config = ExtractionConfig::default();
let router = create_router_with_limits(config, limits);
Ok(())
}
```
=== "TypeScript"
```typescript
import { ApiSizeLimits, createRouterWithLimits } from 'kreuzberg';
// Create limits: 200 MB for both request body and individual files
const limits = ApiSizeLimits.fromMb(200, 200);
// Or create with custom byte values
const limits = new ApiSizeLimits({
maxRequestBodyBytes: 200 * 1024 * 1024,
maxMultipartFieldBytes: 200 * 1024 * 1024
});
// Create router with custom limits
const router = createRouterWithLimits(config, limits);
```
## Configuration Scenarios
### Small Documents (Default)
For standard business documents and PDFs under 50 MB:
```bash title="Terminal"
# Use default 100 MB (no configuration needed)
kreuzberg serve -H 0.0.0.0 -p 8000
```
### Medium Documents
For typical scanned document batches and office files up to 200 MB:
```bash title="Terminal"
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=209715200
kreuzberg serve -H 0.0.0.0 -p 8000
```
### Large Scans and Archives
For high-resolution scans, video content, and large archives up to 1 GB:
```bash title="Terminal"
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=1048576000
kreuzberg serve -H 0.0.0.0 -p 8000
```
### Constrained Environments
For development environments or memory-limited servers:
```bash title="Terminal"
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=52428800
kreuzberg serve -H 0.0.0.0 -p 8000
```
## Performance Considerations
### Memory Usage
File size limits directly impact memory consumption:
- **Larger limits** require more RAM to buffer request bodies
- **Streaming extraction** processes files incrementally, reducing peak memory
- **Batch requests** with multiple files consume memory for all files simultaneously
#### Memory Impact Examples
| Upload Limit | Memory Impact | Recommended RAM |
| ---------------- | ------------------------ | --------------- |
| 50 MB | ~50-100 MB per request | 512 MB |
| 100 MB (default) | ~100-200 MB per request | 1 GB |
| 500 MB | ~500 MB-1 GB per request | 2-4 GB |
| 1000 MB | ~1-2 GB per request | 4-8 GB |
### Handling Large Files
When processing very large files (multi-GB):
1. **Allocate adequate RAM** - Use the memory impact table above as a guideline
2. **Increase timeouts** - Large files take longer to upload and process
3. **Monitor concurrency** - Limit concurrent uploads to prevent memory exhaustion
4. **Use streaming** - Where possible, process files streaming to reduce memory peaks
### Docker Memory Limits
Configure Docker resource limits appropriately:
```yaml title="docker-compose.yaml"
services:
kreuzberg-api:
image: ghcr.io/kreuzberg-dev/kreuzberg:latest
environment:
KREUZBERG_MAX_MULTIPART_FIELD_BYTES: "524288000"
deploy:
resources:
limits:
memory: 4G # Limit container to 4 GB
cpus: "2" # Limit to 2 CPU cores
reservations:
memory: 2G # Reserve 2 GB minimum
cpus: "1" # Reserve 1 CPU core
```
### Reverse Proxy Configuration
When using a reverse proxy (Nginx, Caddy), ensure proxy limits match or exceed Kreuzberg's limits:
**Nginx:**
```nginx title="nginx.conf"
server {
listen 443 ssl http2;
server_name api.example.com;
# Match or exceed Kreuzberg's limit
client_max_body_size 500M;
location / {
proxy_pass http://kreuzberg;
# Extended timeouts for large file processing
proxy_read_timeout 300s;
proxy_send_timeout 300s;
proxy_request_buffering off; # Stream instead of buffer
}
}
```
**Caddy:**
```caddy title="Caddyfile"
api.example.com {
reverse_proxy localhost:8000 {
# Match Kreuzberg's limit
max_body_size 500MB
# Enable streaming for large files
flush_interval -1
}
}
```
## Error Handling
### Exceeding Limits
When a request exceeds configured limits, the server returns a 413 Payload Too Large error:
```bash title="Terminal"
# Try to upload a 500 MB file with 100 MB default limit
curl -F "files=@large_file_500mb.zip" http://localhost:8000/extract
# Response (HTTP 413)
HTTP/1.1 413 Payload Too Large
Content-Type: application/json
{
"error_type": "ValidationError",
"message": "Request body exceeds maximum allowed size",
"status_code": 413
}
```
### Client-Side Validation
Validate file sizes before upload to provide better user experience:
=== "Python"
```python
import os
from pathlib import Path
def validate_file_size(file_path: str, max_size_mb: int) -> bool:
"""Check if file size is within limits."""
file_size_bytes = os.path.getsize(file_path)
file_size_mb = file_size_bytes / (1024 * 1024)
if file_size_mb > max_size_mb:
print(f"File {Path(file_path).name} exceeds {max_size_mb} MB limit")
return False
return True
# Validate before upload
if validate_file_size("document.pdf", max_size_mb=100):
# Proceed with upload
pass
```
=== "TypeScript"
```typescript
function validateFileSize(file: File, maxSizeMB: number): boolean {
const fileSizeMB = file.size / (1024 * 1024);
if (fileSizeMB > maxSizeMB) {
console.error(`File ${file.name} exceeds ${maxSizeMB} MB limit`);
return false;
}
return true;
}
// Validate before upload
const fileInput = document.getElementById('fileInput') as HTMLInputElement;
fileInput.addEventListener('change', (e) => {
const file = (e.target as HTMLInputElement).files?.[0];
if (file && validateFileSize(file, 100)) {
// Proceed with upload
}
});
```
=== "Go"
```go
import "os"
import "fmt"
func validateFileSize(filePath string, maxSizeMB int64) bool {
fileInfo, err := os.Stat(filePath)
if err != nil {
return false
}
fileSizeMB := fileInfo.Size() / (1024 * 1024)
if fileSizeMB > maxSizeMB {
fmt.Printf("File exceeds %d MB limit\n", maxSizeMB)
return false
}
return true
}
// Validate before upload
if validateFileSize("document.pdf", 100) {
// Proceed with upload
}
```
## Troubleshooting
### "Request body exceeds maximum allowed size"
**Problem:** Upload fails with HTTP 413 error
**Solutions:**
1. **Increase limit:**
```bash
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=524288000
```
2. **Check reverse proxy limits:**
```nginx
# Nginx: ensure client_max_body_size matches or exceeds Kreuzberg limit
client_max_body_size 500M;
```
3. **Validate file size before upload:**
```bash
# Check actual file size
ls -lh document.pdf
```
### Server crashes with large files
**Problem:** Memory exhaustion when processing large files
**Solutions:**
1. **Increase container memory:**
```yaml
deploy:
resources:
limits:
memory: 4G
```
2. **Reduce upload limit:**
```bash
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=209715200
```
3. **Process files sequentially:**
- Send one file per request instead of batch uploads
- Implement request queuing at the application level
4. **Monitor memory usage:**
```bash
# Docker
docker stats kreuzberg-api
# Kubernetes
kubectl top pod kreuzberg-api-xxxxx
```
### Slow uploads
**Problem:** Large file uploads timeout
**Solutions:**
1. **Increase reverse proxy timeouts:**
```nginx
proxy_read_timeout 600s; # 10 minutes
proxy_send_timeout 600s;
```
2. **Enable streaming:**
```nginx
proxy_request_buffering off;
```
3. **Check network bandwidth:**
- For a 500 MB file over a 10 Mbps connection: 500 MB × 8 bits/byte ÷ 10 Mbps = ~400 seconds
## Best Practices
1. **Match limits to use case:** Set limits based on your actual file sizes, not theoretical maximums
2. **Monitor and adjust:** Track actual file sizes and adjust limits quarterly
3. **Use reverse proxy buffering:** Configure reverse proxies to handle buffering, not Kreuzberg
4. **Implement client-side validation:** Validate file sizes before sending to server
5. **Plan for scaling:** Run multiple Kreuzberg instances behind a load balancer for high-throughput scenarios
6. **Set appropriate timeouts:** Increase timeouts for large files (5-10 minutes recommended)
7. **Document your limits:** Keep configuration in version control with clear documentation
8. **Test with real files:** Test with actual document types you'll process in production
9. **Monitor disk space:** Large files consume both RAM and disk (if streaming to disk)
10. **Consider compression:** If applicable, compress large document batches before upload
## See Also
- [Configuration Guide](../guides/configuration.md) - Extraction configuration options
- [API Server Guide](../guides/api-server.md) - Complete API server documentation
- [Docker Deployment](../guides/docker.md) - Docker setup and configuration
- [Performance Tuning](../guides/api-server.md) - Advanced performance optimization