This commit is contained in:
565
docs/reference/file-size-limits.md
Normal file
565
docs/reference/file-size-limits.md
Normal file
@@ -0,0 +1,565 @@
|
||||
# File Size Limits
|
||||
|
||||
Kreuzberg enforces size limits on file uploads and API requests to manage server resources effectively. This page documents the default limits, how to configure them, and recommendations for optimal performance.
|
||||
|
||||
## Overview
|
||||
|
||||
File size limits protect your server from resource exhaustion and unexpected memory spikes. The Kreuzberg API implements two complementary limit types:
|
||||
|
||||
| Limit Type | Purpose | Default |
|
||||
| ------------------------- | ------------------------------------------- | ------- |
|
||||
| **Request Body Limit** | Total size of all files in a single request | 100 MB |
|
||||
| **Multipart Field Limit** | Maximum size of an individual file | 100 MB |
|
||||
|
||||
Both limits are configurable via environment variables (`KREUZBERG_MAX_REQUEST_BODY_BYTES`, `KREUZBERG_MAX_MULTIPART_FIELD_BYTES`) or programmatically via the `ApiSizeLimits` type.
|
||||
|
||||
## Default Configuration
|
||||
|
||||
### Default Limits: 100 MB
|
||||
|
||||
The default configuration allows:
|
||||
|
||||
- **Total request body:** 100 MB (104,857,600 bytes)
|
||||
- **Individual file:** 100 MB (104,857,600 bytes)
|
||||
|
||||
These defaults are suitable for typical document processing workloads including:
|
||||
|
||||
- Standard PDF documents and scanned pages
|
||||
- Office documents (Word, Excel, PowerPoint)
|
||||
- High-resolution images
|
||||
- Single document uploads and small batches
|
||||
|
||||
### When to Increase
|
||||
|
||||
Increase limits to process:
|
||||
|
||||
- **Large scanned document archives** (200+ MB)
|
||||
- **High-resolution images** (50+ MB each)
|
||||
- **Video presentations** (500+ MB)
|
||||
- **Bulk batch uploads** (multiple 50 MB documents)
|
||||
|
||||
### When to Decrease
|
||||
|
||||
Decrease limits if:
|
||||
|
||||
- You want to enforce strict file size policies
|
||||
- Your server has limited memory
|
||||
- You're processing only small structured documents
|
||||
- You need to rate-limit aggressive clients
|
||||
|
||||
## Configuration Methods
|
||||
|
||||
### 1. Environment Variable (Simplest)
|
||||
|
||||
Set the `KREUZBERG_MAX_MULTIPART_FIELD_BYTES` environment variable to specify the max multipart field size in bytes:
|
||||
|
||||
```bash title="Terminal"
|
||||
# Set to 200 MB
|
||||
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=209715200
|
||||
kreuzberg serve -H 0.0.0.0 -p 8000
|
||||
|
||||
# Set to 500 MB for large documents
|
||||
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=524288000
|
||||
kreuzberg serve -H 0.0.0.0 -p 8000
|
||||
```
|
||||
|
||||
### 2. Docker Compose
|
||||
|
||||
Configure limits in your Docker Compose setup:
|
||||
|
||||
```yaml title="docker-compose.yaml"
|
||||
version: "3.8"
|
||||
services:
|
||||
kreuzberg-api:
|
||||
image: ghcr.io/kreuzberg-dev/kreuzberg:latest
|
||||
ports:
|
||||
- "8000:8000"
|
||||
environment:
|
||||
# Set maximum multipart field size to 500 MB
|
||||
KREUZBERG_MAX_MULTIPART_FIELD_BYTES: "524288000"
|
||||
# Configure CORS for production
|
||||
KREUZBERG_CORS_ORIGINS: "https://myapp.com,https://api.myapp.com"
|
||||
volumes:
|
||||
- ./cache:/app/.kreuzberg
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
```
|
||||
|
||||
### 3. Kubernetes Deployment
|
||||
|
||||
Configure size limits in your Kubernetes deployment:
|
||||
|
||||
```yaml title="kubernetes-deployment.yaml"
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: kreuzberg-api
|
||||
spec:
|
||||
replicas: 3
|
||||
template:
|
||||
spec:
|
||||
containers:
|
||||
- name: kreuzberg
|
||||
image: ghcr.io/kreuzberg-dev/kreuzberg:latest
|
||||
env:
|
||||
- name: KREUZBERG_MAX_MULTIPART_FIELD_BYTES
|
||||
value: "524288000"
|
||||
- name: KREUZBERG_CORS_ORIGINS
|
||||
value: "https://myapp.com"
|
||||
resources:
|
||||
limits:
|
||||
memory: "2Gi"
|
||||
cpu: "2000m"
|
||||
```
|
||||
|
||||
### 4. Programmatic Configuration
|
||||
|
||||
=== "C#"
|
||||
|
||||
```csharp
|
||||
using Kreuzberg;
|
||||
|
||||
// Create limits: 50 MB for both request body and individual files
|
||||
var limits = ApiSizeLimits.FromMB(50, 50);
|
||||
|
||||
// Or create with custom byte values
|
||||
var customLimits = new ApiSizeLimits
|
||||
{
|
||||
MaxRequestBodyBytes = 100 * 1024 * 1024, // 100 MB
|
||||
MaxMultipartFieldBytes = 100 * 1024 * 1024 // 100 MB
|
||||
};
|
||||
```
|
||||
|
||||
=== "Go"
|
||||
|
||||
```go
|
||||
import "kreuzberg"
|
||||
|
||||
// Create limits: 200 MB for both request body and individual files
|
||||
limits := kreuzberg.NewApiSizeLimits(
|
||||
200 * 1024 * 1024, // max_request_body_bytes
|
||||
200 * 1024 * 1024, // max_multipart_field_bytes
|
||||
)
|
||||
|
||||
// Or use convenience method with MB values
|
||||
limits := kreuzberg.ApiSizeLimitsFromMB(200, 200)
|
||||
```
|
||||
|
||||
=== "Java"
|
||||
|
||||
```java
|
||||
import com.kreuzberg.api.ApiSizeLimits;
|
||||
|
||||
// Create limits: 200 MB for both request body and individual files
|
||||
ApiSizeLimits limits = new ApiSizeLimits(
|
||||
200 * 1024 * 1024, // maxRequestBodyBytes
|
||||
200 * 1024 * 1024 // maxMultipartFieldBytes
|
||||
);
|
||||
|
||||
// Or use convenience method with MB values
|
||||
ApiSizeLimits limits = ApiSizeLimits.fromMB(200, 200);
|
||||
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python
|
||||
from kreuzberg.api import ApiSizeLimits, create_router_with_limits
|
||||
from kreuzberg import ExtractionConfig
|
||||
|
||||
# Create limits: 200 MB for both request body and individual files
|
||||
limits = ApiSizeLimits.from_mb(200, 200)
|
||||
|
||||
# Or create with custom byte values
|
||||
limits = ApiSizeLimits(
|
||||
max_request_body_bytes=200 * 1024 * 1024,
|
||||
max_multipart_field_bytes=200 * 1024 * 1024
|
||||
)
|
||||
|
||||
# Create router with custom limits
|
||||
config = ExtractionConfig()
|
||||
router = create_router_with_limits(config, limits)
|
||||
```
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
```ruby
|
||||
require 'kreuzberg'
|
||||
|
||||
# Create limits: 200 MB for both request body and individual files
|
||||
limits = Kreuzberg::Api::ApiSizeLimits.from_mb(200, 200)
|
||||
|
||||
# Or create with custom byte values
|
||||
limits = Kreuzberg::Api::ApiSizeLimits.new(
|
||||
max_request_body_bytes: 200 * 1024 * 1024,
|
||||
max_multipart_field_bytes: 200 * 1024 * 1024
|
||||
)
|
||||
```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust
|
||||
use kreuzberg::{ExtractionConfig, api::{create_router_with_limits, ApiSizeLimits}};
|
||||
|
||||
#[tokio::main]
|
||||
async fn main() -> Result<(), Box<dyn std::error::Error>> {
|
||||
// Create limits: 200 MB for both request body and individual files
|
||||
let limits = ApiSizeLimits::from_mb(200, 200);
|
||||
|
||||
// Or create with custom byte values
|
||||
let limits = ApiSizeLimits::new(
|
||||
200 * 1024 * 1024, // max_request_body_bytes
|
||||
200 * 1024 * 1024, // max_multipart_field_bytes
|
||||
);
|
||||
|
||||
let config = ExtractionConfig::default();
|
||||
let router = create_router_with_limits(config, limits);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript
|
||||
import { ApiSizeLimits, createRouterWithLimits } from 'kreuzberg';
|
||||
|
||||
// Create limits: 200 MB for both request body and individual files
|
||||
const limits = ApiSizeLimits.fromMb(200, 200);
|
||||
|
||||
// Or create with custom byte values
|
||||
const limits = new ApiSizeLimits({
|
||||
maxRequestBodyBytes: 200 * 1024 * 1024,
|
||||
maxMultipartFieldBytes: 200 * 1024 * 1024
|
||||
});
|
||||
|
||||
// Create router with custom limits
|
||||
const router = createRouterWithLimits(config, limits);
|
||||
```
|
||||
|
||||
## Configuration Scenarios
|
||||
|
||||
### Small Documents (Default)
|
||||
|
||||
For standard business documents and PDFs under 50 MB:
|
||||
|
||||
```bash title="Terminal"
|
||||
# Use default 100 MB (no configuration needed)
|
||||
kreuzberg serve -H 0.0.0.0 -p 8000
|
||||
```
|
||||
|
||||
### Medium Documents
|
||||
|
||||
For typical scanned document batches and office files up to 200 MB:
|
||||
|
||||
```bash title="Terminal"
|
||||
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=209715200
|
||||
kreuzberg serve -H 0.0.0.0 -p 8000
|
||||
```
|
||||
|
||||
### Large Scans and Archives
|
||||
|
||||
For high-resolution scans, video content, and large archives up to 1 GB:
|
||||
|
||||
```bash title="Terminal"
|
||||
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=1048576000
|
||||
kreuzberg serve -H 0.0.0.0 -p 8000
|
||||
```
|
||||
|
||||
### Constrained Environments
|
||||
|
||||
For development environments or memory-limited servers:
|
||||
|
||||
```bash title="Terminal"
|
||||
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=52428800
|
||||
kreuzberg serve -H 0.0.0.0 -p 8000
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Memory Usage
|
||||
|
||||
File size limits directly impact memory consumption:
|
||||
|
||||
- **Larger limits** require more RAM to buffer request bodies
|
||||
- **Streaming extraction** processes files incrementally, reducing peak memory
|
||||
- **Batch requests** with multiple files consume memory for all files simultaneously
|
||||
|
||||
#### Memory Impact Examples
|
||||
|
||||
| Upload Limit | Memory Impact | Recommended RAM |
|
||||
| ---------------- | ------------------------ | --------------- |
|
||||
| 50 MB | ~50-100 MB per request | 512 MB |
|
||||
| 100 MB (default) | ~100-200 MB per request | 1 GB |
|
||||
| 500 MB | ~500 MB-1 GB per request | 2-4 GB |
|
||||
| 1000 MB | ~1-2 GB per request | 4-8 GB |
|
||||
|
||||
### Handling Large Files
|
||||
|
||||
When processing very large files (multi-GB):
|
||||
|
||||
1. **Allocate adequate RAM** - Use the memory impact table above as a guideline
|
||||
2. **Increase timeouts** - Large files take longer to upload and process
|
||||
3. **Monitor concurrency** - Limit concurrent uploads to prevent memory exhaustion
|
||||
4. **Use streaming** - Where possible, process files streaming to reduce memory peaks
|
||||
|
||||
### Docker Memory Limits
|
||||
|
||||
Configure Docker resource limits appropriately:
|
||||
|
||||
```yaml title="docker-compose.yaml"
|
||||
services:
|
||||
kreuzberg-api:
|
||||
image: ghcr.io/kreuzberg-dev/kreuzberg:latest
|
||||
environment:
|
||||
KREUZBERG_MAX_MULTIPART_FIELD_BYTES: "524288000"
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
memory: 4G # Limit container to 4 GB
|
||||
cpus: "2" # Limit to 2 CPU cores
|
||||
reservations:
|
||||
memory: 2G # Reserve 2 GB minimum
|
||||
cpus: "1" # Reserve 1 CPU core
|
||||
```
|
||||
|
||||
### Reverse Proxy Configuration
|
||||
|
||||
When using a reverse proxy (Nginx, Caddy), ensure proxy limits match or exceed Kreuzberg's limits:
|
||||
|
||||
**Nginx:**
|
||||
|
||||
```nginx title="nginx.conf"
|
||||
server {
|
||||
listen 443 ssl http2;
|
||||
server_name api.example.com;
|
||||
|
||||
# Match or exceed Kreuzberg's limit
|
||||
client_max_body_size 500M;
|
||||
|
||||
location / {
|
||||
proxy_pass http://kreuzberg;
|
||||
# Extended timeouts for large file processing
|
||||
proxy_read_timeout 300s;
|
||||
proxy_send_timeout 300s;
|
||||
proxy_request_buffering off; # Stream instead of buffer
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Caddy:**
|
||||
|
||||
```caddy title="Caddyfile"
|
||||
api.example.com {
|
||||
reverse_proxy localhost:8000 {
|
||||
# Match Kreuzberg's limit
|
||||
max_body_size 500MB
|
||||
# Enable streaming for large files
|
||||
flush_interval -1
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Exceeding Limits
|
||||
|
||||
When a request exceeds configured limits, the server returns a 413 Payload Too Large error:
|
||||
|
||||
```bash title="Terminal"
|
||||
# Try to upload a 500 MB file with 100 MB default limit
|
||||
curl -F "files=@large_file_500mb.zip" http://localhost:8000/extract
|
||||
|
||||
# Response (HTTP 413)
|
||||
HTTP/1.1 413 Payload Too Large
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"error_type": "ValidationError",
|
||||
"message": "Request body exceeds maximum allowed size",
|
||||
"status_code": 413
|
||||
}
|
||||
```
|
||||
|
||||
### Client-Side Validation
|
||||
|
||||
Validate file sizes before upload to provide better user experience:
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
def validate_file_size(file_path: str, max_size_mb: int) -> bool:
|
||||
"""Check if file size is within limits."""
|
||||
file_size_bytes = os.path.getsize(file_path)
|
||||
file_size_mb = file_size_bytes / (1024 * 1024)
|
||||
|
||||
if file_size_mb > max_size_mb:
|
||||
print(f"File {Path(file_path).name} exceeds {max_size_mb} MB limit")
|
||||
return False
|
||||
return True
|
||||
|
||||
# Validate before upload
|
||||
if validate_file_size("document.pdf", max_size_mb=100):
|
||||
# Proceed with upload
|
||||
pass
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript
|
||||
function validateFileSize(file: File, maxSizeMB: number): boolean {
|
||||
const fileSizeMB = file.size / (1024 * 1024);
|
||||
|
||||
if (fileSizeMB > maxSizeMB) {
|
||||
console.error(`File ${file.name} exceeds ${maxSizeMB} MB limit`);
|
||||
return false;
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
// Validate before upload
|
||||
const fileInput = document.getElementById('fileInput') as HTMLInputElement;
|
||||
fileInput.addEventListener('change', (e) => {
|
||||
const file = (e.target as HTMLInputElement).files?.[0];
|
||||
if (file && validateFileSize(file, 100)) {
|
||||
// Proceed with upload
|
||||
}
|
||||
});
|
||||
```
|
||||
|
||||
=== "Go"
|
||||
|
||||
```go
|
||||
import "os"
|
||||
import "fmt"
|
||||
|
||||
func validateFileSize(filePath string, maxSizeMB int64) bool {
|
||||
fileInfo, err := os.Stat(filePath)
|
||||
if err != nil {
|
||||
return false
|
||||
}
|
||||
|
||||
fileSizeMB := fileInfo.Size() / (1024 * 1024)
|
||||
if fileSizeMB > maxSizeMB {
|
||||
fmt.Printf("File exceeds %d MB limit\n", maxSizeMB)
|
||||
return false
|
||||
}
|
||||
return true
|
||||
}
|
||||
|
||||
// Validate before upload
|
||||
if validateFileSize("document.pdf", 100) {
|
||||
// Proceed with upload
|
||||
}
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "Request body exceeds maximum allowed size"
|
||||
|
||||
**Problem:** Upload fails with HTTP 413 error
|
||||
|
||||
**Solutions:**
|
||||
|
||||
1. **Increase limit:**
|
||||
|
||||
```bash
|
||||
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=524288000
|
||||
```
|
||||
|
||||
2. **Check reverse proxy limits:**
|
||||
|
||||
```nginx
|
||||
# Nginx: ensure client_max_body_size matches or exceeds Kreuzberg limit
|
||||
client_max_body_size 500M;
|
||||
```
|
||||
|
||||
3. **Validate file size before upload:**
|
||||
|
||||
```bash
|
||||
# Check actual file size
|
||||
ls -lh document.pdf
|
||||
```
|
||||
|
||||
### Server crashes with large files
|
||||
|
||||
**Problem:** Memory exhaustion when processing large files
|
||||
|
||||
**Solutions:**
|
||||
|
||||
1. **Increase container memory:**
|
||||
|
||||
```yaml
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
memory: 4G
|
||||
```
|
||||
|
||||
2. **Reduce upload limit:**
|
||||
|
||||
```bash
|
||||
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=209715200
|
||||
```
|
||||
|
||||
3. **Process files sequentially:**
|
||||
- Send one file per request instead of batch uploads
|
||||
- Implement request queuing at the application level
|
||||
|
||||
4. **Monitor memory usage:**
|
||||
|
||||
```bash
|
||||
# Docker
|
||||
docker stats kreuzberg-api
|
||||
|
||||
# Kubernetes
|
||||
kubectl top pod kreuzberg-api-xxxxx
|
||||
```
|
||||
|
||||
### Slow uploads
|
||||
|
||||
**Problem:** Large file uploads timeout
|
||||
|
||||
**Solutions:**
|
||||
|
||||
1. **Increase reverse proxy timeouts:**
|
||||
|
||||
```nginx
|
||||
proxy_read_timeout 600s; # 10 minutes
|
||||
proxy_send_timeout 600s;
|
||||
```
|
||||
|
||||
2. **Enable streaming:**
|
||||
|
||||
```nginx
|
||||
proxy_request_buffering off;
|
||||
```
|
||||
|
||||
3. **Check network bandwidth:**
|
||||
- For a 500 MB file over a 10 Mbps connection: 500 MB × 8 bits/byte ÷ 10 Mbps = ~400 seconds
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Match limits to use case:** Set limits based on your actual file sizes, not theoretical maximums
|
||||
2. **Monitor and adjust:** Track actual file sizes and adjust limits quarterly
|
||||
3. **Use reverse proxy buffering:** Configure reverse proxies to handle buffering, not Kreuzberg
|
||||
4. **Implement client-side validation:** Validate file sizes before sending to server
|
||||
5. **Plan for scaling:** Run multiple Kreuzberg instances behind a load balancer for high-throughput scenarios
|
||||
6. **Set appropriate timeouts:** Increase timeouts for large files (5-10 minutes recommended)
|
||||
7. **Document your limits:** Keep configuration in version control with clear documentation
|
||||
8. **Test with real files:** Test with actual document types you'll process in production
|
||||
9. **Monitor disk space:** Large files consume both RAM and disk (if streaming to disk)
|
||||
10. **Consider compression:** If applicable, compress large document batches before upload
|
||||
|
||||
## See Also
|
||||
|
||||
- [Configuration Guide](../guides/configuration.md) - Extraction configuration options
|
||||
- [API Server Guide](../guides/api-server.md) - Complete API server documentation
|
||||
- [Docker Deployment](../guides/docker.md) - Docker setup and configuration
|
||||
- [Performance Tuning](../guides/api-server.md) - Advanced performance optimization
|
||||
Reference in New Issue
Block a user