fil/docs/guides/code-intelligence.md
Henrik Jess Nielsen b4c07d3693
2026-06-01 23:40:55 +02:00


# Code Intelligence
Kreuzberg integrates [tree-sitter-language-pack](https://docs.tree-sitter-language-pack.kreuzberg.dev) (TSLP) to parse and analyze source code files. When you extract a source code file, Kreuzberg automatically detects the programming language and produces structured analysis alongside the raw text content.
## What You Get
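When extracting source code, the `metadata.format` field contains a `ProcessResult` (format type `"code"`) with the sections below. `comments`, `docstrings`, `symbols`, and `diagnostics` are populated only when enabled in `TreeSitterConfig`: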
- **Structure** -- functions, classes, structs, methods, modules, and their nesting hierarchy
- **Imports** -- import/include/require statements with source paths and imported items
- **Exports** -- exported symbols with their kinds (function, class, variable, type, default)
- **Comments** -- inline and block comments with their positions
- **Docstrings** -- documentation comments with parsed sections (params, returns, etc.)
- **Symbols** -- variable, constant, and type alias definitions
- **Diagnostics** -- parse errors and warnings from tree-sitter
- **Chunks** -- semantically meaningful code chunks for RAG and embedding pipelines
- **Metrics** -- file-level statistics (lines of code, comment lines, empty lines, node count)
Language support covers **300+ programming languages** via tree-sitter grammars. See the [TSLP documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) for the full language list.
## Getting Started
Code intelligence is enabled by default when the `tree-sitter` feature flag is active. Extract a source code file and read the analysis from the result's metadata:
=== "Rust"

    ```rust title="basic.rs"
    use kreuzberg::{extract_file_sync, ExtractionConfig};

    let config = ExtractionConfig::default();
    let result = extract_file_sync("app.py", None, &config)?;

    // The content field has the raw source text
    println!("{}", result.content);

    // Code intelligence is in metadata.format
    if let Some(kreuzberg::types::FormatMetadata::Code(ref code)) = result.metadata.format {
        println!("Language: {}", code.language);
        println!("Structures: {}", code.structure.len());
        println!("Imports: {}", code.imports.len());
    }
    ```

=== "Python"

    ```python title="basic.py"
    import kreuzberg

    config = kreuzberg.ExtractionConfig()
    result = kreuzberg.extract_file_sync("app.py", config=config)

    # The content field has the raw source text
    print(result.content)

    # Code intelligence is in metadata["format"]
    fmt = result.metadata.get("format")
    if fmt and fmt.get("format_type") == "code":
        print(f"Language: {fmt['language']}")
        print(f"Structures: {len(fmt['structure'])}")
        print(f"Imports: {len(fmt['imports'])}")
    ```

=== "TypeScript"

    ```typescript title="basic.ts"
    import { extractFileSync } from "@kreuzberg/node";

    const result = extractFileSync("app.ts");
    console.log(result.content);

    const fmt = result.metadata?.format;
    if (fmt?.formatType === "code") {
      console.log(`Language: ${fmt.language}`);
      console.log(`Structures: ${fmt.structure.length}`);
      console.log(`Imports: ${fmt.imports.length}`);
    }
    ```

=== "Go"

    ```go title="basic.go"
    result, err := kreuzberg.ExtractFileSync("app.py", nil)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(result.Content)

    // Code intelligence is available in result.Metadata.Format
    // when Format.Type == "code"
    ```
## Configuration
Use `TreeSitterConfig` to control which analysis features are enabled. Set `enabled: false` to disable code intelligence entirely. By default, `structure`, `imports`, and `exports` are enabled; `comments`, `docstrings`, `symbols`, and `diagnostics` are disabled.
=== "Rust"

    ```rust title="config.rs"
    use kreuzberg::{ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

    let config = ExtractionConfig {
        tree_sitter: Some(TreeSitterConfig {
            process: TreeSitterProcessConfig {
                structure: true, // functions, classes, etc. (default: true)
                imports: true, // import statements (default: true)
                exports: true, // export statements (default: true)
                comments: true, // comments (default: false)
                docstrings: true, // docstrings (default: false)
                symbols: true, // variables, constants (default: false)
                diagnostics: true, // parse errors/warnings (default: false)
                chunk_max_size: Some(4096), // max chunk size in bytes
                ..Default::default()
            },
            ..Default::default()
        }),
        ..Default::default()
    };
    ```

=== "Python"

    ```python title="config.py"
    import kreuzberg

    config = kreuzberg.ExtractionConfig(
        tree_sitter={
            "process": {
                "structure": True,
                "imports": True,
                "exports": True,
                "comments": True,
                "docstrings": True,
                "symbols": True,
                "diagnostics": True,
                "chunk_max_size": 4096,
            }
        }
    )
    ```

=== "TypeScript"

    ```typescript title="config.ts"
    import { ExtractionConfig } from "@kreuzberg/node";

    const config: ExtractionConfig = {
      treeSitter: {
        process: {
          structure: true,
          imports: true,
          exports: true,
          comments: true,
          docstrings: true,
          symbols: true,
          diagnostics: true,
          chunkMaxSize: 4096,
        },
      },
    };
    ```

=== "TOML"

    ```toml title="kreuzberg.toml"
    [tree_sitter.process]
    structure = true
    imports = true
    exports = true
    comments = true
    docstrings = true
    symbols = true
    diagnostics = true
    chunk_max_size = 4096
    ```
### Configuration Fields
See [`TreeSitterConfig`](../reference/configuration.md#treesitterconfig) and [`TreeSitterProcessConfig`](../reference/configuration.md#treesitterprocessconfig) for all fields.
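
If you build configurations programmatically, it can help to start from the documented defaults and override only the flags you need. The sketch below is a hypothetical helper, not part of the Kreuzberg API: `PROCESS_DEFAULTS` and `build_tree_sitter_config` are names invented for illustration, and the default values simply mirror those stated in this guide. It returns a plain dict in the shape passed as `tree_sitter` in the Python example above.

```python
# Hypothetical helper, not part of Kreuzberg. Defaults copied from this guide:
# structure/imports/exports on, comments/docstrings/symbols/diagnostics off.
PROCESS_DEFAULTS = {
    "structure": True,
    "imports": True,
    "exports": True,
    "comments": False,
    "docstrings": False,
    "symbols": False,
    "diagnostics": False,
}


def build_tree_sitter_config(chunk_max_size=None, **flags):
    """Return a dict shaped like the `tree_sitter` argument shown in this guide."""
    unknown = set(flags) - set(PROCESS_DEFAULTS)
    if unknown:
        # Catch typos such as `docstring=True` before they reach the extractor.
        raise ValueError(f"unknown process flags: {sorted(unknown)}")
    process = {**PROCESS_DEFAULTS, **flags}
    if chunk_max_size is not None:
        if chunk_max_size <= 0:
            raise ValueError("chunk_max_size must be positive")
        process["chunk_max_size"] = chunk_max_size
    return {"process": process}


docs_config = build_tree_sitter_config(
    comments=True, docstrings=True, chunk_max_size=2048
)
# Pass the result on as in the Python example:
# kreuzberg.ExtractionConfig(tree_sitter=docs_config)
```

Rejecting unknown flag names up front is a design choice of this sketch; whether Kreuzberg itself rejects or ignores unknown keys is not covered here.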
## ProcessResult Fields
Code intelligence results are returned as a `ProcessResult` from the upstream [`tree-sitter-language-pack`](https://docs.rs/tree-sitter-language-pack) crate. Top-level fields: `language`, `metrics`, `structure`, `imports`, `exports`, `chunks`, plus `comments` / `docstrings` / `symbols` / `diagnostics` (populated only when their `TreeSitterProcessConfig` flag is on). See the upstream crate docs for full field shapes.
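
As a quick way to see which sections were populated for a given file, the following sketch counts the items in each top-level section. It assumes the Python binding exposes the result as a dict with the top-level field names listed above, that each section is list-valued, and that a disabled section may be missing, empty, or `None`. `summarize_code_metadata` is a hypothetical helper, not a Kreuzberg function.

```python
# Top-level list sections named in this guide.
LIST_SECTIONS = (
    "structure", "imports", "exports", "chunks",
    "comments", "docstrings", "symbols", "diagnostics",
)


def summarize_code_metadata(fmt):
    """Return the language plus an item count per section, or None if not code."""
    if not fmt or fmt.get("format_type") != "code":
        return None
    summary = {"language": fmt.get("language")}
    for section in LIST_SECTIONS:
        # Sections whose flag is off may be absent or empty; count both as 0.
        summary[section] = len(fmt.get(section) or [])
    return summary


# Hand-written sample standing in for result.metadata.get("format").
sample = {
    "format_type": "code",
    "language": "python",
    "structure": [{}, {}],
    "imports": [{}],
    "exports": [],
    "chunks": [{}],
}
summary = summarize_code_metadata(sample)
```

A zero count for `comments`, `docstrings`, `symbols`, or `diagnostics` usually means the corresponding flag was left at its default rather than that the file contains none.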
## Semantic Chunking for RAG
Code chunks produced by tree-sitter are semantically aware -- they split at function, class, and module boundaries rather than at fixed line counts. This makes them well suited to retrieval-augmented generation (RAG) pipelines:
```python title="rag_chunking.py"
import kreuzberg

config = kreuzberg.ExtractionConfig(
    tree_sitter={"process": {"chunk_max_size": 2048}}
)
result = kreuzberg.extract_file_sync("large_module.py", config=config)

# your_embedding_model and store_in_vector_db are placeholders for your own code
fmt = result.metadata.get("format")
if fmt and fmt.get("format_type") == "code":
    for chunk in fmt.get("chunks", []):
        # Each chunk is a semantically coherent piece of code
        embedding = your_embedding_model(chunk["content"])
        store_in_vector_db(
            text=chunk["content"],
            embedding=embedding,
            metadata={
                "language": chunk["language"],
                "start_line": chunk["span"]["start_line"],
                "parent": chunk.get("context", {}).get("parent_name"),
            },
        )
```
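
Separating record building from embedding makes the pipeline easier to test without a model or a vector store. The sketch below flattens chunks into plain records using only the chunk fields that appear in the example above (`content`, `language`, `span.start_line`, `context.parent_name`). `chunk_records` and its `min_chars` parameter are hypothetical, and the sample data is hand-written rather than real extractor output.

```python
def chunk_records(fmt, min_chars=1):
    """Flatten code chunks into records ready for embedding."""
    if not fmt or fmt.get("format_type") != "code":
        return []
    records = []
    for index, chunk in enumerate(fmt.get("chunks") or []):
        text = chunk.get("content", "")
        if len(text.strip()) < min_chars:
            continue  # skip empty or whitespace-only chunks
        span = chunk.get("span") or {}
        context = chunk.get("context") or {}
        records.append({
            "id": index,  # position of the chunk in the original list
            "text": text,
            "language": chunk.get("language"),
            "start_line": span.get("start_line"),
            "parent": context.get("parent_name"),
        })
    return records


# Hand-written sample standing in for result.metadata.get("format").
sample_fmt = {
    "format_type": "code",
    "chunks": [
        {"content": "def area(r):\n    return 3.14 * r * r\n",
         "language": "python",
         "span": {"start_line": 1},
         "context": {"parent_name": "geometry"}},
        {"content": "   \n", "language": "python", "span": {"start_line": 4}},
        {"content": "class Circle:\n    pass\n", "language": "python",
         "span": {"start_line": 6}},
    ],
}
records = chunk_records(sample_fmt)
```

Each record can then be passed to whatever embedding function and vector store you use, with `text` as the input and the remaining keys as metadata.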
## Language Detection
Kreuzberg detects the programming language in two ways:
1. **File extension** (fast path) -- when using `extract_file`, the extension is matched against 248 known language extensions
2. **Shebang line** (fallback) -- when using `extract_bytes` or when the extension is ambiguous, the first line is checked for `#!/usr/bin/env python`, `#!/bin/bash`, and so on.
If neither method identifies the language, extraction returns an `UnsupportedFormat` error.
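
The two-step lookup described above can be sketched as follows. This is an illustration of the idea, not Kreuzberg's implementation: the extension and interpreter tables are tiny hypothetical samples, `detect_language` is an invented name, the `UnsupportedFormat` class is a local stand-in for the error mentioned above, and an unknown or missing extension is treated as the trigger for the shebang fallback.

```python
from pathlib import PurePath, PurePosixPath


class UnsupportedFormat(Exception):
    """Local stand-in for the error named in this guide."""


# Tiny illustrative tables; the real extension table is far larger.
EXTENSIONS = {".py": "python", ".rs": "rust", ".ts": "typescript", ".go": "go"}
INTERPRETERS = {"python": "python", "python3": "python",
                "bash": "bash", "node": "javascript"}


def detect_language(path=None, first_line=""):
    """Try the file extension first, then fall back to the shebang line."""
    if path:
        language = EXTENSIONS.get(PurePath(path).suffix.lower())
        if language:
            return language
    if first_line.startswith("#!"):
        tokens = first_line[2:].split()
        if tokens:
            interpreter = PurePosixPath(tokens[0]).name
            if interpreter == "env":
                # `#!/usr/bin/env python` names the interpreter after `env`;
                # skip option tokens such as `-S`.
                args = [t for t in tokens[1:] if not t.startswith("-")]
                interpreter = args[0] if args else ""
            language = INTERPRETERS.get(interpreter)
            if language:
                return language
    raise UnsupportedFormat(f"cannot detect language for {path!r}")
```

For example, `detect_language("app.py")` resolves through the extension table, while `detect_language(None, "#!/bin/bash")` resolves through the shebang fallback.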
## Language Support
Tree-sitter-language-pack supports 300+ programming languages. For the full list, see the [TSLP language reference](https://docs.tree-sitter-language-pack.kreuzberg.dev).
Common languages with full structural analysis:
| Language | Structure | Imports | Exports | Docstrings |
| ---------- | --------- | ------- | ------- | ---------- |
| Python | Yes | Yes | Yes | Yes |
| Rust | Yes | Yes | Yes | Yes |
| TypeScript | Yes | Yes | Yes | Yes |
| JavaScript | Yes | Yes | Yes | Yes |
| Go | Yes | Yes | Yes | Yes |
| Java | Yes | Yes | Yes | Yes |
| C/C++ | Yes | Yes | Yes | Yes |
| Ruby | Yes | Yes | Yes | Yes |
| PHP | Yes | Yes | Yes | Yes |
| C# | Yes | Yes | Yes | Yes |
| Swift | Yes | Yes | Yes | Yes |
| Kotlin | Yes | Yes | Yes | Yes |
| Elixir | Yes | Yes | Yes | Yes |
## Related Documentation
- [Configuration Reference](../reference/configuration.md#treesitterconfig) -- TreeSitterConfig and TreeSitterProcessConfig fields
- [Types Reference](../reference/types.md) -- ProcessResult, StructureItem, CodeChunk, and related type definitions
- [tree-sitter-language-pack documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) -- Full language support reference