# Code Intelligence

Kreuzberg integrates [tree-sitter-language-pack](https://docs.tree-sitter-language-pack.kreuzberg.dev) (TSLP) to parse and analyze source code files. When you extract a source code file, Kreuzberg automatically detects the programming language and produces structured analysis alongside the raw text content.

## What You Get

When extracting source code, the `metadata.format` field contains a `ProcessResult` (format type `"code"`) with:

- **Structure** -- functions, classes, structs, methods, modules, and their nesting hierarchy
- **Imports** -- import/include/require statements with source paths and imported items
- **Exports** -- exported symbols with their kinds (function, class, variable, type, default)
- **Comments** -- inline and block comments with their positions
- **Docstrings** -- documentation comments with parsed sections (params, returns, etc.)
- **Symbols** -- variable, constant, and type alias definitions
- **Diagnostics** -- parse errors and warnings from tree-sitter
- **Chunks** -- semantically meaningful code chunks for RAG and embedding pipelines
- **Metrics** -- file-level statistics (lines of code, comment lines, empty lines, node count)

Language support covers **300+ programming languages** via tree-sitter grammars. See the [TSLP documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) for the full language list.

## Getting Started

Code intelligence is enabled by default when the `tree-sitter` feature flag is active. To use it, extract a source code file:

=== "Rust"

    ```rust title="basic.rs"
    use kreuzberg::{extract_file_sync, ExtractionConfig};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let config = ExtractionConfig::default();
        let result = extract_file_sync("app.py", None, &config)?;

        // The content field has the raw source text
        println!("{}", result.content);

        // Code intelligence is in metadata.format
        if let Some(kreuzberg::types::FormatMetadata::Code(ref code)) = result.metadata.format {
            println!("Language: {}", code.language);
            println!("Structures: {}", code.structure.len());
            println!("Imports: {}", code.imports.len());
        }

        Ok(())
    }
    ```

=== "Python"

    ```python title="basic.py"
    import kreuzberg

    config = kreuzberg.ExtractionConfig()
    result = kreuzberg.extract_file_sync("app.py", config=config)

    # The content field has the raw source text
    print(result.content)

    # Code intelligence is in metadata["format"]
    fmt = result.metadata.get("format")
    if fmt and fmt.get("format_type") == "code":
        print(f"Language: {fmt['language']}")
        print(f"Structures: {len(fmt['structure'])}")
        print(f"Imports: {len(fmt['imports'])}")
    ```

=== "TypeScript"

    ```typescript title="basic.ts"
    import { extractFileSync } from "@kreuzberg/node";

    const result = extractFileSync("app.ts");

    // The content field has the raw source text
    console.log(result.content);

    // Code intelligence is in metadata.format
    const fmt = result.metadata?.format;
    if (fmt?.formatType === "code") {
      console.log(`Language: ${fmt.language}`);
      console.log(`Structures: ${fmt.structure.length}`);
      console.log(`Imports: ${fmt.imports.length}`);
    }
    ```

=== "Go"

    ```go title="basic.go"
    result, err := kreuzberg.ExtractFileSync("app.py", nil)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(result.Content)
    // Code intelligence is available in result.Metadata.Format
    // when Format.Type == "code"
    ```

## Configuration

Use `TreeSitterConfig` to control which analysis features are enabled. Set `enabled: false` to disable code intelligence entirely. By default, `structure`, `imports`, and `exports` are enabled; `comments`, `docstrings`, `symbols`, and `diagnostics` are disabled.

=== "Rust"

    ```rust title="config.rs"
    use kreuzberg::{ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

    let config = ExtractionConfig {
        tree_sitter: Some(TreeSitterConfig {
            process: TreeSitterProcessConfig {
                structure: true,            // functions, classes, etc. (default: true)
                imports: true,              // import statements (default: true)
                exports: true,              // export statements (default: true)
                comments: true,             // comments (default: false)
                docstrings: true,           // docstrings (default: false)
                symbols: true,              // variables, constants (default: false)
                diagnostics: true,          // parse errors/warnings (default: false)
                chunk_max_size: Some(4096), // max chunk size in bytes
                ..Default::default()
            },
            ..Default::default()
        }),
        ..Default::default()
    };
    ```

=== "Python"

    ```python title="config.py"
    import kreuzberg

    config = kreuzberg.ExtractionConfig(
        tree_sitter={
            "process": {
                "structure": True,
                "imports": True,
                "exports": True,
                "comments": True,
                "docstrings": True,
                "symbols": True,
                "diagnostics": True,
                "chunk_max_size": 4096,
            }
        }
    )
    ```

=== "TypeScript"

    ```typescript title="config.ts"
    import { ExtractionConfig } from "@kreuzberg/node";

    const config: ExtractionConfig = {
      treeSitter: {
        process: {
          structure: true,
          imports: true,
          exports: true,
          comments: true,
          docstrings: true,
          symbols: true,
          diagnostics: true,
          chunkMaxSize: 4096,
        },
      },
    };
    ```

=== "TOML"

    ```toml title="kreuzberg.toml"
    [tree_sitter.process]
    structure = true
    imports = true
    exports = true
    comments = true
    docstrings = true
    symbols = true
    diagnostics = true
    chunk_max_size = 4096
    ```

### Configuration Fields

See [`TreeSitterConfig`](../reference/configuration.md#treesitterconfig) and [`TreeSitterProcessConfig`](../reference/configuration.md#treesitterprocessconfig) for all fields.

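When only one or two optional features are needed, it can be convenient to build the `process` dictionary from the defaults listed above rather than spelling out every flag. The following is a minimal sketch: `build_process_config` and `DEFAULT_PROCESS_FLAGS` are hypothetical helpers, not part of Kreuzberg, and the default values are copied from this page rather than read from the library.

```python
# Hypothetical helper: flag names and defaults are taken from this guide.
DEFAULT_PROCESS_FLAGS = {
    "structure": True,
    "imports": True,
    "exports": True,
    "comments": False,
    "docstrings": False,
    "symbols": False,
    "diagnostics": False,
}


def build_process_config(extra_features=(), chunk_max_size=None):
    """Return a `process` dict with the documented defaults plus extra features."""
    unknown = set(extra_features) - set(DEFAULT_PROCESS_FLAGS)
    if unknown:
        raise ValueError(f"unknown feature flags: {sorted(unknown)}")

    process = dict(DEFAULT_PROCESS_FLAGS)
    for name in extra_features:
        process[name] = True

    # Only set the chunk size when the caller asks for one.
    if chunk_max_size is not None:
        process["chunk_max_size"] = chunk_max_size
    return process


tree_sitter = {
    "process": build_process_config(["docstrings", "diagnostics"], chunk_max_size=2048)
}
```

The resulting `tree_sitter` dictionary has the same shape as the one passed to `kreuzberg.ExtractionConfig(tree_sitter=...)` in the Python tab above.
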
## ProcessResult Fields

Code intelligence results are returned as a `ProcessResult` from the upstream [`tree-sitter-language-pack`](https://docs.rs/tree-sitter-language-pack) crate. Top-level fields: `language`, `metrics`, `structure`, `imports`, `exports`, `chunks`, plus `comments` / `docstrings` / `symbols` / `diagnostics` (populated only when their `TreeSitterProcessConfig` flag is on). See the upstream crate docs for full field shapes.

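Because several of these fields are populated only when their flag is on, code that reads them should tolerate missing or empty values. The sketch below counts the entries in each top-level list field; it assumes the Python binding exposes the result as a plain dictionary, as in the Getting Started example, and `summarize_code_metadata` is a hypothetical helper name. The sample dictionary is made up for illustration.

```python
# Top-level list fields named in this guide.
TOP_LEVEL_LISTS = (
    "structure", "imports", "exports", "chunks",
    "comments", "docstrings", "symbols", "diagnostics",
)


def summarize_code_metadata(fmt):
    """Return the language plus an item count per list field, or None for non-code."""
    if not fmt or fmt.get("format_type") != "code":
        return None

    summary = {"language": fmt.get("language")}
    for field in TOP_LEVEL_LISTS:
        # Optional fields may be absent or None when their flag is off.
        summary[field] = len(fmt.get(field) or [])
    return summary


# Illustrative input only; real item shapes are defined by the upstream crate.
sample = {
    "format_type": "code",
    "language": "python",
    "structure": [{}, {}],
    "imports": [{}],
    "exports": [],
    "chunks": [{}, {}, {}],
    "comments": None,
}
summary = summarize_code_metadata(sample)
```
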
## Semantic Chunking for RAG

Code chunks produced by tree-sitter are semantically aware -- they split at function, class, and module boundaries rather than at fixed line counts. This makes them well suited to retrieval-augmented generation (RAG) pipelines:

```python title="rag_chunking.py"
import kreuzberg

config = kreuzberg.ExtractionConfig(
    tree_sitter={"process": {"chunk_max_size": 2048}}
)

result = kreuzberg.extract_file_sync("large_module.py", config=config)

# your_embedding_model and store_in_vector_db are placeholders for your own code
fmt = result.metadata.get("format")
if fmt and fmt.get("format_type") == "code":
    for chunk in fmt.get("chunks", []):
        # Each chunk is a semantically coherent piece of code
        embedding = your_embedding_model(chunk["content"])
        store_in_vector_db(
            text=chunk["content"],
            embedding=embedding,
            metadata={
                "language": chunk["language"],
                "start_line": chunk["span"]["start_line"],
                # Tolerate a missing or null context for top-level chunks
                "parent": (chunk.get("context") or {}).get("parent_name"),
            },
        )
```

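To try the pipeline without an embedding model or vector database, the chunk handling can be separated into a pure function. This is a minimal sketch under the assumption that chunks carry the fields used in the example above (`content`, `language`, `span.start_line`, `context.parent_name`); `chunks_to_records` and the record layout are hypothetical, and the sample data is invented.

```python
def chunks_to_records(fmt):
    """Convert code chunks into plain records ready for embedding and storage."""
    records = []
    if not fmt or fmt.get("format_type") != "code":
        return records

    for index, chunk in enumerate(fmt.get("chunks") or []):
        # Top-level chunks may have no enclosing context.
        context = chunk.get("context") or {}
        span = chunk.get("span") or {}
        records.append(
            {
                "id": index,
                "text": chunk["content"],
                "metadata": {
                    "language": chunk.get("language"),
                    "start_line": span.get("start_line"),
                    "parent": context.get("parent_name"),
                },
            }
        )
    return records


# Illustrative input only.
sample_fmt = {
    "format_type": "code",
    "chunks": [
        {"content": "def a(): ...", "language": "python",
         "span": {"start_line": 1}, "context": None},
        {"content": "def b(self): ...", "language": "python",
         "span": {"start_line": 10}, "context": {"parent_name": "Service"}},
    ],
}
records = chunks_to_records(sample_fmt)
```

Each record can then be passed to whatever embedding and storage functions the pipeline uses.
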
## Language Detection

Kreuzberg detects the programming language in two ways:

1. **File extension** (fast path) -- when using `extract_file`, the extension is matched against 248 known language extensions.
2. **Shebang line** (fallback) -- when using `extract_bytes` or when the extension is ambiguous, the first line is checked for `#!/usr/bin/env python`, `#!/bin/bash`, and so on.

If neither method identifies the language, extraction returns an `UnsupportedFormat` error.

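The two-step lookup described above can be sketched in a few lines of Python. This illustrates the general technique only and is not Kreuzberg's implementation: the lookup tables are tiny, hypothetical stand-ins for the real mappings, and the sketch returns `None` where Kreuzberg would return an `UnsupportedFormat` error.

```python
from pathlib import Path
from typing import Optional

# Hypothetical, deliberately small tables for illustration.
EXTENSION_TO_LANGUAGE = {
    ".py": "python", ".rs": "rust", ".ts": "typescript", ".go": "go", ".sh": "bash",
}
INTERPRETER_TO_LANGUAGE = {
    "python": "python", "python3": "python", "bash": "bash", "sh": "bash",
    "node": "javascript",
}


def language_from_shebang(first_line: str) -> Optional[str]:
    """Map a shebang such as '#!/usr/bin/env python' to a language name."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].strip().split()
    if not parts:
        return None
    interpreter = Path(parts[0]).name
    if interpreter == "env":
        # With env, the real interpreter is the first non-option argument.
        args = [part for part in parts[1:] if not part.startswith("-")]
        if not args:
            return None
        interpreter = args[0]
    return INTERPRETER_TO_LANGUAGE.get(interpreter)


def detect_language(source: str, path: Optional[str] = None) -> Optional[str]:
    """Try the file extension first, then fall back to the shebang line."""
    if path is not None:
        language = EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
        if language is not None:
            return language
    first_line = source.split("\n", 1)[0]
    return language_from_shebang(first_line)
```

The extension check comes first because it needs no access to the file contents; the shebang check only runs when there is no path or the extension is not recognized.
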
## Language Support

Tree-sitter-language-pack supports 300+ programming languages. For the full list, see the [TSLP language reference](https://docs.tree-sitter-language-pack.kreuzberg.dev).

Common languages with full structural analysis:

| Language   | Structure | Imports | Exports | Docstrings |
| ---------- | --------- | ------- | ------- | ---------- |
| Python     | Yes       | Yes     | Yes     | Yes        |
| Rust       | Yes       | Yes     | Yes     | Yes        |
| TypeScript | Yes       | Yes     | Yes     | Yes        |
| JavaScript | Yes       | Yes     | Yes     | Yes        |
| Go         | Yes       | Yes     | Yes     | Yes        |
| Java       | Yes       | Yes     | Yes     | Yes        |
| C/C++      | Yes       | Yes     | Yes     | Yes        |
| Ruby       | Yes       | Yes     | Yes     | Yes        |
| PHP        | Yes       | Yes     | Yes     | Yes        |
| C#         | Yes       | Yes     | Yes     | Yes        |
| Swift      | Yes       | Yes     | Yes     | Yes        |
| Kotlin     | Yes       | Yes     | Yes     | Yes        |
| Elixir     | Yes       | Yes     | Yes     | Yes        |

## Related Documentation

- [Configuration Reference](../reference/configuration.md#treesitterconfig) -- TreeSitterConfig and TreeSitterProcessConfig fields
- [Types Reference](../reference/types.md) -- ProcessResult, StructureItem, CodeChunk, and related type definitions
- [tree-sitter-language-pack documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) -- Full language support reference