This commit is contained in:
249
docs/guides/code-intelligence.md
Normal file
249
docs/guides/code-intelligence.md
Normal file
@@ -0,0 +1,249 @@
|
||||
# Code Intelligence
|
||||
|
||||
Kreuzberg integrates [tree-sitter-language-pack](https://docs.tree-sitter-language-pack.kreuzberg.dev) (TSLP) to parse and analyze source code files. When you extract a source code file, Kreuzberg automatically detects the programming language and produces structured analysis alongside the raw text content.
|
||||
|
||||
## What You Get
|
||||
|
||||
When extracting source code, the `metadata.format` field contains a `ProcessResult` (format type `"code"`) with:
|
||||
|
||||
- **Structure** -- functions, classes, structs, methods, modules, and their nesting hierarchy
|
||||
- **Imports** -- import/include/require statements with source paths and imported items
|
||||
- **Exports** -- exported symbols with their kinds (function, class, variable, type, default)
|
||||
- **Comments** -- inline and block comments with their positions
|
||||
- **Docstrings** -- documentation comments with parsed sections (params, returns, etc.)
|
||||
- **Symbols** -- variable, constant, and type alias definitions
|
||||
- **Diagnostics** -- parse errors and warnings from tree-sitter
|
||||
- **Chunks** -- semantically meaningful code chunks for RAG and embedding pipelines
|
||||
- **Metrics** -- file-level statistics (lines of code, comment lines, empty lines, node count)
|
||||
|
||||
Language support covers **300+ programming languages** via tree-sitter grammars. See the [TSLP documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) for the full language list.
|
||||
|
||||
## Getting Started
|
||||
|
||||
Code intelligence is enabled by default when the `tree-sitter` feature flag is active. Simply extract a source code file:
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="basic.rs"
|
||||
use kreuzberg::{extract_file_sync, ExtractionConfig};
|
||||
|
||||
let config = ExtractionConfig::default();
|
||||
let result = extract_file_sync("app.py", None, &config)?;
|
||||
|
||||
// The content field has the raw source text
|
||||
println!("{}", result.content);
|
||||
|
||||
// Code intelligence is in metadata.format
|
||||
if let Some(kreuzberg::types::FormatMetadata::Code(ref code)) = result.metadata.format {
|
||||
println!("Language: {}", code.language);
|
||||
println!("Structures: {}", code.structure.len());
|
||||
println!("Imports: {}", code.imports.len());
|
||||
}
|
||||
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python title="basic.py"
|
||||
import kreuzberg
|
||||
|
||||
config = kreuzberg.ExtractionConfig()
|
||||
result = kreuzberg.extract_file_sync("app.py", config=config)
|
||||
|
||||
# The content field has the raw source text
|
||||
print(result.content)
|
||||
|
||||
# Code intelligence is in metadata["format"]
|
||||
fmt = result.metadata.get("format")
|
||||
if fmt and fmt.get("format_type") == "code":
|
||||
print(f"Language: {fmt['language']}")
|
||||
print(f"Structures: {len(fmt['structure'])}")
|
||||
print(f"Imports: {len(fmt['imports'])}")
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript title="basic.ts"
|
||||
import { extractFileSync } from "@kreuzberg/node";
|
||||
|
||||
const result = extractFileSync("app.ts");
|
||||
|
||||
console.log(result.content);
|
||||
|
||||
const fmt = result.metadata?.format;
|
||||
if (fmt?.formatType === "code") {
|
||||
console.log(`Language: ${fmt.language}`);
|
||||
console.log(`Structures: ${fmt.structure.length}`);
|
||||
console.log(`Imports: ${fmt.imports.length}`);
|
||||
}
|
||||
```
|
||||
|
||||
=== "Go"
|
||||
|
||||
```go title="basic.go"
|
||||
result, err := kreuzberg.ExtractFileSync("app.py", nil)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
fmt.Println(result.Content)
|
||||
// Code intelligence is available in result.Metadata.Format
|
||||
// when Format.Type == "code"
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
Use `TreeSitterConfig` to control which analysis features are enabled. Set `enabled: false` to disable code intelligence entirely. By default, `structure`, `imports`, and `exports` are enabled; `comments`, `docstrings`, `symbols`, and `diagnostics` are disabled.
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="config.rs"
|
||||
use kreuzberg::{ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};
|
||||
|
||||
let config = ExtractionConfig {
|
||||
tree_sitter: Some(TreeSitterConfig {
|
||||
process: TreeSitterProcessConfig {
|
||||
structure: true, // functions, classes, etc. (default: true)
|
||||
imports: true, // import statements (default: true)
|
||||
exports: true, // export statements (default: true)
|
||||
comments: true, // comments (default: false)
|
||||
docstrings: true, // docstrings (default: false)
|
||||
symbols: true, // variables, constants (default: false)
|
||||
diagnostics: true, // parse errors/warnings (default: false)
|
||||
chunk_max_size: Some(4096), // max chunk size in bytes
|
||||
..Default::default()
|
||||
},
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python title="config.py"
|
||||
import kreuzberg
|
||||
|
||||
config = kreuzberg.ExtractionConfig(
|
||||
tree_sitter={
|
||||
"process": {
|
||||
"structure": True,
|
||||
"imports": True,
|
||||
"exports": True,
|
||||
"comments": True,
|
||||
"docstrings": True,
|
||||
"symbols": True,
|
||||
"diagnostics": True,
|
||||
"chunk_max_size": 4096,
|
||||
}
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript title="config.ts"
|
||||
import { ExtractionConfig } from "@kreuzberg/node";
|
||||
|
||||
const config: ExtractionConfig = {
|
||||
treeSitter: {
|
||||
process: {
|
||||
structure: true,
|
||||
imports: true,
|
||||
exports: true,
|
||||
comments: true,
|
||||
docstrings: true,
|
||||
symbols: true,
|
||||
diagnostics: true,
|
||||
chunkMaxSize: 4096,
|
||||
},
|
||||
},
|
||||
};
|
||||
```
|
||||
|
||||
=== "TOML"
|
||||
|
||||
```toml title="kreuzberg.toml"
|
||||
[tree_sitter.process]
|
||||
structure = true
|
||||
imports = true
|
||||
exports = true
|
||||
comments = true
|
||||
docstrings = true
|
||||
symbols = true
|
||||
diagnostics = true
|
||||
chunk_max_size = 4096
|
||||
```
|
||||
|
||||
### Configuration Fields
|
||||
|
||||
See [`TreeSitterConfig`](../reference/configuration.md#treesitterconfig) and [`TreeSitterProcessConfig`](../reference/configuration.md#treesitterprocessconfig) for all fields.
|
||||
|
||||
## ProcessResult Fields
|
||||
|
||||
Code intelligence results are returned as a `ProcessResult` from the upstream [`tree-sitter-language-pack`](https://docs.rs/tree-sitter-language-pack) crate. Top-level fields: `language`, `metrics`, `structure`, `imports`, `exports`, `chunks`, plus `comments` / `docstrings` / `symbols` / `diagnostics` (populated only when their `TreeSitterProcessConfig` flag is on). See the upstream crate docs for full field shapes.
|
||||
|
||||
## Semantic Chunking for RAG
|
||||
|
||||
Code chunks produced by tree-sitter are semantically aware -- they split at function, class, and module boundaries rather than fixed line counts. This makes them ideal for retrieval-augmented generation (RAG) pipelines:
|
||||
|
||||
```python title="rag_chunking.py"
|
||||
import kreuzberg
|
||||
|
||||
config = kreuzberg.ExtractionConfig(
|
||||
tree_sitter={"process": {"chunk_max_size": 2048}}
|
||||
)
|
||||
|
||||
result = kreuzberg.extract_file_sync("large_module.py", config=config)
|
||||
|
||||
fmt = result.metadata.get("format")
|
||||
if fmt and fmt.get("format_type") == "code":
|
||||
for chunk in fmt.get("chunks", []):
|
||||
# Each chunk is a semantically coherent piece of code
|
||||
embedding = your_embedding_model(chunk["content"])
|
||||
store_in_vector_db(
|
||||
text=chunk["content"],
|
||||
embedding=embedding,
|
||||
metadata={
|
||||
"language": chunk["language"],
|
||||
"start_line": chunk["span"]["start_line"],
|
||||
"parent": chunk.get("context", {}).get("parent_name"),
|
||||
},
|
||||
)
|
||||
```
|
||||
|
||||
## Language Detection
|
||||
|
||||
Kreuzberg detects the programming language in two ways:
|
||||
|
||||
1. **File extension** (fast path) -- when using `extract_file`, the extension is matched against 248 known language extensions
|
||||
2. **Shebang line** (fallback) -- when using `extract_bytes` or when the extension is ambiguous, the first line is checked for `#!/usr/bin/env python`, `#!/bin/bash`, and so on.
|
||||
|
||||
If neither method identifies the language, extraction returns an `UnsupportedFormat` error.
|
||||
|
||||
## Language Support
|
||||
|
||||
Tree-sitter-language-pack supports 300+ programming languages. For the full list, see the [TSLP language reference](https://docs.tree-sitter-language-pack.kreuzberg.dev).
|
||||
|
||||
Common languages with full structural analysis:
|
||||
|
||||
| Language | Structure | Imports | Exports | Docstrings |
|
||||
| ---------- | --------- | ------- | ------- | ---------- |
|
||||
| Python | Yes | Yes | Yes | Yes |
|
||||
| Rust | Yes | Yes | Yes | Yes |
|
||||
| TypeScript | Yes | Yes | Yes | Yes |
|
||||
| JavaScript | Yes | Yes | Yes | Yes |
|
||||
| Go | Yes | Yes | Yes | Yes |
|
||||
| Java | Yes | Yes | Yes | Yes |
|
||||
| C/C++ | Yes | Yes | Yes | Yes |
|
||||
| Ruby | Yes | Yes | Yes | Yes |
|
||||
| PHP | Yes | Yes | Yes | Yes |
|
||||
| C# | Yes | Yes | Yes | Yes |
|
||||
| Swift | Yes | Yes | Yes | Yes |
|
||||
| Kotlin | Yes | Yes | Yes | Yes |
|
||||
| Elixir | Yes | Yes | Yes | Yes |
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Configuration Reference](../reference/configuration.md#treesitterconfig) -- TreeSitterConfig and TreeSitterProcessConfig fields
|
||||
- [Types Reference](../reference/types.md) -- ProcessResult, StructureItem, CodeChunk, and related type definitions
|
||||
- [tree-sitter-language-pack documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) -- Full language support reference
|
||||
Reference in New Issue
Block a user