Files
fil/docs/guides/keywords.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

106 lines
3.3 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Keyword Extraction
Extract ranked keywords from document text using YAKE or RAKE algorithms.
| Algorithm | Scoring | Best for |
| --------- | ---------------------------------------- | --------------------------------------------- |
| **YAKE** | Lower score = more relevant (0.01.0) | General documents, single terms, multilingual |
| **RAKE** | Higher score = more relevant (unbounded) | Multi-word phrases, technical docs |
## Quick Start
=== "Python"
--8<-- "snippets/python/utils/keyword_extraction_example.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/keyword_extraction_example.md"
=== "Rust"
--8<-- "snippets/rust/advanced/keyword_extraction_example.md"
=== "Go"
--8<-- "snippets/go/utils/keyword_extraction_example.md"
=== "Java"
--8<-- "snippets/java/utils/keyword_extraction_example.md"
=== "C#"
--8<-- "snippets/csharp/advanced/keyword_extraction_example.md"
=== "Ruby"
--8<-- "snippets/ruby/utils/keyword_extraction_example.md"
Keywords are returned in `result.extracted_keywords` as objects with `text` and `score` fields.
## Configuration
See [KeywordConfig reference](../reference/configuration.md#keywordconfig) for all configuration options.
=== "Python"
--8<-- "snippets/python/config/keyword_extraction_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/keyword_extraction_config.md"
=== "Rust"
--8<-- "snippets/rust/config/keyword_extraction_config.md"
=== "Go"
--8<-- "snippets/go/config/keyword_extraction_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/keyword_extraction_config.md"
=== "R"
--8<-- "snippets/r/config/keyword_extraction_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/keyword_extraction_config.md"
## YAKE Score Tuning
Use `min_score` as upper bound. Lower YAKE scores = higher relevance:
| `min_score` | Effect |
| ----------- | ------------------- |
| `0.5` | Keeps most keywords |
| `0.3` | Main topics only |
| `0.1` | Core concepts only |
`yake_params.window_size` controls co-occurrence context: `12` for narrow domains, `23` for general (default: `2`), `34` for discussion-heavy content.
## RAKE Score Tuning
Use `min_score` as lower bound. Higher RAKE scores = higher relevance:
| `min_score` | Effect |
| ----------- | ---------------------------- |
| `0.1` | Keeps most keywords |
| `5.0` | Main phrases only |
| `20.0` | Only highly specific phrases |
`rake_params.min_word_length` (default: `1`) and `rake_params.max_words_per_phrase` (default: `3`) control phrase boundaries.
## Troubleshooting
- **Too few keywords** — Lower `min_score`, check `result.content` is non-empty, set `language` to match the document or `None` to disable stopword filtering
- **Too many irrelevant keywords** — Raise `min_score`, set `language` for stopword filtering, reduce `ngram_range` upper bound
- **Multi-word phrases missing (YAKE)** — Switch to RAKE or confirm `ngram_range` upper bound is >= 2
- **Keywords don't match content** — Verify text was extracted (`result.content`) and `language` matches the document
See the [KeywordConfig reference](../reference/configuration.md#keywordconfig) for the full parameter list.