Files
fil/docs/guides/keywords.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

3.3 KiB
Raw Blame History

Keyword Extraction

Extract ranked keywords from document text using YAKE or RAKE algorithms.

Algorithm Scoring Best for
YAKE Lower score = more relevant (0.01.0) General documents, single terms, multilingual
RAKE Higher score = more relevant (unbounded) Multi-word phrases, technical docs

Quick Start

=== "Python"

--8<-- "snippets/python/utils/keyword_extraction_example.md"

=== "TypeScript"

--8<-- "snippets/typescript/utils/keyword_extraction_example.md"

=== "Rust"

--8<-- "snippets/rust/advanced/keyword_extraction_example.md"

=== "Go"

--8<-- "snippets/go/utils/keyword_extraction_example.md"

=== "Java"

--8<-- "snippets/java/utils/keyword_extraction_example.md"

=== "C#"

--8<-- "snippets/csharp/advanced/keyword_extraction_example.md"

=== "Ruby"

--8<-- "snippets/ruby/utils/keyword_extraction_example.md"

Keywords are returned in result.extracted_keywords as objects with text and score fields.

Configuration

See KeywordConfig reference for all configuration options.

=== "Python"

--8<-- "snippets/python/config/keyword_extraction_config.md"

=== "TypeScript"

--8<-- "snippets/typescript/config/keyword_extraction_config.md"

=== "Rust"

--8<-- "snippets/rust/config/keyword_extraction_config.md"

=== "Go"

--8<-- "snippets/go/config/keyword_extraction_config.md"

=== "Ruby"

--8<-- "snippets/ruby/config/keyword_extraction_config.md"

=== "R"

--8<-- "snippets/r/config/keyword_extraction_config.md"

=== "C#"

--8<-- "snippets/csharp/advanced/keyword_extraction_config.md"

YAKE Score Tuning

Use min_score as upper bound. Lower YAKE scores = higher relevance:

min_score Effect
0.5 Keeps most keywords
0.3 Main topics only
0.1 Core concepts only

yake_params.window_size controls co-occurrence context: 12 for narrow domains, 23 for general (default: 2), 34 for discussion-heavy content.

RAKE Score Tuning

Use min_score as lower bound. Higher RAKE scores = higher relevance:

min_score Effect
0.1 Keeps most keywords
5.0 Main phrases only
20.0 Only highly specific phrases

rake_params.min_word_length (default: 1) and rake_params.max_words_per_phrase (default: 3) control phrase boundaries.

Troubleshooting

  • Too few keywords — Lower min_score, check result.content is non-empty, set language to match the document or None to disable stopword filtering
  • Too many irrelevant keywords — Raise min_score, set language for stopword filtering, reduce ngram_range upper bound
  • Multi-word phrases missing (YAKE) — Switch to RAKE or confirm ngram_range upper bound is >= 2
  • Keywords don't match content — Verify text was extracted (result.content) and language matches the document

See the KeywordConfig reference for the full parameter list.