Scripts API

Note

This page is generated automatically from the repository’s maintained Python module inventory.

Operational, evaluation, and benchmarking scripts shipped with the repository.

scripts

Source: scripts/__init__.py

Repository-maintained operational, benchmarking, and evaluation scripts.

scripts.backfill_regulatory_level

Source: scripts/backfill_regulatory_level.py

Backfill Lamfalussy level for all acts in the database.

Uses the level field (1=L1, 2=L2, 3=L3) with inference from act metadata.

Usage:

python scripts/backfill_regulatory_level.py [--dry-run]

scripts.backfill_regulatory_level.main()[source]

Infer and persist missing Lamfalussy levels for stored acts.

Return type:

None

scripts.bench_utils

Source: scripts/bench_utils.py

Shared utilities for benchmark and evaluation scripts.

scripts.bench_utils.safe_mean(values)[source]

Return the arithmetic mean or 0.0 for an empty sequence.

Parameters:

values (Sequence[float])

Return type:

float

scripts.bench_utils.percentile(values, p)[source]

Return the linear-interpolated percentile for values.

Parameters:
  • values (Sequence[float])

  • p (float)

Return type:

float
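The documented behavior of both helpers can be sketched in a few lines (an illustrative reimplementation, not the module source):

```python
from typing import Sequence

def safe_mean(values: Sequence[float]) -> float:
    """Arithmetic mean, or 0.0 for an empty sequence."""
    return sum(values) / len(values) if values else 0.0

def percentile(values: Sequence[float], p: float) -> float:
    """Linear-interpolated percentile, with p in [0, 100]."""
    if not values:
        return 0.0
    s = sorted(values)
    rank = (p / 100.0) * (len(s) - 1)   # fractional index into the sorted data
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)
```

For example, the 50th percentile of `[1, 2, 3, 4]` interpolates between the two middle values and yields 2.5.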

scripts.bench_utils.format_float(val, decimals=3)[source]

Format a float with a fixed number of decimal places.

Parameters:
  • val (float)

  • decimals (int)

Return type:

str

scripts.bench_utils.build_ragas_embeddings(provider, api_key)[source]

Build a RAGAS-compatible embeddings wrapper for the judge provider.

Parameters:
  • provider (str)

  • api_key (str | None)

Return type:

Any

scripts.benchmark_graphrag

Source: scripts/benchmark_graphrag.py

Reproducible local GraphRAG microbenchmarks.

The goal is not to simulate end-to-end production latency. This script measures deterministic building blocks that materially affect GraphRAG responsiveness:

  • read-only Cypher validation,

  • graph node ranking,

  • token-budget-aware graph context assembly,

  • parallel semantic retrieval fan-out versus a sequential baseline.
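Each building block can be timed with a plain `perf_counter` loop; a generic harness sketch (`time_op` and the default iteration count are illustrative, not this script's API):

```python
import time
from typing import Callable

def time_op(fn: Callable[[], object], iterations: int = 50) -> list[float]:
    """Run fn repeatedly and collect per-call wall-clock timings in milliseconds."""
    timings: list[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings
```

Because the operations are deterministic, median and p95 over such samples are stable across runs on the same machine.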

class scripts.benchmark_graphrag.BenchmarkSummary(name, median_ms, p95_ms, min_ms, max_ms, iterations, extra)[source]

Bases: object

Latency summary for one benchmarked operation.

Parameters:
  • name (str)

  • median_ms (float)

  • p95_ms (float)

  • min_ms (float)

  • max_ms (float)

  • iterations (int)

  • extra (dict[str, Any])
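A summary with these fields might be assembled from raw timings as follows (a sketch: the dataclass decorator, `summarize` helper, and the nearest-rank p95 are assumptions, not the script's implementation):

```python
from dataclasses import dataclass, field
from statistics import median
from typing import Any

@dataclass
class BenchmarkSummary:
    """Mirrors the documented fields of the latency summary."""
    name: str
    median_ms: float
    p95_ms: float
    min_ms: float
    max_ms: float
    iterations: int
    extra: dict[str, Any] = field(default_factory=dict)

def summarize(name: str, timings_ms: list[float]) -> BenchmarkSummary:
    s = sorted(timings_ms)
    # Nearest-rank p95 for simplicity; the real script may interpolate.
    p95 = s[min(len(s) - 1, round(0.95 * (len(s) - 1)))]
    return BenchmarkSummary(name, median(s), p95, s[0], s[-1], len(s), {})
```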

scripts.benchmark_graphrag.main()[source]

Run the local GraphRAG benchmark suite and write a JSON report.

Return type:

None

scripts.benchmark_rag

Source: scripts/benchmark_rag.py

Unified RAG benchmark pipeline.

Runs retrieval and/or generation evaluation, measures latency, computes deltas against a previous run, and produces a timestamped JSON report.

Usage:

python scripts/benchmark_rag.py                                # full benchmark
python scripts/benchmark_rag.py --retrieval-only               # retrieval only (no LLM cost)
python scripts/benchmark_rag.py --compare benchmarks/prev.json # compare with previous
python scripts/benchmark_rag.py --tag "post-rerank-tuning"

scripts.benchmark_rag.run_retrieval_benchmark(*, dataset_path, base_url, modes, top_k, score_threshold, timeout_seconds, filters=None)[source]

Run retrieval evaluation for each requested mode and collect summaries.

Parameters:
  • dataset_path (Path)

  • base_url (str)

  • modes (Sequence[str])

  • top_k (int)

  • score_threshold (float | None)

  • timeout_seconds (float)

  • filters (Dict[str, Any] | None)

Return type:

Dict[str, Any]

scripts.benchmark_rag.run_generation_benchmark(*, dataset_path, base_url, default_mode, default_top_k, default_min_score, timeout_seconds, max_contexts, use_ragas, judge_provider, judge_model, judge_base_url, judge_api_key, judge_timeout_seconds, judge_temperature, faithfulness_threshold, metric_names)[source]

Run answer-generation evaluation and optional RAGAS scoring.

Parameters:
  • dataset_path (Path)

  • base_url (str)

  • default_mode (str)

  • default_top_k (int)

  • default_min_score (float | None)

  • timeout_seconds (float)

  • max_contexts (int)

  • use_ragas (bool)

  • judge_provider (str)

  • judge_model (str)

  • judge_base_url (str)

  • judge_api_key (str | None)

  • judge_timeout_seconds (float)

  • judge_temperature (float)

  • faithfulness_threshold (float)

  • metric_names (Sequence[str])

Return type:

Dict[str, Any]

scripts.benchmark_rag.parse_args()[source]

Parse CLI arguments for the unified benchmark runner.

Return type:

Namespace

scripts.benchmark_rag.main()[source]

Execute the benchmark pipeline and persist the generated report.

Return type:

int

scripts.diag_ner_vs_regex

Source: scripts/diag_ner_vs_regex.py

Diagnostic: compare prose linking with regex-only vs regex+NER.

For a curated list of test sentences, run prose_linker twice — once without the NER external detector (regex+fuzzy alias only) and once with it. Diff the outputs to show exactly what each layer adds.

Run inside the rag-service container so the DB-backed linker and the NER client are wired the same way as production:

docker exec -i rag-service python /tmp/diag_ner_vs_regex.py
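The `ner_added` column falls out of a simple difference between the two runs; with invented link tuples for illustration:

```python
# Hypothetical output of the two prose_linker passes over one sentence:
# (surface form, resolved identifier) pairs.
regex_links = [("GDPR", "32016R0679")]
full_links = [("GDPR", "32016R0679"), ("EDPB", "entity:edpb")]

# Links present only when the NER external detector is enabled.
ner_added = [link for link in full_links if link not in regex_links]
```

Anything in `ner_added` is attributable to the NER layer alone, which is exactly what the diagnostic table reports per sentence.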

class scripts.diag_ner_vs_regex.Result(category, sentence, regex_links, full_links, ner_added)[source]

Bases: object

One row of the diagnostic table: a sentence and the links each layer found.

Parameters:
  • category (str)

  • sentence (str)

  • regex_links (List[Tuple[str, str]])

  • full_links (List[Tuple[str, str]])

  • ner_added (List[Tuple[str, str]])

scripts.diag_ner_vs_regex.main()[source]

Run the diagnostic and print a per-sentence comparison of regex vs NER.

Return type:

int

scripts.eval_ragas

Source: scripts/eval_ragas.py

Evaluate RAG quality against the running rag-service with RAGAS.

Input data can be JSON or JSONL. Each record contains a query and may also define ground_truth, mode, top_k, and min_score overrides.

Reported metrics include faithfulness, answer_relevancy, context_precision, and context_recall when ground-truth answers are available.
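The JSON-or-JSONL convention can be handled by sniffing the first non-whitespace character; a hedged sketch (`load_records` is a hypothetical name, not the module's `load_dataset`):

```python
import json
from pathlib import Path

def load_records(path: Path) -> list[dict]:
    """Accept either a JSON list or one JSON object per line (JSONL)."""
    text = path.read_text(encoding="utf-8").strip()
    if text.startswith("["):
        return json.loads(text)  # whole file is a JSON list
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

A JSONL record needs only `query`; `ground_truth`, `mode`, `top_k`, and `min_score` are per-record overrides, e.g. `{"query": "...", "ground_truth": "...", "top_k": 5}`.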

class scripts.eval_ragas.QAExample(query, ground_truth, mode, top_k, min_score)[source]

Bases: object

One QA evaluation example loaded from the dataset.

Parameters:
  • query (str)

  • ground_truth (str | None)

  • mode (str | None)

  • top_k (int | None)

  • min_score (float | None)

class scripts.eval_ragas.RagasInputRow(query, answer, contexts, ground_truth, mode, top_k, min_score, num_sources, query_id, latency_ms)[source]

Bases: object

Normalized row sent to the RAGAS evaluation pipeline.

Parameters:
  • query (str)

  • answer (str)

  • contexts (tuple[str, ...])

  • ground_truth (str | None)

  • mode (str)

  • top_k (int)

  • min_score (float | None)

  • num_sources (int)

  • query_id (str)

  • latency_ms (float)

scripts.eval_ragas.load_dataset(dataset_path)[source]

Load QA examples from a JSON or JSONL dataset file.

Parameters:

dataset_path (Path)

Return type:

List[QAExample]

scripts.eval_ragas.collect_samples(*, examples, base_url, default_mode, default_top_k, default_min_score, timeout_seconds, max_contexts, include_full_content)[source]

Query the RAG service and normalize responses for evaluation.

Parameters:
  • examples (Sequence[QAExample])

  • base_url (str)

  • default_mode (str)

  • default_top_k (int)

  • default_min_score (float | None)

  • timeout_seconds (float)

  • max_contexts (int)

  • include_full_content (bool)

Return type:

List[RagasInputRow]

scripts.eval_ragas.run_ragas_evaluation(*, rows, judge_provider, judge_model, judge_base_url, judge_api_key, judge_timeout_seconds, judge_temperature, metric_names=AVAILABLE_METRICS, faithfulness_threshold=0.8, embeddings=None)[source]

Run RAGAS scoring for the collected rows and summarize the results.

Parameters:
  • rows (Sequence[RagasInputRow])

  • judge_provider (str)

  • judge_model (str)

  • judge_base_url (str)

  • judge_api_key (str | None)

  • judge_timeout_seconds (float)

  • judge_temperature (float)

  • metric_names (Sequence[str])

  • faithfulness_threshold (float)

  • embeddings (Any)

Return type:

Dict[str, Any]

scripts.eval_ragas.collect_simple_metrics(rows)[source]

Fallback metrics when RAGAS is not installed.

Parameters:

rows (Sequence[RagasInputRow])

Return type:

Dict[str, Any]

scripts.eval_ragas.parse_args()[source]

Parse CLI arguments for the RAGAS evaluation runner.

Return type:

Namespace

scripts.eval_ragas.main()[source]

Run the RAGAS CLI workflow and print the JSON report.

Return type:

int

scripts.eval_retrieval

Source: scripts/eval_retrieval.py

Run offline retrieval evaluation against the running RAG service.

The dataset may be provided as JSONL or as a JSON list. Each record defines a query plus optional expected_celex and expected_subdivision_ids values used to score retrieval quality.
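A hypothetical record and the hit test it implies (field values are invented, and the actual scoring inside `evaluate()` may be more elaborate):

```python
# One dataset record: a query plus the identifiers a good retrieval should surface.
example = {
    "query": "Which regulation governs personal data protection?",
    "expected_celex": ["32016R0679"],
}

# CELEX ids attached to the /search results for that query (invented).
returned_celex = ["32016R0679", "32002L0058"]

# The query counts as a hit if any expected id appears in the results.
hit = any(c in example["expected_celex"] for c in returned_celex)
```

Aggregating such hits over the dataset gives recall-style retrieval quality numbers.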

class scripts.eval_retrieval.EvalExample(query, expected_celex, expected_subdivision_ids)[source]

Bases: object

One retrieval evaluation example with expected hits.

Parameters:
  • query (str)

  • expected_celex (tuple[str, ...])

  • expected_subdivision_ids (tuple[int, ...])

scripts.eval_retrieval.load_dataset(dataset_path)[source]

Load retrieval examples from JSON or JSONL.

Parameters:

dataset_path (Path)

Return type:

List[EvalExample]

scripts.eval_retrieval.evaluate(*, examples, base_url, top_k, mode, score_threshold, timeout_seconds, filters=None)[source]

Evaluate /search responses against the expected identifiers.

Parameters:
  • examples (Sequence[EvalExample])

  • base_url (str)

  • top_k (int)

  • mode (str)

  • score_threshold (float | None)

  • timeout_seconds (float)

  • filters (Dict[str, Any] | None)

Return type:

Dict[str, Any]

scripts.eval_retrieval.parse_args()[source]

Parse CLI arguments for the retrieval evaluation command.

Return type:

Namespace

scripts.eval_retrieval.main()[source]

Run the retrieval evaluation CLI and print the JSON report.

Return type:

int

scripts.extract_scope

Source: scripts/extract_scope.py

Fast extraction of scope articles (Articles 1-3) from EUR-Lex acts.

Strategy: a single bulk SELECT joining acts, article subdivisions, and their subtrees. Per-article content is capped at ~2500 characters, since scope information always appears in the first paragraphs.
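The capping step could look like this (a sketch; the constant matches the documented ~2500-char cap, the word-boundary trim is an assumption):

```python
MAX_SCOPE_CHARS = 2500  # scope information sits in the first paragraphs

def cap_content(text: str, limit: int = MAX_SCOPE_CHARS) -> str:
    """Truncate long article bodies, cutting at the last full word."""
    if len(text) <= limit:
        return text
    return text[:limit].rsplit(" ", 1)[0]
```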

scripts.extract_scope.main()[source]

Extract the first scope articles for each act into a Markdown digest.

scripts.generate_eval_dataset

Source: scripts/generate_eval_dataset.py

Generate evaluation datasets from database contents.

Connects to PostgreSQL, selects acts and subdivisions, then uses the LLM (Mistral) to generate test queries with expected results and ground-truth answers.

Produces two JSONL files:
  • rag_test_retrieval.jsonl (queries + expected_celex)

  • rag_test_qa.jsonl (queries + ground_truth answers)

Usage:

python scripts/generate_eval_dataset.py
python scripts/generate_eval_dataset.py --limit 20 --output-dir scripts/datasets
python scripts/generate_eval_dataset.py --celex 32016R0679 --qa-per-act 3
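Both output files follow plain JSONL, one record per line; a minimal writer sketch (`write_jsonl` is a hypothetical helper, not necessarily the script's internal name):

```python
import json
from pathlib import Path

def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write one compact JSON object per line, preserving non-ASCII text."""
    with path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

The retrieval file pairs each query with `expected_celex`; the QA file pairs each query with a `ground_truth` answer, matching the loaders in scripts.eval_retrieval and scripts.eval_ragas.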

scripts.generate_eval_dataset.generate_retrieval_queries(acts, llm_chain, queries_per_act)[source]

Generate retrieval-oriented evaluation queries for the selected acts.

Parameters:
  • acts (list[Any])

  • llm_chain (Any)

  • queries_per_act (int)

Return type:

list[dict[str, Any]]

scripts.generate_eval_dataset.generate_qa_pairs(acts, repo, session, llm_chain, qa_per_act, min_content_chars)[source]

Generate grounded QA pairs from representative act subdivisions.

Parameters:
  • acts (list[Any])

  • repo (PostgresRepository)

  • session (Any)

  • llm_chain (Any)

  • qa_per_act (int)

  • min_content_chars (int)

Return type:

list[dict[str, Any]]

scripts.generate_eval_dataset.parse_args()[source]

Parse CLI arguments for dataset generation.

Return type:

Namespace

scripts.generate_eval_dataset.main()[source]

Generate retrieval and QA datasets from the current database snapshot.

Return type:

int

scripts.merge_entities

Source: scripts/merge_entities.py

Merge the 8 ENTITIES_part_NN.json files into a consolidated ENTITIES.md.

Deduplicates entities by (name_en, name_fr) and aggregates source_acts. Groups by category, sorts by frequency.
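The merge logic can be sketched as a keyed fold over the partial files (an illustration under assumptions: the key helper mirrors the documented "prefer EN, fallback to FR" rule, and frequency is approximated by the number of aggregated source acts):

```python
def make_key(entity: dict) -> str:
    """Case-insensitive dedup key: prefer the EN name, fall back to FR."""
    return (entity.get("name_en") or entity.get("name_fr") or "").casefold().strip()

def merge(entities: list[dict]) -> list[dict]:
    merged: dict[str, dict] = {}
    for e in entities:
        key = make_key(e)
        if key in merged:
            # Same entity seen in another part file: union its source acts.
            merged[key]["source_acts"] = sorted(
                set(merged[key]["source_acts"]) | set(e.get("source_acts", []))
            )
        else:
            merged[key] = {**e, "source_acts": list(e.get("source_acts", []))}
    # Most frequently referenced entities first.
    return sorted(merged.values(), key=lambda e: -len(e["source_acts"]))
```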

scripts.merge_entities.normalize(s)[source]

Normalize entity labels for case-insensitive deduplication.

Parameters:

s (str | None)

Return type:

str

scripts.merge_entities.make_key(entity)[source]

Build the deduplication key: prefer the EN name, falling back to FR.

Parameters:

entity (dict)

Return type:

str

scripts.merge_entities.main()[source]

Merge partial entity inventories into a consolidated Markdown report.

scripts.report_generator

Source: scripts/report_generator.py

Generate a Markdown benchmark report with embedded charts from a JSON report.

Usage:

python scripts/report_generator.py benchmarks/2026-03-18_123000.json
python scripts/report_generator.py benchmarks/2026-03-18_123000.json --compare benchmarks/prev.json

scripts.report_generator.generate_report(report, output_path, charts_dir, previous_report=None)[source]

Generate a Markdown benchmark report with embedded charts.

Parameters:
  • report (Dict[str, Any]) – Parsed JSON benchmark report.

  • output_path (Path) – Path for the .md file.

  • charts_dir (Path) – Directory for chart PNGs.

  • previous_report (Dict[str, Any] | None) – Optional previous report for comparison deltas.

Returns:

Path to the written .md file.

Return type:

Path

scripts.report_generator.main()[source]

Convert a benchmark JSON report into Markdown plus chart assets.

Return type:

int

scripts.split_scope

Source: scripts/split_scope.py

Split SCOPE.md into N balanced chunks for parallel sub-agent processing.

Splits on act boundaries (## heading) so no act is truncated.
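Splitting on `## ` headings while balancing chunk sizes can be sketched with a lookahead split (an illustration; the function name and greedy balancing are assumptions, not the script's exact algorithm):

```python
import re

def split_on_act_boundaries(markdown: str, n_chunks: int) -> list[str]:
    """Split so every '## '-headed act section stays whole."""
    # Zero-width split: each piece starts at an act heading.
    acts = [a for a in re.split(r"(?m)^(?=## )", markdown) if a]
    target = sum(len(a) for a in acts) / n_chunks
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for act in acts:
        current.append(act)
        size += len(act)
        # Close the chunk once it reaches the target, keeping one chunk in reserve.
        if size >= target and len(chunks) < n_chunks - 1:
            chunks.append("".join(current))
            current, size = [], 0
    if current:
        chunks.append("".join(current))
    return chunks
```

Chunks may differ in size, but concatenating them reproduces the input exactly and no act is ever cut mid-section.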

scripts.split_scope.main()[source]

Split SCOPE.md into balanced chunks without cutting act sections.