Scripts API¶
Note
This page is generated automatically from the repository’s maintained Python module inventory.
Operational, evaluation, and benchmarking scripts shipped with the repository.
scripts¶
Source: scripts/__init__.py
Repository-maintained operational, benchmarking, and evaluation scripts.
scripts.backfill_regulatory_level¶
Source: scripts/backfill_regulatory_level.py
Backfill Lamfalussy level for all acts in the database.
Uses the level field (1=L1, 2=L2, 3=L3) with inference from act metadata.
- Usage:
python scripts/backfill_regulatory_level.py [--dry-run]
scripts.bench_utils¶
Source: scripts/bench_utils.py
Shared utilities for benchmark and evaluation scripts.
- scripts.bench_utils.safe_mean(values)[source]¶
Return the arithmetic mean, or 0.0 for an empty sequence.
- Parameters:
values (Sequence[float])
- Return type:
float
- scripts.bench_utils.percentile(values, p)[source]¶
Return the linear-interpolated percentile for values.
- Parameters:
values (Sequence[float])
p (float)
- Return type:
float
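The two helpers above can be sketched as follows. This is a minimal sketch assuming the standard linear-interpolation percentile rule; the repository's edge-case handling (e.g. for out-of-range `p`) may differ.

```python
from typing import Sequence


def safe_mean(values: Sequence[float]) -> float:
    """Arithmetic mean, or 0.0 for an empty sequence."""
    return sum(values) / len(values) if values else 0.0


def percentile(values: Sequence[float], p: float) -> float:
    """Linear-interpolated percentile of values (p in [0, 100])."""
    if not values:
        return 0.0
    ordered = sorted(values)
    # Fractional rank into the sorted sequence, then interpolate
    # between the two neighbouring order statistics.
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction
```

With this rule, `percentile([1, 2, 3, 4], 50)` interpolates halfway between the two middle values.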
scripts.benchmark_graphrag¶
Source: scripts/benchmark_graphrag.py
Reproducible local GraphRAG microbenchmarks.
The goal is not to simulate end-to-end production latency. This script measures deterministic building blocks that materially affect GraphRAG responsiveness:
read-only Cypher validation,
graph node ranking,
token-budget-aware graph context assembly,
parallel semantic retrieval fan-out versus a sequential baseline.
- class scripts.benchmark_graphrag.BenchmarkSummary(name, median_ms, p95_ms, min_ms, max_ms, iterations, extra)[source]¶
Bases: object
Latency summary for one benchmarked operation.
- Parameters:
name (str)
median_ms (float)
p95_ms (float)
min_ms (float)
max_ms (float)
iterations (int)
extra (dict[str, Any])
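The summary dataclass above reduces raw per-iteration timings to a few robust statistics. A sketch of that reduction, with a hypothetical `summarize` helper (not in the source; the repository's aggregation may use bench_utils.percentile instead of the nearest-rank shortcut shown here):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class BenchmarkSummary:
    """Latency summary for one benchmarked operation."""
    name: str
    median_ms: float
    p95_ms: float
    min_ms: float
    max_ms: float
    iterations: int
    extra: dict[str, Any] = field(default_factory=dict)


def summarize(name: str, timings_ms: list[float]) -> BenchmarkSummary:
    """Hypothetical helper: collapse raw timings into one summary row."""
    ordered = sorted(timings_ms)
    n = len(ordered)
    median = (ordered[(n - 1) // 2] + ordered[n // 2]) / 2
    # Nearest-rank p95 as a simple stand-in for interpolated percentiles.
    p95_index = min(n - 1, round(0.95 * (n - 1)))
    return BenchmarkSummary(
        name=name,
        median_ms=median,
        p95_ms=ordered[p95_index],
        min_ms=ordered[0],
        max_ms=ordered[-1],
        iterations=n,
    )
```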
scripts.benchmark_rag¶
Source: scripts/benchmark_rag.py
Unified RAG benchmark pipeline.
Runs retrieval and/or generation evaluation, measures latency, computes deltas against a previous run, and produces a timestamped JSON report.
- Usage:
python scripts/benchmark_rag.py                                 # full benchmark
python scripts/benchmark_rag.py --retrieval-only                # retrieval only (no LLM cost)
python scripts/benchmark_rag.py --compare benchmarks/prev.json  # compare with previous
python scripts/benchmark_rag.py --tag "post-rerank-tuning"
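The delta computation against a previous run might, in its simplest form, look like this per-metric subtraction (a sketch; `metric_deltas` is a hypothetical name, and the real report likely nests metrics per mode):

```python
from typing import Dict


def metric_deltas(current: Dict[str, float], previous: Dict[str, float]) -> Dict[str, float]:
    """Per-metric difference (current - previous) for metrics present in both reports."""
    return {k: current[k] - previous[k] for k in current if k in previous}
```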
- scripts.benchmark_rag.run_retrieval_benchmark(*, dataset_path, base_url, modes, top_k, score_threshold, timeout_seconds, filters=None)[source]¶
Run retrieval evaluation for each requested mode and collect summaries.
- Parameters:
dataset_path (Path)
base_url (str)
modes (Sequence[str])
top_k (int)
score_threshold (float | None)
timeout_seconds (float)
filters (Dict[str, Any] | None)
- Return type:
Dict[str, Any]
- scripts.benchmark_rag.run_generation_benchmark(*, dataset_path, base_url, default_mode, default_top_k, default_min_score, timeout_seconds, max_contexts, use_ragas, judge_provider, judge_model, judge_base_url, judge_api_key, judge_timeout_seconds, judge_temperature, faithfulness_threshold, metric_names)[source]¶
Run answer-generation evaluation and optional RAGAS scoring.
- Parameters:
dataset_path (Path)
base_url (str)
default_mode (str)
default_top_k (int)
default_min_score (float | None)
timeout_seconds (float)
max_contexts (int)
use_ragas (bool)
judge_provider (str)
judge_model (str)
judge_base_url (str)
judge_api_key (str | None)
judge_timeout_seconds (float)
judge_temperature (float)
faithfulness_threshold (float)
metric_names (Sequence[str])
- Return type:
Dict[str, Any]
scripts.diag_ner_vs_regex¶
Source: scripts/diag_ner_vs_regex.py
Diagnostic: compare prose linking with regex-only vs regex+NER.
For a curated list of test sentences, run prose_linker twice — once without the NER external detector (regex+fuzzy alias only) and once with it. Diff the outputs to show exactly what each layer adds.
Run inside the rag-service container so the DB-backed linker and the NER client are wired the same way as production:
docker exec -i rag-service python /tmp/diag_ner_vs_regex.py
- class scripts.diag_ner_vs_regex.Result(category, sentence, regex_links, full_links, ner_added)[source]¶
Bases: object
One row of the diagnostic table: a sentence and the links each layer found.
- Parameters:
category (str)
sentence (str)
regex_links (List[Tuple[str, str]])
full_links (List[Tuple[str, str]])
ner_added (List[Tuple[str, str]])
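The `ner_added` column is, in effect, the links present in the regex+NER run but absent from the regex-only run. A minimal order-preserving diff (a sketch; the repository's comparison logic may normalize links before diffing):

```python
from typing import List, Tuple


def diff_links(
    regex_links: List[Tuple[str, str]],
    full_links: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Links the regex+NER pass found that the regex-only pass missed."""
    seen = set(regex_links)
    return [link for link in full_links if link not in seen]
```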
scripts.eval_ragas¶
Source: scripts/eval_ragas.py
Evaluate RAG quality against the running rag-service with RAGAS.
Input data can be JSON or JSONL. Each record contains a query and may also
define ground_truth, mode, top_k, and min_score overrides.
Reported metrics include faithfulness, answer_relevancy,
context_precision, and context_recall when ground-truth answers are
available.
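An input record might look like the following. The field values are illustrative, not taken from the repository's datasets; only `query` is required, and the remaining keys override per-run defaults.

```python
import json

# One illustrative JSONL record (all values are made up for this example).
line = (
    '{"query": "What is the territorial scope of the GDPR?", '
    '"ground_truth": "Article 3 GDPR defines the territorial scope.", '
    '"mode": "hybrid", "top_k": 5, "min_score": 0.2}'
)
record = json.loads(line)
assert record["query"]            # required field
mode = record.get("mode")         # optional overrides fall back to CLI defaults
top_k = record.get("top_k")
min_score = record.get("min_score")
```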
- class scripts.eval_ragas.QAExample(query, ground_truth, mode, top_k, min_score)[source]¶
Bases: object
One QA evaluation example loaded from the dataset.
- Parameters:
query (str)
ground_truth (str | None)
mode (str | None)
top_k (int | None)
min_score (float | None)
- class scripts.eval_ragas.RagasInputRow(query, answer, contexts, ground_truth, mode, top_k, min_score, num_sources, query_id, latency_ms)[source]¶
Bases: object
Normalized row sent to the RAGAS evaluation pipeline.
- Parameters:
query (str)
answer (str)
contexts (tuple[str, ...])
ground_truth (str | None)
mode (str)
top_k (int)
min_score (float | None)
num_sources (int)
query_id (str)
latency_ms (float)
- scripts.eval_ragas.load_dataset(dataset_path)[source]¶
Load QA examples from a JSON or JSONL dataset file.
- Parameters:
dataset_path (Path)
- Return type:
List[QAExample]
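Supporting both JSON and JSONL typically comes down to checking whether the file body is a JSON array. A sketch of that dispatch (`parse_records` is a hypothetical name; the real loader additionally maps each dict onto `QAExample`):

```python
import json
from typing import List


def parse_records(text: str) -> List[dict]:
    """Parse a dataset given as a JSON list or as JSONL (one object per line)."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)  # plain JSON list
    # Otherwise treat each non-blank line as one JSON object.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]
```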
- scripts.eval_ragas.collect_samples(*, examples, base_url, default_mode, default_top_k, default_min_score, timeout_seconds, max_contexts, include_full_content)[source]¶
Query the RAG service and normalize responses for evaluation.
- Parameters:
examples (Sequence[QAExample])
base_url (str)
default_mode (str)
default_top_k (int)
default_min_score (float | None)
timeout_seconds (float)
max_contexts (int)
include_full_content (bool)
- Return type:
List[RagasInputRow]
- scripts.eval_ragas.run_ragas_evaluation(*, rows, judge_provider, judge_model, judge_base_url, judge_api_key, judge_timeout_seconds, judge_temperature, metric_names=AVAILABLE_METRICS, faithfulness_threshold=0.8, embeddings=None)[source]¶
Run RAGAS scoring for the collected rows and summarize the results.
- Parameters:
rows (Sequence[RagasInputRow])
judge_provider (str)
judge_model (str)
judge_base_url (str)
judge_api_key (str | None)
judge_timeout_seconds (float)
judge_temperature (float)
metric_names (Sequence[str])
faithfulness_threshold (float)
embeddings (Any)
- Return type:
Dict[str, Any]
- scripts.eval_ragas.collect_simple_metrics(rows)[source]¶
Fallback metrics when RAGAS is not installed.
- Parameters:
rows (Sequence[RagasInputRow])
- Return type:
Dict[str, Any]
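Such a fallback can only report statistics that need no LLM judge, e.g. sample counts and latency/context averages. A sketch of a plausible shape (hypothetical keys; the repository's actual summary may expose different fields):

```python
from typing import Any, Dict, Sequence


def simple_metrics(
    latencies_ms: Sequence[float],
    context_counts: Sequence[int],
) -> Dict[str, Any]:
    """LLM-free statistics usable when RAGAS is not installed."""
    n = len(latencies_ms)
    return {
        "num_samples": n,
        "avg_latency_ms": sum(latencies_ms) / n if n else 0.0,
        "avg_contexts": sum(context_counts) / n if n else 0.0,
    }
```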
scripts.eval_retrieval¶
Source: scripts/eval_retrieval.py
Run offline retrieval evaluation against the running RAG service.
The dataset may be provided as JSONL or as a JSON list. Each record defines a
query plus optional expected_celex and expected_subdivision_ids
values used to score retrieval quality.
- class scripts.eval_retrieval.EvalExample(query, expected_celex, expected_subdivision_ids)[source]¶
Bases: object
One retrieval evaluation example with expected hits.
- Parameters:
query (str)
expected_celex (tuple[str, ...])
expected_subdivision_ids (tuple[int, ...])
- scripts.eval_retrieval.load_dataset(dataset_path)[source]¶
Load retrieval examples from JSON or JSONL.
- Parameters:
dataset_path (Path)
- Return type:
List[EvalExample]
- scripts.eval_retrieval.evaluate(*, examples, base_url, top_k, mode, score_threshold, timeout_seconds, filters=None)[source]¶
Evaluate /search responses against the expected identifiers.
- Parameters:
examples (Sequence[EvalExample])
base_url (str)
top_k (int)
mode (str)
score_threshold (float | None)
timeout_seconds (float)
filters (Dict[str, Any] | None)
- Return type:
Dict[str, Any]
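Scoring one example against its expected CELEX identifiers reduces to set membership over the top-k results. A sketch of two common retrieval metrics (the names and exact formulas in `evaluate()` may differ):

```python
from typing import Sequence


def hit_at_k(returned_celex: Sequence[str], expected_celex: Sequence[str]) -> bool:
    """True if any expected CELEX id appears among the returned results."""
    expected = set(expected_celex)
    return any(c in expected for c in returned_celex)


def recall_at_k(returned_celex: Sequence[str], expected_celex: Sequence[str]) -> float:
    """Fraction of expected CELEX ids found among the returned results."""
    expected = set(expected_celex)
    if not expected:
        return 0.0
    return len(set(returned_celex) & expected) / len(expected)
```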
scripts.extract_scope¶
Source: scripts/extract_scope.py
Fast extraction of scope articles (Articles 1-3) from EUR-Lex acts.
Strategy: a single bulk SELECT joining acts, article subdivisions, and their subtrees. Caps per-article content at ~2500 characters (scope information is always in the first paragraphs).
scripts.generate_eval_dataset¶
Source: scripts/generate_eval_dataset.py
Generate evaluation datasets from database contents.
Connects to PostgreSQL, selects acts and subdivisions, then uses the LLM (Mistral) to generate test queries with expected results and ground-truth answers.
- Produces two JSONL files:
rag_test_retrieval.jsonl (queries + expected_celex)
rag_test_qa.jsonl (queries + ground_truth answers)
- Usage:
python scripts/generate_eval_dataset.py
python scripts/generate_eval_dataset.py --limit 20 --output-dir scripts/datasets
python scripts/generate_eval_dataset.py --celex 32016R0679 --qa-per-act 3
- scripts.generate_eval_dataset.generate_retrieval_queries(acts, llm_chain, queries_per_act)[source]¶
Generate retrieval-oriented evaluation queries for the selected acts.
- Parameters:
acts (list[Any])
llm_chain (Any)
queries_per_act (int)
- Return type:
list[dict[str, Any]]
- scripts.generate_eval_dataset.generate_qa_pairs(acts, repo, session, llm_chain, qa_per_act, min_content_chars)[source]¶
Generate grounded QA pairs from representative act subdivisions.
- Parameters:
acts (list[Any])
repo (PostgresRepository)
session (Any)
llm_chain (Any)
qa_per_act (int)
min_content_chars (int)
- Return type:
list[dict[str, Any]]
scripts.merge_entities¶
Source: scripts/merge_entities.py
Merge the 8 ENTITIES_part_NN.json files into a consolidated ENTITIES.md.
Deduplicates entities by (name_en, name_fr) and aggregates source_acts. Groups by category, sorts by frequency.
- scripts.merge_entities.normalize(s)[source]¶
Normalize entity labels for case-insensitive deduplication.
- Parameters:
s (str | None)
- Return type:
str
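A plausible sketch of the normalization and of the (name_en, name_fr) dedup key described above. The `dedup_key` helper is hypothetical, and the repository's `normalize()` may apply additional cleanup beyond case folding and whitespace collapsing:

```python
from typing import Optional, Tuple


def normalize(s: Optional[str]) -> str:
    """Lowercased, whitespace-collapsed label; empty string for None."""
    if s is None:
        return ""
    return " ".join(s.split()).lower()


def dedup_key(entity: dict) -> Tuple[str, str]:
    """Hypothetical helper: entities merge when both names normalize identically."""
    return (normalize(entity.get("name_en")), normalize(entity.get("name_fr")))
```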
scripts.report_generator¶
Source: scripts/report_generator.py
Generate a Markdown benchmark report with embedded charts from a JSON report.
- Usage:
python scripts/report_generator.py benchmarks/2026-03-18_123000.json
python scripts/report_generator.py benchmarks/2026-03-18_123000.json --compare benchmarks/prev.json
- scripts.report_generator.generate_report(report, output_path, charts_dir, previous_report=None)[source]¶
Generate a Markdown benchmark report with embedded charts.
- Parameters:
report (Dict[str, Any]) – Parsed JSON benchmark report.
output_path (Path) – Path for the .md file.
charts_dir (Path) – Directory for chart PNGs.
previous_report (Dict[str, Any] | None) – Optional previous report for comparison deltas.
- Returns:
Path to the written .md file.
- Return type:
Path
scripts.split_scope¶
Source: scripts/split_scope.py
Split SCOPE.md into N balanced chunks for parallel sub-agent processing.
Splits on act boundaries (## heading) so no act is truncated.
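The split described above can be sketched as: cut the document into acts at `## ` headings, then pack consecutive acts into chunks of roughly equal character count. A minimal sketch (`split_on_acts` is a hypothetical name; the repository's balancing heuristic may differ):

```python
from typing import List


def split_on_acts(markdown: str, n_chunks: int) -> List[str]:
    """Split Markdown into up to n_chunks contiguous chunks, cutting only at '## ' headings."""
    # First, cut into act-sized pieces so no act can be truncated.
    acts: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines(keepends=True):
        if line.startswith("## ") and current:
            acts.append("".join(current))
            current = []
        current.append(line)
    if current:
        acts.append("".join(current))

    # Then greedily pack acts toward an average target size per chunk.
    target = sum(len(a) for a in acts) / n_chunks
    chunks: List[str] = []
    buf = ""
    for act in acts:
        if buf and len(buf) + len(act) > target and len(chunks) < n_chunks - 1:
            chunks.append(buf)
            buf = ""
        buf += act
    if buf:
        chunks.append(buf)
    return chunks
```

Because cuts happen only between acts, concatenating the chunks reproduces the input byte-for-byte.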