Scripts API¶
Note
This page is generated automatically from the repository’s maintained Python module inventory.
Operational, evaluation, and benchmarking scripts shipped with the repository.
scripts¶
Source: scripts/__init__.py
Repository-maintained operational, benchmarking, and evaluation scripts.
scripts.backfill_regulatory_level¶
Source: scripts/backfill_regulatory_level.py
Backfill Lamfalussy level for all acts in the database.
Uses the level field (1=L1, 2=L2, 3=L3) with inference from act metadata.
- Usage:
python scripts/backfill_regulatory_level.py [--dry-run]
scripts.bench_utils¶
Source: scripts/bench_utils.py
Shared utilities for benchmark and evaluation scripts.
- scripts.bench_utils.safe_mean(values)[source]¶
Return the arithmetic mean, or 0.0 for an empty sequence.
- Parameters:
values (Sequence[float])
- Return type:
float
- scripts.bench_utils.percentile(values, p)[source]¶
Return the linear-interpolated percentile for values.
- Parameters:
values (Sequence[float])
p (float)
- Return type:
float
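The two helpers above can be sketched as follows. This is a minimal sketch assuming the standard linear-interpolation percentile rule; the repository's edge-case handling (e.g. for out-of-range `p`) may differ.

```python
from typing import Sequence


def safe_mean(values: Sequence[float]) -> float:
    """Arithmetic mean, or 0.0 for an empty sequence."""
    return sum(values) / len(values) if values else 0.0


def percentile(values: Sequence[float], p: float) -> float:
    """Linear-interpolated percentile of values (p in [0, 100])."""
    if not values:
        return 0.0
    ordered = sorted(values)
    # Fractional rank into the sorted sequence, then interpolate
    # between the two neighbouring order statistics.
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction
```

With this rule, `percentile([1, 2, 3, 4], 50)` interpolates halfway between the two middle values.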
scripts.benchmark_graphrag¶
Source: scripts/benchmark_graphrag.py
Reproducible local GraphRAG microbenchmarks.
The goal is not to simulate end-to-end production latency. This script measures deterministic building blocks that materially affect GraphRAG responsiveness:
read-only Cypher validation,
graph node ranking,
token-budget-aware graph context assembly,
parallel semantic retrieval fan-out versus a sequential baseline.
- class scripts.benchmark_graphrag.BenchmarkSummary(name, median_ms, p95_ms, min_ms, max_ms, iterations, extra)[source]¶
Bases: object
Latency summary for one benchmarked operation.
- Parameters:
name (str)
median_ms (float)
p95_ms (float)
min_ms (float)
max_ms (float)
iterations (int)
extra (dict[str, Any])
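The summary dataclass above reduces raw per-iteration timings to a few robust statistics. A sketch of that reduction, with a hypothetical `summarize` helper (not in the source; the repository's aggregation may use bench_utils.percentile instead of the nearest-rank shortcut shown here):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class BenchmarkSummary:
    """Latency summary for one benchmarked operation."""
    name: str
    median_ms: float
    p95_ms: float
    min_ms: float
    max_ms: float
    iterations: int
    extra: dict[str, Any] = field(default_factory=dict)


def summarize(name: str, timings_ms: list[float]) -> BenchmarkSummary:
    """Hypothetical helper: collapse raw timings into one summary row."""
    ordered = sorted(timings_ms)
    n = len(ordered)
    median = (ordered[(n - 1) // 2] + ordered[n // 2]) / 2
    # Nearest-rank p95 as a simple stand-in for interpolated percentiles.
    p95_index = min(n - 1, round(0.95 * (n - 1)))
    return BenchmarkSummary(
        name=name,
        median_ms=median,
        p95_ms=ordered[p95_index],
        min_ms=ordered[0],
        max_ms=ordered[-1],
        iterations=n,
    )
```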
scripts.benchmark_rag¶
Source: scripts/benchmark_rag.py
Unified RAG benchmark pipeline.
Runs retrieval and/or generation evaluation, measures latency, computes deltas against a previous run, and produces a timestamped JSON report.
- Usage:
python scripts/benchmark_rag.py                                 # full benchmark
python scripts/benchmark_rag.py --retrieval-only                # retrieval only (no LLM cost)
python scripts/benchmark_rag.py --compare benchmarks/prev.json  # compare with previous
python scripts/benchmark_rag.py --tag "post-rerank-tuning"
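The delta computation against a previous run might, in its simplest form, look like this per-metric subtraction (a sketch; `metric_deltas` is a hypothetical name, and the real report likely nests metrics per mode):

```python
from typing import Dict


def metric_deltas(current: Dict[str, float], previous: Dict[str, float]) -> Dict[str, float]:
    """Per-metric difference (current - previous) for metrics present in both reports."""
    return {k: current[k] - previous[k] for k in current if k in previous}
```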
- scripts.benchmark_rag.run_retrieval_benchmark(*, dataset_path, base_url, modes, top_k, score_threshold, timeout_seconds, filters=None)[source]¶
Run retrieval evaluation for each requested mode and collect summaries.
- Parameters:
dataset_path (Path)
base_url (str)
modes (Sequence[str])
top_k (int)
score_threshold (float | None)
timeout_seconds (float)
filters (Dict[str, Any] | None)
- Return type:
Dict[str, Any]
- scripts.benchmark_rag.run_generation_benchmark(*, dataset_path, base_url, default_mode, default_top_k, default_min_score, timeout_seconds, max_contexts, use_ragas, judge_provider, judge_model, judge_base_url, judge_api_key, judge_timeout_seconds, judge_temperature, faithfulness_threshold, metric_names)[source]¶
Run answer-generation evaluation and optional RAGAS scoring.
- Parameters:
dataset_path (Path)
base_url (str)
default_mode (str)
default_top_k (int)
default_min_score (float | None)
timeout_seconds (float)
max_contexts (int)
use_ragas (bool)
judge_provider (str)
judge_model (str)
judge_base_url (str)
judge_api_key (str | None)
judge_timeout_seconds (float)
judge_temperature (float)
faithfulness_threshold (float)
metric_names (Sequence[str])
- Return type:
Dict[str, Any]
scripts.diag_ner_vs_regex¶
Source: scripts/diag_ner_vs_regex.py
Diagnostic: compare prose linking with regex-only vs regex+NER.
For a curated list of test sentences, run prose_linker twice — once without the NER external detector (regex+fuzzy alias only) and once with it. Diff the outputs to show exactly what each layer adds.
Run inside the rag-service container so the DB-backed linker and the NER client are wired the same way as production:
docker exec -i rag-service python /tmp/diag_ner_vs_regex.py
- class scripts.diag_ner_vs_regex.Result(category, sentence, regex_links, full_links, ner_added)[source]¶
Bases: object
One row of the diagnostic table: a sentence and the links each layer found.
- Parameters:
category (str)
sentence (str)
regex_links (List[Tuple[str, str]])
full_links (List[Tuple[str, str]])
ner_added (List[Tuple[str, str]])
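The `ner_added` column is, in effect, the links present in the regex+NER run but absent from the regex-only run. A minimal order-preserving diff (a sketch; the repository's comparison logic may normalize links before diffing):

```python
from typing import List, Tuple


def diff_links(
    regex_links: List[Tuple[str, str]],
    full_links: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Links the regex+NER pass found that the regex-only pass missed."""
    seen = set(regex_links)
    return [link for link in full_links if link not in seen]
```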
scripts.eval_ragas¶
Source: scripts/eval_ragas.py
Evaluate RAG quality against the running rag-service with RAGAS.
Input data can be JSON or JSONL. Each record contains a query and may also
define ground_truth, mode, top_k, and min_score overrides.
Reported metrics include faithfulness, answer_relevancy,
context_precision, and context_recall when ground-truth answers are
available.
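An input record might look like the following. The field values are illustrative, not taken from the repository's datasets; only `query` is required, and the remaining keys override per-run defaults.

```python
import json

# One illustrative JSONL record (all values are made up for this example).
line = (
    '{"query": "What is the territorial scope of the GDPR?", '
    '"ground_truth": "Article 3 GDPR defines the territorial scope.", '
    '"mode": "hybrid", "top_k": 5, "min_score": 0.2}'
)
record = json.loads(line)
assert record["query"]            # required field
mode = record.get("mode")         # optional overrides fall back to CLI defaults
top_k = record.get("top_k")
min_score = record.get("min_score")
```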
- class scripts.eval_ragas.QAExample(query, ground_truth, mode, top_k, min_score)[source]¶
Bases: object
One QA evaluation example loaded from the dataset.
- Parameters:
query (str)
ground_truth (str | None)
mode (str | None)
top_k (int | None)
min_score (float | None)
- class scripts.eval_ragas.RagasInputRow(query, answer, contexts, ground_truth, mode, top_k, min_score, num_sources, query_id, latency_ms)[source]¶
Bases: object
Normalized row sent to the RAGAS evaluation pipeline.
- Parameters:
query (str)
answer (str)
contexts (tuple[str, ...])
ground_truth (str | None)
mode (str)
top_k (int)
min_score (float | None)
num_sources (int)
query_id (str)
latency_ms (float)
- scripts.eval_ragas.load_dataset(dataset_path)[source]¶
Load QA examples from a JSON or JSONL dataset file.
- Parameters:
dataset_path (Path)
- Return type:
List[QAExample]
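Supporting both JSON and JSONL typically comes down to checking whether the file body is a JSON array. A sketch of that dispatch (`parse_records` is a hypothetical name; the real loader additionally maps each dict onto `QAExample`):

```python
import json
from typing import List


def parse_records(text: str) -> List[dict]:
    """Parse a dataset given as a JSON list or as JSONL (one object per line)."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)  # plain JSON list
    # Otherwise treat each non-blank line as one JSON object.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]
```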
- scripts.eval_ragas.collect_samples(*, examples, base_url, default_mode, default_top_k, default_min_score, timeout_seconds, max_contexts, include_full_content)[source]¶
Query the RAG service and normalize responses for evaluation.
- Parameters:
examples (Sequence[QAExample])
base_url (str)
default_mode (str)
default_top_k (int)
default_min_score (float | None)
timeout_seconds (float)
max_contexts (int)
include_full_content (bool)
- Return type:
List[RagasInputRow]
- scripts.eval_ragas.run_ragas_evaluation(*, rows, judge_provider, judge_model, judge_base_url, judge_api_key, judge_timeout_seconds, judge_temperature, metric_names=AVAILABLE_METRICS, faithfulness_threshold=0.8, embeddings=None)[source]¶
Run RAGAS scoring for the collected rows and summarize the results.
- Parameters:
rows (Sequence[RagasInputRow])
judge_provider (str)
judge_model (str)
judge_base_url (str)
judge_api_key (str | None)
judge_timeout_seconds (float)
judge_temperature (float)
metric_names (Sequence[str])
faithfulness_threshold (float)
embeddings (Any)
- Return type:
Dict[str, Any]
- scripts.eval_ragas.collect_simple_metrics(rows)[source]¶
Fallback metrics when RAGAS is not installed.
- Parameters:
rows (Sequence[RagasInputRow])
- Return type:
Dict[str, Any]
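Such a fallback can only report statistics that need no LLM judge, e.g. sample counts and latency/context averages. A sketch of a plausible shape (hypothetical keys; the repository's actual summary may expose different fields):

```python
from typing import Any, Dict, Sequence


def simple_metrics(
    latencies_ms: Sequence[float],
    context_counts: Sequence[int],
) -> Dict[str, Any]:
    """LLM-free statistics usable when RAGAS is not installed."""
    n = len(latencies_ms)
    return {
        "num_samples": n,
        "avg_latency_ms": sum(latencies_ms) / n if n else 0.0,
        "avg_contexts": sum(context_counts) / n if n else 0.0,
    }
```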
scripts.eval_retrieval¶
Source: scripts/eval_retrieval.py
Run offline retrieval evaluation against the running RAG service.
The dataset may be provided as JSONL or as a JSON list. Each record defines a
query plus optional expected_celex and expected_subdivision_ids
values used to score retrieval quality.
- class scripts.eval_retrieval.EvalExample(query, expected_celex, expected_subdivision_ids)[source]¶
Bases: object
One retrieval evaluation example with expected hits.
- Parameters:
query (str)
expected_celex (tuple[str, ...])
expected_subdivision_ids (tuple[int, ...])
- scripts.eval_retrieval.load_dataset(dataset_path)[source]¶
Load retrieval examples from JSON or JSONL.
- Parameters:
dataset_path (Path)
- Return type:
List[EvalExample]
- scripts.eval_retrieval.evaluate(*, examples, base_url, top_k, mode, score_threshold, timeout_seconds, filters=None)[source]¶
Evaluate /search responses against the expected identifiers.
- Parameters:
examples (Sequence[EvalExample])
base_url (str)
top_k (int)
mode (str)
score_threshold (float | None)
timeout_seconds (float)
filters (Dict[str, Any] | None)
- Return type:
Dict[str, Any]
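Scoring one example against its expected CELEX identifiers reduces to set membership over the top-k results. A sketch of two common retrieval metrics (the names and exact formulas in `evaluate()` may differ):

```python
from typing import Sequence


def hit_at_k(returned_celex: Sequence[str], expected_celex: Sequence[str]) -> bool:
    """True if any expected CELEX id appears among the returned results."""
    expected = set(expected_celex)
    return any(c in expected for c in returned_celex)


def recall_at_k(returned_celex: Sequence[str], expected_celex: Sequence[str]) -> float:
    """Fraction of expected CELEX ids found among the returned results."""
    expected = set(expected_celex)
    if not expected:
        return 0.0
    return len(set(returned_celex) & expected) / len(expected)
```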
scripts.extract_scope¶
Source: scripts/extract_scope.py
Fast extraction of scope articles (Articles 1-3) from EUR-Lex acts.
Strategy: a single bulk SELECT joining acts, article subdivisions, and their subtrees. Caps per-article content at ~2500 characters (scope information is always in the first paragraphs).
scripts.generate_eval_dataset¶
Source: scripts/generate_eval_dataset.py
Generate evaluation datasets from database contents.
Connects to PostgreSQL, selects acts and subdivisions, then uses the LLM (Mistral) to generate test queries with expected results and ground-truth answers.
- Produces two JSONL files:
rag_test_retrieval.jsonl (queries + expected_celex)
rag_test_qa.jsonl (queries + ground_truth answers)
- Usage:
python scripts/generate_eval_dataset.py
python scripts/generate_eval_dataset.py --limit 20 --output-dir scripts/datasets
python scripts/generate_eval_dataset.py --celex 32016R0679 --qa-per-act 3
- scripts.generate_eval_dataset.generate_retrieval_queries(acts, llm_chain, queries_per_act)[source]¶
Generate retrieval-oriented evaluation queries for the selected acts.
- Parameters:
acts (list[Any])
llm_chain (Any)
queries_per_act (int)
- Return type:
list[dict[str, Any]]
- scripts.generate_eval_dataset.generate_qa_pairs(acts, repo, session, llm_chain, qa_per_act, min_content_chars)[source]¶
Generate grounded QA pairs from representative act subdivisions.
- Parameters:
acts (list[Any])
repo (PostgresRepository)
session (Any)
llm_chain (Any)
qa_per_act (int)
min_content_chars (int)
- Return type:
list[dict[str, Any]]
scripts.merge_entities¶
Source: scripts/merge_entities.py
Merge the 8 ENTITIES_part_NN.json files into a consolidated ENTITIES.md.
Deduplicates entities by (name_en, name_fr) and aggregates source_acts. Groups by category, sorts by frequency.
- scripts.merge_entities.normalize(s)[source]¶
Normalize entity labels for case-insensitive deduplication.
- Parameters:
s (str | None)
- Return type:
str
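A plausible sketch of the normalization and of the (name_en, name_fr) dedup key described above. The `dedup_key` helper is hypothetical, and the repository's `normalize()` may apply additional cleanup beyond case folding and whitespace collapsing:

```python
from typing import Optional, Tuple


def normalize(s: Optional[str]) -> str:
    """Lowercased, whitespace-collapsed label; empty string for None."""
    if s is None:
        return ""
    return " ".join(s.split()).lower()


def dedup_key(entity: dict) -> Tuple[str, str]:
    """Hypothetical helper: entities merge when both names normalize identically."""
    return (normalize(entity.get("name_en")), normalize(entity.get("name_fr")))
```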
scripts.report_generator¶
Source: scripts/report_generator.py
Generate a Markdown benchmark report with embedded charts from a JSON report.
- Usage:
python scripts/report_generator.py benchmarks/2026-03-18_123000.json
python scripts/report_generator.py benchmarks/2026-03-18_123000.json --compare benchmarks/prev.json
- scripts.report_generator.generate_report(report, output_path, charts_dir, previous_report=None)[source]¶
Generate a Markdown benchmark report with embedded charts.
- Parameters:
report (Dict[str, Any]) – Parsed JSON benchmark report.
output_path (Path) – Path for the .md file.
charts_dir (Path) – Directory for chart PNGs.
previous_report (Dict[str, Any] | None) – Optional previous report for comparison deltas.
- Returns:
Path to the written .md file.
- Return type:
Path
scripts.split_scope¶
Source: scripts/split_scope.py
Split SCOPE.md into N balanced chunks for parallel sub-agent processing.
Splits on act boundaries (## heading) so no act is truncated.
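The split described above can be sketched as: cut the document into acts at `## ` headings, then pack consecutive acts into chunks of roughly equal character count. A minimal sketch (`split_on_acts` is a hypothetical name; the repository's balancing heuristic may differ):

```python
from typing import List


def split_on_acts(markdown: str, n_chunks: int) -> List[str]:
    """Split Markdown into up to n_chunks contiguous chunks, cutting only at '## ' headings."""
    # First, cut into act-sized pieces so no act can be truncated.
    acts: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines(keepends=True):
        if line.startswith("## ") and current:
            acts.append("".join(current))
            current = []
        current.append(line)
    if current:
        acts.append("".join(current))

    # Then greedily pack acts toward an average target size per chunk.
    target = sum(len(a) for a in acts) / n_chunks
    chunks: List[str] = []
    buf = ""
    for act in acts:
        if buf and len(buf) + len(act) > target and len(chunks) < n_chunks - 1:
            chunks.append(buf)
            buf = ""
        buf += act
    if buf:
        chunks.append(buf)
    return chunks
```

Because cuts happen only between acts, concatenating the chunks reproduces the input byte-for-byte.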