Observability

The observability stack combines Prometheus, provisioned Grafana dashboards, and service-local metrics endpoints.

Production Links

The operational endpoints below are live production surfaces. For the full inventory, see Production Surfaces.

Surface	URL	Usage
`docs`	https://docs.prod.raglogic.com	Official production technical documentation.
`grafana`	https://grafana.prod.raglogic.com	Dashboards and runtime metrics.
`qdrant`	https://qdrant.prod.raglogic.com/dashboard	Vector collection inspection.
`neo4j`	https://neo4j.prod.raglogic.com	Graph inspection and troubleshooting.
`bolt`	neo4j+s://bolt.prod.raglogic.com:443	Driver and Neo4j Browser Bolt access.
`portainer`	https://portainer.prod.raglogic.com	Container state and deployment inspection.

Provisioned dashboards

Dashboard JSON	Intent
`monitoring/grafana/dashboards/engine-rag.json`	Runtime and indexing overview
`monitoring/grafana/dashboards/ingestion-runs.json`	Live ingestion progress and failures
`monitoring/grafana/dashboards/vps-resources.json`	Host and container resource usage
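
As a quick sanity check on the provisioned files, the sketch below loads each dashboard JSON and reports its title and panel count. It assumes the files follow the usual Grafana dashboard model with top-level `title` and `panels` keys and that the script runs from the repository root; the function names are illustrative and not part of the project.

```python
import json
from pathlib import Path

# Paths as listed in the table above, relative to the repository root.
DASHBOARDS = [
    "monitoring/grafana/dashboards/engine-rag.json",
    "monitoring/grafana/dashboards/ingestion-runs.json",
    "monitoring/grafana/dashboards/vps-resources.json",
]


def summarize_dashboard(dashboard: dict) -> dict:
    """Return the title and top-level panel titles of a dashboard model."""
    panels = dashboard.get("panels", [])
    return {
        "title": dashboard.get("title", "<untitled>"),
        "panels": [p.get("title", "<untitled>") for p in panels],
    }


def summarize_files(repo_root: str = ".") -> None:
    """Print one summary line per dashboard file, flagging missing files."""
    for rel in DASHBOARDS:
        path = Path(repo_root) / rel
        if not path.is_file():
            print(f"{rel}: missing")
            continue
        info = summarize_dashboard(json.loads(path.read_text(encoding="utf-8")))
        print(f"{rel}: {info['title']} ({len(info['panels'])} panels)")
```

Only top-level panels are counted; panels nested inside collapsed rows would need an extra pass.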

Prometheus scrape inventory

Source of truth: monitoring/prometheus.yml.

Job	Target
`api_gateway`	`api-gateway:8000/metrics`
`rag_service`	`rag-service:8001/metrics`
`embedding_service`	`embedding-service:8002/metrics`
`chunking_worker`	`chunking-worker:9109/metrics`
`embedding_worker`	`embedding-worker:9108/metrics`
`embedding_worker_e5`	`embedding-worker-e5:9108/metrics`
`extraction_worker`	`extraction-worker:9107/metrics`
`postgresql`	`postgres_exporter:9187`
`redis`	`redis_exporter:9121/metrics`
`pdf_extract`	`pdf-extract:8080/metrics`
`docker_stats`	`docker-stats-exporter:9417/`
`node`	`node-exporter:9100`
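
The inventory above can be turned into a small reachability check. This is a minimal sketch rather than project tooling: it assumes the hostnames resolve from wherever the script runs (typically inside the same container network), that the endpoints are served over plain HTTP, and that targets listed without a path use the Prometheus default `/metrics`. The helper names are hypothetical.

```python
from urllib.request import urlopen

# Job -> target, copied from the table above.
SCRAPE_TARGETS = {
    "api_gateway": "api-gateway:8000/metrics",
    "rag_service": "rag-service:8001/metrics",
    "embedding_service": "embedding-service:8002/metrics",
    "chunking_worker": "chunking-worker:9109/metrics",
    "embedding_worker": "embedding-worker:9108/metrics",
    "embedding_worker_e5": "embedding-worker-e5:9108/metrics",
    "extraction_worker": "extraction-worker:9107/metrics",
    "postgresql": "postgres_exporter:9187",
    "redis": "redis_exporter:9121/metrics",
    "pdf_extract": "pdf-extract:8080/metrics",
    "docker_stats": "docker-stats-exporter:9417/",
    "node": "node-exporter:9100",
}


def target_url(target: str, scheme: str = "http") -> str:
    """Build a probe URL; a target without a path gets /metrics (assumption)."""
    if "/" not in target:
        target = f"{target}/metrics"
    return f"{scheme}://{target}"


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False
```

A typical use is to loop over `SCRAPE_TARGETS.items()` and print `probe(target_url(target))` per job. The Targets page of the Prometheus UI remains the authoritative view of scrape health.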

Note

rerank-service is covered today through health probes surfaced by api-gateway and rag-service. It does not have a dedicated Prometheus scrape job in monitoring/prometheus.yml.
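
One way to read that indirect coverage is to pull the backend health gauge from a service's metrics endpoint and look for the rerank entry. The sketch below parses Prometheus text exposition output with the standard library only. It rests on several assumptions: that the health probes are exposed as `lalandre_api_gateway_backend_health`, that a value of 1 means healthy, and that the backend is identified by a label named `backend`. The sample payload is hypothetical and shown only to illustrate the format.

```python
import re

# name{labels} value [timestamp]; the label block is optional.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical payload: label name and values are assumptions, not project output.
SAMPLE = """\
# TYPE lalandre_api_gateway_backend_health gauge
lalandre_api_gateway_backend_health{backend="rag-service"} 1
lalandre_api_gateway_backend_health{backend="rerank-service"} 0
"""


def parse_samples(text: str, metric: str) -> list:
    """Return (labels, value) pairs for every sample of the given metric."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((labels, float(match.group("value"))))
    return samples


def unhealthy_backends(samples: list, label: str = "backend") -> list:
    """List backends whose gauge is not 1 (assumes 1 means healthy)."""
    return [labels.get(label, "<unknown>") for labels, value in samples if value != 1.0]
```

With the sample payload, `unhealthy_backends(parse_samples(SAMPLE, "lalandre_api_gateway_backend_health"))` returns `["rerank-service"]`. Against a real endpoint, the label names should be checked first, since they are not documented here.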

Core runtime signals

Signal family	Metric examples
Gateway edge behavior	`lalandre_api_gateway_query_requests_total`, `lalandre_api_gateway_proxy_errors_total`
RAG request behavior	`lalandre_rag_service_query_requests_total`, `lalandre_rag_service_query_duration_seconds`
Phase timing	`lalandre_rag_service_phase_duration_seconds`
Provider failures	`lalandre_rag_service_provider_errors_total`
Retrieval failures	`lalandre_rag_retrieval_errors_total`
Backend health	`lalandre_api_gateway_backend_health`, `lalandre_rag_service_backend_health`
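
These signals can also be read outside Grafana through the Prometheus HTTP API (`/api/v1/query`). The sketch below builds instant queries for a few of the metrics above. The Prometheus base URL is an assumption, since this page does not list one, and the `rate()` expressions assume the `_total` metrics are counters, as their names suggest. Function and query names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Example PromQL expressions over metrics named in the table above.
QUERIES = {
    "gateway_query_rate": "sum(rate(lalandre_api_gateway_query_requests_total[5m]))",
    "gateway_proxy_error_rate": "sum(rate(lalandre_api_gateway_proxy_errors_total[5m]))",
    "retrieval_error_rate": "sum(rate(lalandre_rag_retrieval_errors_total[5m]))",
}


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def instant_values(payload: dict) -> list:
    """Extract (labels, value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each vector result carries value = [unix_timestamp, "value-as-string"].
    return [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]


def run_query(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Run one instant query and return its samples."""
    with urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        return instant_values(json.load(resp))
```

For example, `run_query("http://prometheus:9090", QUERIES["gateway_query_rate"])` would return the current gateway query rate, where `http://prometheus:9090` is a placeholder address. `instant_values` handles vector results only; scalar results have a different shape.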

Current assessment

The Engine/RAG dashboard now covers the runtime metrics that the current query stack is expected to expose.
The chart audit pages still track provisioned panels separately from expected ones, but the critical runtime gaps for Engine/RAG are now closed.

For the architectural role of each async worker and the queue/job chain behind these metrics endpoints, see Workers.