Chunking API¶
Note
This page is generated automatically from the repository’s maintained Python module inventory.
Chunking interfaces, article-level planning, and semantic chunking implementation details.
lalandre_chunking¶
Source: packages/lalandre_chunking/lalandre_chunking/__init__.py
Chunking Service
- lalandre_chunking.get_chunker(*, embedding_provider, min_chunk_size, max_chunk_size, chunk_overlap, chars_per_token=3.3, breakpoint_percentile=90.0, breakpoint_max_threshold=1.0, sentence_window_size=1, embedding_batch_size=32)[source]¶
Factory for the SAC chunker.
- Parameters:
embedding_provider (EmbeddingProvider)
min_chunk_size (int)
max_chunk_size (int)
chunk_overlap (int)
chars_per_token (float)
breakpoint_percentile (float)
breakpoint_max_threshold (float)
sentence_window_size (int)
embedding_batch_size (int)
- Return type:
SACChunker
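A small sketch of how the sizing parameters appear to interact. That the min/max sizes are token budgets converted to character budgets via chars_per_token is an assumption suggested by the parameter names and the 3.3 default, not a documented guarantee:

```python
# Assumption: chunk sizes are token counts, and chars_per_token (~3.3
# characters per token) converts them into character budgets internally.
def token_budget_to_chars(tokens: int, chars_per_token: float = 3.3) -> int:
    """Approximate character budget for a token budget."""
    return int(tokens * chars_per_token)

max_chars = token_budget_to_chars(512)  # e.g. max_chunk_size=512 tokens
min_chars = token_budget_to_chars(128)  # e.g. min_chunk_size=128 tokens
```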
lalandre_chunking.chunker¶
Source: packages/lalandre_chunking/lalandre_chunking/chunker.py
Chunking base classes and helpers
- class lalandre_chunking.chunker.Chunker(min_chunk_size, max_chunk_size, chunk_overlap, chars_per_token=_DEFAULT_CHARS_PER_TOKEN)[source]¶
Bases:
ABC
Base chunker interface.
- Parameters:
min_chunk_size (int)
max_chunk_size (int)
chunk_overlap (int)
chars_per_token (float)
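The interface can be exercised with a minimal concrete subclass. This is a local stand-in, not the real base class: the abstract method name (`chunk`) and the char-based splitting are assumptions for illustration only.

```python
# Local stand-in mirroring the documented Chunker constructor; the abstract
# method name and the greedy char-window splitting are assumptions.
from abc import ABC, abstractmethod

_DEFAULT_CHARS_PER_TOKEN = 3.3  # mirrors the documented default

class Chunker(ABC):
    def __init__(self, min_chunk_size: int, max_chunk_size: int,
                 chunk_overlap: int,
                 chars_per_token: float = _DEFAULT_CHARS_PER_TOKEN) -> None:
        self.min_chunk_size = min_chunk_size  # in tokens (assumed)
        self.max_chunk_size = max_chunk_size  # in tokens (assumed)
        self.chunk_overlap = chunk_overlap    # in tokens (assumed)
        self.chars_per_token = chars_per_token

    @abstractmethod
    def chunk(self, text: str) -> list[str]:
        """Split *text* into chunks."""

class FixedSizeChunker(Chunker):
    """Naive subclass: greedy character windows with overlap."""

    def chunk(self, text: str) -> list[str]:
        max_chars = int(self.max_chunk_size * self.chars_per_token)
        overlap_chars = int(self.chunk_overlap * self.chars_per_token)
        step = max(max_chars - overlap_chars, 1)
        return [text[i:i + max_chars] for i in range(0, len(text), step)]

chunker = FixedSizeChunker(min_chunk_size=10, max_chunk_size=20, chunk_overlap=5)
pieces = chunker.chunk("x" * 200)  # 66-char windows, 50-char stride
```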
lalandre_chunking.pipeline¶
Source: packages/lalandre_chunking/lalandre_chunking/pipeline.py
Keeps Postgres / Qdrant chunk artefacts in sync and provides a canonical dict representation of chunk records.
- lalandre_chunking.pipeline.serialize_chunk_records(chunks)[source]¶
Convert chunk ORM objects to plain dicts for insert_chunk_records.
- Parameters:
chunks (list[Any])
- Return type:
list[dict[str, Any]]
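A hypothetical sketch of this conversion. The column set (`id`, `celex`, `content`) is an assumption for illustration, not the package's real chunk schema:

```python
# Illustrative only: ORM-like objects -> plain dicts, with an assumed
# attribute list standing in for the real chunk record columns.
from types import SimpleNamespace
from typing import Any

CHUNK_FIELDS = ("id", "celex", "content")  # assumed attribute names

def serialize_chunk_records(chunks: list[Any]) -> list[dict[str, Any]]:
    """Mirror of the documented signature."""
    return [{field: getattr(chunk, field) for field in CHUNK_FIELDS}
            for chunk in chunks]

rows = serialize_chunk_records(
    [SimpleNamespace(id=1, celex="32016R0679", content="Article 1 ...")]
)
```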
- class lalandre_chunking.pipeline.ArticleLevelPlan(active, skip_ids, article_content)[source]¶
Bases:
object
Pre-computed plan for article-level chunking of an act.
- Parameters:
active (bool)
skip_ids (set[int])
article_content (dict[int, str])
- active¶
Whether article-level chunking applies for this act.
- Type:
bool
- skip_ids¶
IDs of paragraph subdivisions folded into their parent article.
- Type:
set[int]
- article_content¶
Mapping of article subdivision IDs to aggregated content.
- Type:
dict[int, str]
- lalandre_chunking.pipeline.prepare_article_level_plan(celex, subdivisions, article_level_enabled)[source]¶
Build an ArticleLevelPlan for the given act.
- Parameters:
celex (str) – Normalized CELEX identifier of the act.
subdivisions (list[Any]) – Ordered subdivisions for the act, including child paragraphs.
article_level_enabled (bool) – Value of config.chunking.article_level_chunking.
- Return type:
ArticleLevelPlan
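A hedged sketch of the planning step: aggregate paragraph content into the parent article and mark the folded paragraphs as skippable. The subdivision attributes (`id`, `parent_id`, `kind`, `content`) are assumptions about the ORM objects, and a local dataclass stands in for the real ArticleLevelPlan:

```python
# Illustrative only; attribute names on subdivisions are assumptions.
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class ArticleLevelPlan:  # local stand-in mirroring the documented fields
    active: bool
    skip_ids: set[int]
    article_content: dict[int, str]

def prepare_article_level_plan(celex, subdivisions, article_level_enabled):
    if not article_level_enabled:
        return ArticleLevelPlan(active=False, skip_ids=set(), article_content={})
    skip_ids: set[int] = set()
    article_content: dict[int, str] = {}
    for sub in subdivisions:  # assumed ordered: articles before their paragraphs
        if sub.kind == "article":
            article_content[sub.id] = sub.content
        elif sub.kind == "paragraph" and sub.parent_id in article_content:
            # Fold the paragraph into its parent article and skip it later.
            article_content[sub.parent_id] += "\n" + sub.content
            skip_ids.add(sub.id)
    return ArticleLevelPlan(active=True, skip_ids=skip_ids,
                            article_content=article_content)

subs = [
    SimpleNamespace(id=10, parent_id=None, kind="article", content="Article 1"),
    SimpleNamespace(id=11, parent_id=10, kind="paragraph", content="1. Scope."),
]
plan = prepare_article_level_plan("32016R0679", subs, True)
```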
- lalandre_chunking.pipeline.make_article_level_chunks(*, chunker, subdivision, article_level_plan)[source]¶
Return one full-article chunk when article-level chunking applies.
- Parameters:
chunker (Any)
subdivision (Any)
article_level_plan (ArticleLevelPlan)
- Return type:
list[Any] | None
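A sketch of the decision logic implied by the docstring and the `list[Any] | None` return type. The plan stand-in, the `id` attribute on subdivisions, and the meaning of each return value are assumptions; the `chunker` argument is omitted here for brevity:

```python
# Illustrative only: None = fall back to normal chunking, [] = paragraph
# already covered by its article, [content] = one full-article chunk.
# These return-value semantics are an assumption, not documented behaviour.
from dataclasses import dataclass, field
from types import SimpleNamespace

@dataclass
class ArticleLevelPlan:  # local stand-in mirroring the documented fields
    active: bool
    skip_ids: set[int] = field(default_factory=set)
    article_content: dict[int, str] = field(default_factory=dict)

def make_article_level_chunks(*, subdivision, article_level_plan):
    if not article_level_plan.active:
        return None                          # feature disabled: chunk normally
    if subdivision.id in article_level_plan.skip_ids:
        return []                            # paragraph folded into its article
    content = article_level_plan.article_content.get(subdivision.id)
    if content is None:
        return None                          # not an article: chunk normally
    return [content]                         # one full-article chunk

plan = ArticleLevelPlan(active=True, skip_ids={11},
                        article_content={10: "Full article text."})
article_chunks = make_article_level_chunks(
    subdivision=SimpleNamespace(id=10), article_level_plan=plan)
```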
lalandre_chunking.sac_chunker¶
Source: packages/lalandre_chunking/lalandre_chunking/sac_chunker.py
Semantic Aware Chunking (SAC) for Legal Documents. Detects chunk boundaries by measuring embedding similarity between consecutive sentences: breaks happen where the meaning shifts the most.
- class lalandre_chunking.sac_chunker.SACChunker(embedding_provider, min_chunk_size, max_chunk_size, chunk_overlap, chars_per_token=3.3, breakpoint_percentile=90.0, breakpoint_max_threshold=1.0, sentence_window_size=1, batch_size=32)[source]¶
Bases:
Chunker
Semantic Aware Chunker.
1. Split text into sentences.
2. Embed each sentence (optionally using a sliding window).
3. Compute cosine similarity between consecutive embeddings.
4. Place breakpoints where similarity drops below a percentile threshold.
5. Group sentences between breakpoints, then enforce min/max size constraints.
- Parameters:
embedding_provider (EmbeddingProvider)
min_chunk_size (int)
max_chunk_size (int)
chunk_overlap (int)
chars_per_token (float)
breakpoint_percentile (float)
breakpoint_max_threshold (float)
sentence_window_size (int)
batch_size (int)
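The boundary-detection steps above can be sketched end to end. A toy bag-of-characters embedding stands in for the real EmbeddingProvider so the logic runs without a model, and the handling of breakpoint_percentile / breakpoint_max_threshold below is one plausible reading, not the package's exact formula:

```python
# Self-contained sketch of SAC boundary detection; the embedding and the
# percentile cutoff rule are assumptions for illustration only.
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy embedding: bag of lowercase characters."""
    return Counter(sentence.lower())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def percentile(values, pct):
    """Simple nearest-rank percentile (stdlib only)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(pct / 100 * (len(s) - 1))))
    return s[k]

def sac_boundaries(sentences, breakpoint_percentile=90.0, max_threshold=1.0):
    embs = [embed(s) for s in sentences]
    sims = [cosine(embs[i], embs[i + 1]) for i in range(len(embs) - 1)]
    # Break where similarity falls into the lowest (100 - percentile) band,
    # never breaking on similarities above max_threshold.
    cutoff = min(percentile(sims, 100 - breakpoint_percentile), max_threshold)
    return [i + 1 for i, sim in enumerate(sims) if sim <= cutoff]

sentences = [
    "The regulation applies to personal data.",
    "Personal data means any information about a person.",
    "1234567890",  # deliberately dissimilar: forces a breakpoint before it
]
cuts = sac_boundaries(sentences)
chunks, start = [], 0
for cut in cuts + [len(sentences)]:
    chunks.append(" ".join(sentences[start:cut]))
    start = cut
```

The real class would then enforce min/max size constraints and overlap on these groups (steps the sketch omits).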