Chunking API

Note

This page is generated automatically from the repository’s maintained Python module inventory.

Chunking interfaces, article-level planning, and semantic chunking implementation details.

lalandre_chunking

Source: packages/lalandre_chunking/lalandre_chunking/__init__.py

Chunking Service

lalandre_chunking.get_chunker(*, embedding_provider, min_chunk_size, max_chunk_size, chunk_overlap, chars_per_token=3.3, breakpoint_percentile=90.0, breakpoint_max_threshold=1.0, sentence_window_size=1, embedding_batch_size=32)[source]

Factory for the SAC chunker.

Parameters:
  • embedding_provider (EmbeddingProvider)

  • min_chunk_size (int)

  • max_chunk_size (int)

  • chunk_overlap (int)

  • chars_per_token (float)

  • breakpoint_percentile (float)

  • breakpoint_max_threshold (float)

  • sentence_window_size (int)

  • embedding_batch_size (int)

Return type:

SACChunker
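The chars_per_token ratio (default 3.3) lets the chunker translate character lengths into approximate token budgets without running a tokenizer. A minimal standalone sketch of that conversion; the estimate_tokens helper is illustrative only, not part of lalandre_chunking:

```python
def estimate_tokens(text: str, chars_per_token: float = 3.3) -> int:
    """Approximate the token count of `text` from its character length."""
    return max(1, round(len(text) / chars_per_token))

# A min/max chunk size expressed in tokens maps back to characters the same way:
paragraph = "The Commission shall adopt implementing acts." * 4
approx_tokens = estimate_tokens(paragraph)
```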

lalandre_chunking.chunker

Source: packages/lalandre_chunking/lalandre_chunking/chunker.py

Chunking base classes and helpers

class lalandre_chunking.chunker.Chunker(min_chunk_size, max_chunk_size, chunk_overlap, chars_per_token=_DEFAULT_CHARS_PER_TOKEN)[source]

Bases: ABC

Base chunker interface.

Parameters:
  • min_chunk_size (int)

  • max_chunk_size (int)

  • chunk_overlap (int)

  • chars_per_token (float)

abstractmethod chunk_subdivision(subdivision_id, content, subdivision_type=None)[source]

Chunk a subdivision’s content.

Parameters:
  • subdivision_id (int)

  • content (str)

  • subdivision_type (str | None)

Return type:

List[Chunks]
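Concrete chunkers subclass Chunker and implement chunk_subdivision. A simplified standalone sketch of that pattern, using a plain dataclass as a stand-in for the Chunks model and a naive fixed-size strategy (min_chunk_size enforcement omitted for brevity):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Chunk:
    """Stand-in for the Chunks model; fields are illustrative only."""
    subdivision_id: int
    text: str

class Chunker(ABC):
    """Mirrors the documented base interface in simplified form."""
    def __init__(self, min_chunk_size: int, max_chunk_size: int,
                 chunk_overlap: int, chars_per_token: float = 3.3):
        self.min_chunk_size = min_chunk_size
        self.max_chunk_size = max_chunk_size
        self.chunk_overlap = chunk_overlap
        self.chars_per_token = chars_per_token

    @abstractmethod
    def chunk_subdivision(self, subdivision_id: int, content: str,
                          subdivision_type: Optional[str] = None) -> List[Chunk]:
        ...

class FixedSizeChunker(Chunker):
    """Naive implementation: split on a character window with overlap."""
    def chunk_subdivision(self, subdivision_id, content, subdivision_type=None):
        # Token budgets become character budgets via chars_per_token.
        window = int(self.max_chunk_size * self.chars_per_token)
        step = window - int(self.chunk_overlap * self.chars_per_token)
        return [Chunk(subdivision_id, content[i:i + window])
                for i in range(0, len(content), max(step, 1))]

chunker = FixedSizeChunker(min_chunk_size=10, max_chunk_size=100, chunk_overlap=10)
parts = chunker.chunk_subdivision(1, "x" * 1000)
```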

make_single_chunk(subdivision_id, content, metadata=None)[source]

Create a single chunk encompassing all content (no splitting).

Parameters:
  • subdivision_id (int)

  • content (str)

  • metadata (Dict[str, Any] | None)

Return type:

Chunks

lalandre_chunking.pipeline

Source: packages/lalandre_chunking/lalandre_chunking/pipeline.py

Keeps Postgres / Qdrant chunk artefacts in sync and provides a canonical dict representation of chunk records.

lalandre_chunking.pipeline.serialize_chunk_records(chunks)[source]

Convert chunk ORM objects to plain dicts for insert_chunk_records.

Parameters:
  • chunks (list[Any])

Return type:

list[dict[str, Any]]
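The serialization step can be sketched standalone. The ChunkRecord dataclass below is a hypothetical stand-in for a chunk ORM row (real column names may differ):

```python
from dataclasses import dataclass, asdict
from typing import Any

@dataclass
class ChunkRecord:
    """Hypothetical stand-in for a chunk ORM row."""
    subdivision_id: int
    chunk_index: int
    text: str

def serialize_chunk_records(chunks: list[Any]) -> list[dict[str, Any]]:
    """Sketch: flatten record objects into plain dicts for bulk insertion."""
    return [asdict(c) for c in chunks]

rows = serialize_chunk_records([ChunkRecord(1, 0, "First chunk."),
                                ChunkRecord(1, 1, "Second chunk.")])
```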

class lalandre_chunking.pipeline.ArticleLevelPlan(active, skip_ids, article_content)[source]

Bases: object

Pre-computed plan for article-level chunking of an act.

Parameters:
  • active (bool)

  • skip_ids (set[int])

  • article_content (dict[int, str])

active

Whether article-level chunking applies for this act.

Type:

bool

skip_ids

IDs of paragraph subdivisions folded into their parent article.

Type:

set[int]

article_content

Mapping of article subdivision IDs to aggregated content.

Type:

dict[int, str]

lalandre_chunking.pipeline.prepare_article_level_plan(celex, subdivisions, article_level_enabled)[source]

Build an ArticleLevelPlan for the given act.

Parameters:
  • celex (str) – Normalized CELEX identifier of the act.

  • subdivisions (list[Any]) – Ordered subdivisions for the act, including child paragraphs.

  • article_level_enabled (bool) – Value of config.chunking.article_level_chunking.

Return type:

ArticleLevelPlan

lalandre_chunking.pipeline.make_article_level_chunks(*, chunker, subdivision, article_level_plan)[source]

Return one full-article chunk when article-level chunking applies.

Parameters:
  • chunker (Chunker)

  • subdivision (Any)

  • article_level_plan (ArticleLevelPlan)

Return type:

list[Any] | None

lalandre_chunking.sac_chunker

Source: packages/lalandre_chunking/lalandre_chunking/sac_chunker.py

Semantic Aware Chunking (SAC) for Legal Documents. Detects chunk boundaries by measuring embedding similarity between consecutive sentences: breaks happen where the meaning shifts the most.

class lalandre_chunking.sac_chunker.SACChunker(embedding_provider, min_chunk_size, max_chunk_size, chunk_overlap, chars_per_token=3.3, breakpoint_percentile=90.0, breakpoint_max_threshold=1.0, sentence_window_size=1, batch_size=32)[source]

Bases: Chunker

Semantic Aware Chunker.

  1. Split text into sentences.

  2. Embed each sentence (optionally using a sliding window).

  3. Compute cosine similarity between consecutive embeddings.

  4. Place breakpoints where similarity drops below a percentile threshold.

  5. Group sentences between breakpoints, then enforce min/max size constraints.

Parameters:
  • embedding_provider (EmbeddingProvider)

  • min_chunk_size (int)

  • max_chunk_size (int)

  • chunk_overlap (int)

  • chars_per_token (float)

  • breakpoint_percentile (float)

  • breakpoint_max_threshold (float)

  • sentence_window_size (int)

  • batch_size (int)

chunk_subdivision(subdivision_id, content, subdivision_type=None)[source]

Split one subdivision into semantic chunks with legal-text safeguards.

Parameters:
  • subdivision_id (int)

  • content (str)

  • subdivision_type (str | None)

Return type:

List[Chunks]
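Steps 2–5 of the SAC algorithm above can be sketched standalone with toy, pre-computed sentence embeddings (a real run would call the embedding provider, and the SACChunker internals differ; every name below is illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_groups(sentences, embeddings, breakpoint_percentile=90.0):
    """Sketch: group sentences between percentile-based semantic breakpoints."""
    # Semantic distance between consecutive sentences: 1 - cosine similarity.
    distances = [1 - cosine(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    # Percentile threshold: break only where the meaning shifts the most.
    ranked = sorted(distances)
    idx = min(int(len(ranked) * breakpoint_percentile / 100), len(ranked) - 1)
    threshold = ranked[idx]
    groups, current = [], [sentences[0]]
    for i, d in enumerate(distances):
        if d >= threshold:
            groups.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    groups.append(" ".join(current))
    return groups

sentences = ["Data is collected.", "Data is stored.", "Fines apply to breaches."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy 2-D vectors
groups = semantic_groups(sentences, embeddings)
```

The two data-protection sentences stay together because their toy embeddings are nearly parallel, while the sharp drop in similarity before the enforcement sentence crosses the threshold and opens a new group. The real chunker then applies min/max size constraints on top of these groups (step 5).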