Chunking API

Note

This page is generated automatically from the repository’s maintained Python module inventory.

Chunking interfaces, article-level planning, and semantic chunking implementation details.

lalandre_chunking

Source: packages/lalandre_chunking/lalandre_chunking/__init__.py

Chunking Service

lalandre_chunking.get_chunker(*, embedding_provider, min_chunk_size, max_chunk_size, chunk_overlap, chars_per_token=3.3, breakpoint_percentile=90.0, breakpoint_max_threshold=1.0, sentence_window_size=1, embedding_batch_size=32)[source]

Factory for the SAC chunker.

Parameters:
  • embedding_provider (EmbeddingProvider)

  • min_chunk_size (int)

  • max_chunk_size (int)

  • chunk_overlap (int)

  • chars_per_token (float)

  • breakpoint_percentile (float)

  • breakpoint_max_threshold (float)

  • sentence_window_size (int)

  • embedding_batch_size (int)

Return type:

SACChunker
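The chars_per_token ratio (default 3.3) lets the chunker translate character lengths into approximate token budgets without running a tokenizer. A minimal standalone sketch of that conversion; the estimate_tokens helper is illustrative only, not part of lalandre_chunking:

```python
def estimate_tokens(text: str, chars_per_token: float = 3.3) -> int:
    """Approximate the token count of `text` from its character length."""
    return max(1, round(len(text) / chars_per_token))

# A min/max chunk size expressed in tokens maps back to characters the same way:
paragraph = "The Commission shall adopt implementing acts." * 4
approx_tokens = estimate_tokens(paragraph)
```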

lalandre_chunking.chunker

Source: packages/lalandre_chunking/lalandre_chunking/chunker.py

Chunking base classes and helpers

class lalandre_chunking.chunker.Chunker(min_chunk_size, max_chunk_size, chunk_overlap, chars_per_token=_DEFAULT_CHARS_PER_TOKEN)[source]

Bases: ABC

Base chunker interface.

Parameters:
  • min_chunk_size (int)

  • max_chunk_size (int)

  • chunk_overlap (int)

  • chars_per_token (float)

abstractmethod chunk_subdivision(subdivision_id, content, subdivision_type=None)[source]

Chunk a subdivision’s content.

Parameters:
  • subdivision_id (int)

  • content (str)

  • subdivision_type (str | None)

Return type:

List[Chunks]
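Concrete chunkers subclass Chunker and implement chunk_subdivision. A simplified standalone sketch of that pattern, using a plain dataclass as a stand-in for the Chunks model and a naive fixed-size strategy (min_chunk_size enforcement omitted for brevity):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Chunk:
    """Stand-in for the Chunks model; fields are illustrative only."""
    subdivision_id: int
    text: str

class Chunker(ABC):
    """Mirrors the documented base interface in simplified form."""
    def __init__(self, min_chunk_size: int, max_chunk_size: int,
                 chunk_overlap: int, chars_per_token: float = 3.3):
        self.min_chunk_size = min_chunk_size
        self.max_chunk_size = max_chunk_size
        self.chunk_overlap = chunk_overlap
        self.chars_per_token = chars_per_token

    @abstractmethod
    def chunk_subdivision(self, subdivision_id: int, content: str,
                          subdivision_type: Optional[str] = None) -> List[Chunk]:
        ...

class FixedSizeChunker(Chunker):
    """Naive implementation: split on a character window with overlap."""
    def chunk_subdivision(self, subdivision_id, content, subdivision_type=None):
        # Token budgets become character budgets via chars_per_token.
        window = int(self.max_chunk_size * self.chars_per_token)
        step = window - int(self.chunk_overlap * self.chars_per_token)
        return [Chunk(subdivision_id, content[i:i + window])
                for i in range(0, len(content), max(step, 1))]

chunker = FixedSizeChunker(min_chunk_size=10, max_chunk_size=100, chunk_overlap=10)
parts = chunker.chunk_subdivision(1, "x" * 1000)
```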

make_single_chunk(subdivision_id, content, metadata=None)[source]

Create a single chunk encompassing all content (no splitting).

Parameters:
  • subdivision_id (int)

  • content (str)

  • metadata (Dict[str, Any] | None)

Return type:

Chunks

lalandre_chunking.pipeline

Source: packages/lalandre_chunking/lalandre_chunking/pipeline.py

Keeps Postgres / Qdrant chunk artefacts in sync and provides a canonical dict representation of chunk records.

lalandre_chunking.pipeline.serialize_chunk_records(chunks)[source]

Convert chunk ORM objects to plain dicts for insert_chunk_records.

Parameters:
  • chunks (list[Any])

Return type:

list[dict[str, Any]]
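The serialization step can be sketched standalone. The ChunkRecord dataclass below is a hypothetical stand-in for a chunk ORM row (real column names may differ):

```python
from dataclasses import dataclass, asdict
from typing import Any

@dataclass
class ChunkRecord:
    """Hypothetical stand-in for a chunk ORM row."""
    subdivision_id: int
    chunk_index: int
    text: str

def serialize_chunk_records(chunks: list[Any]) -> list[dict[str, Any]]:
    """Sketch: flatten record objects into plain dicts for bulk insertion."""
    return [asdict(c) for c in chunks]

rows = serialize_chunk_records([ChunkRecord(1, 0, "First chunk."),
                                ChunkRecord(1, 1, "Second chunk.")])
```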

class lalandre_chunking.pipeline.ArticleLevelPlan(active, skip_ids, article_content)[source]

Bases: object

Pre-computed plan for article-level chunking of an act.

Parameters:
  • active (bool)

  • skip_ids (set[int])

  • article_content (dict[int, str])

active

Whether article-level chunking applies for this act.

Type:

bool

skip_ids

IDs of paragraph subdivisions folded into their parent article.

Type:

set[int]

article_content

Mapping of article subdivision IDs to aggregated content.

Type:

dict[int, str]

lalandre_chunking.pipeline.prepare_article_level_plan(celex, subdivisions, article_level_enabled)[source]

Build an ArticleLevelPlan for the given act.

Parameters:
  • celex (str) – Normalized CELEX identifier of the act.

  • subdivisions (list[Any]) – Ordered subdivisions for the act, including child paragraphs.

  • article_level_enabled (bool) – Value of config.chunking.article_level_chunking.

Return type:

ArticleLevelPlan

lalandre_chunking.pipeline.make_article_level_chunks(*, chunker, subdivision, article_level_plan)[source]

Return one full-article chunk when article-level chunking applies.

Parameters:
  • chunker (Chunker)

  • subdivision (Any)

  • article_level_plan (ArticleLevelPlan)

Return type:

list[Any] | None

lalandre_chunking.sac_chunker

Source: packages/lalandre_chunking/lalandre_chunking/sac_chunker.py

Semantic Aware Chunking (SAC) for Legal Documents. Detects chunk boundaries by measuring embedding similarity between consecutive sentences: breaks happen where the meaning shifts the most.

class lalandre_chunking.sac_chunker.SACChunker(embedding_provider, min_chunk_size, max_chunk_size, chunk_overlap, chars_per_token=3.3, breakpoint_percentile=90.0, breakpoint_max_threshold=1.0, sentence_window_size=1, batch_size=32)[source]

Bases: Chunker

Semantic Aware Chunker.

  1. Split text into sentences.

  2. Embed each sentence (optionally using a sliding window).

  3. Compute cosine similarity between consecutive embeddings.

  4. Place breakpoints where similarity drops below a percentile threshold.

  5. Group sentences between breakpoints, then enforce min/max size constraints.

Parameters:
  • embedding_provider (EmbeddingProvider)

  • min_chunk_size (int)

  • max_chunk_size (int)

  • chunk_overlap (int)

  • chars_per_token (float)

  • breakpoint_percentile (float)

  • breakpoint_max_threshold (float)

  • sentence_window_size (int)

  • batch_size (int)

chunk_subdivision(subdivision_id, content, subdivision_type=None)[source]

Split one subdivision into semantic chunks with legal-text safeguards.

Parameters:
  • subdivision_id (int)

  • content (str)

  • subdivision_type (str | None)

Return type:

List[Chunks]
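Steps 2–5 of the SAC algorithm above can be sketched standalone with toy, pre-computed sentence embeddings (a real run would call the embedding provider, and the SACChunker internals differ; every name below is illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_groups(sentences, embeddings, breakpoint_percentile=90.0):
    """Sketch: group sentences between percentile-based semantic breakpoints."""
    # Semantic distance between consecutive sentences: 1 - cosine similarity.
    distances = [1 - cosine(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    # Percentile threshold: break only where the meaning shifts the most.
    ranked = sorted(distances)
    idx = min(int(len(ranked) * breakpoint_percentile / 100), len(ranked) - 1)
    threshold = ranked[idx]
    groups, current = [], [sentences[0]]
    for i, d in enumerate(distances):
        if d >= threshold:
            groups.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    groups.append(" ".join(current))
    return groups

sentences = ["Data is collected.", "Data is stored.", "Fines apply to breaches."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy 2-D vectors
groups = semantic_groups(sentences, embeddings)
```

The two data-protection sentences stay together because their toy embeddings are nearly parallel, while the sharp drop in similarity before the enforcement sentence crosses the threshold and opens a new group. The real chunker then applies min/max size constraints on top of these groups (step 5).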