elastic ✦ jina.ai

Search Relevance · Embeddings

Late Chunking in Elasticsearch
A tool, not an upgrade

A measured look at when document-level context improves retrieval — and when it quietly makes it worse. Backed by real numbers, reproducible in Elastic.

Overview infographic: late chunking helps vs hurts across two corpora

The question

You chunk your documents. Then what?

Every RAG and semantic-search pipeline splits documents into chunks, then embeds each chunk into a vector. The conventional approach embeds each chunk in isolation.

A newer technique — late chunking — embeds the whole document first, then derives per-chunk vectors from that shared context.

The promise

Each chunk's vector "remembers" the surrounding document — pronouns, defined terms, and references resolve correctly.

The catch

It is widely presented as a universal upgrade. Our experiment shows it isn't. On cross-referenced filings, MRR rose from 0.40 to 0.79; on self-contained FAQ entries, it fell from 1.00 to 0.59.

Late Chunking in Elasticsearch: Introduced by Jina.ai, 2024

The concept

Two ways to turn chunks into vectors

Standard chunking

Embed each chunk alone

The model sees one chunk at a time. The vector reflects only that chunk's text — clean and self-contained.

Best when chunks already stand on their own.

Late chunking

Embed the document, then split

The model runs over the full document first. Attention lets every chunk absorb context from its neighbors before pooling.

Best when chunks depend on what surrounds them.

Standard vs. late chunking: Same model · different pooling moment
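The difference in pooling moment can be sketched with a toy model. This is an illustration under stated assumptions, not Jina's implementation: `toy_contextualize` is a hypothetical stand-in for a transformer pass that simply blends each token vector with the mean of everything the model can see, and the two-dimensional token vectors are made up.

```python
from statistics import fmean

Vector = list[float]

def mean_pool(vectors: list[Vector]) -> Vector:
    """Average token vectors into a single vector."""
    return [fmean(dim) for dim in zip(*vectors)]

def toy_contextualize(tokens: list[Vector], mix: float = 0.5) -> list[Vector]:
    """Hypothetical stand-in for attention: blend each token with the
    mean of every token in the sequence the model was given."""
    context = mean_pool(tokens)
    return [[(1 - mix) * t + mix * c for t, c in zip(tok, context)] for tok in tokens]

def standard_chunking(chunks: list[list[Vector]]) -> list[Vector]:
    # The model sees one chunk at a time, then pools it.
    return [mean_pool(toy_contextualize(chunk)) for chunk in chunks]

def late_chunking(chunks: list[list[Vector]]) -> list[Vector]:
    # The model sees the whole document, then pools per chunk span.
    document = [tok for chunk in chunks for tok in chunk]
    contextual = toy_contextualize(document)
    pooled, start = [], 0
    for chunk in chunks:
        end = start + len(chunk)
        pooled.append(mean_pool(contextual[start:end]))
        start = end
    return pooled

doc = [
    [[1.0, 0.0], [1.0, 0.0]],  # chunk 0: two tokens on one topic
    [[0.0, 1.0], [0.0, 1.0]],  # chunk 1: two tokens on another topic
]
standard_vectors = standard_chunking(doc)  # each chunk keeps only its own signal
late_vectors = late_chunking(doc)          # each chunk absorbs some of its neighbor
```

In this toy, the standard vectors stay on their own axes while the late vectors each pick up part of the other chunk. Whether that shared context helps or hurts depends on whether the neighbor is relevant.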

Our thesis

Late chunking is a specialized tool, not a universal upgrade.
We show both sides with measured retrieval numbers: degradation on self-contained content, improvement on cross-referenced content.

Two purpose-built corpora · One Elasticsearch deployment · One honest comparison

How it's built

End-to-end on Elastic — reproducible in a notebook

Architecture: Notebook · Terraform · Jina v3 · Elastic Serverless · _rank_eval

Step 1 · Provision

Elastic Serverless, stood up by Terraform

Infrastructure as code — one terraform apply spins up a Serverless Elasticsearch project.
Secrets stay private — Elastic & Jina API keys live in an un-committed tfvars file.
Clean handoff — outputs land in a .env file, loaded by Python at runtime.
Fully tear-down-able — terraform destroy removes everything when the demo ends.

Step 1 · Create the environment: Elastic Cloud Serverless · GCP
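The .env handoff can be sketched with the standard library alone. The variable names below (`ELASTIC_URL`, `ELASTIC_API_KEY`) are hypothetical; the real keys are whatever the Terraform outputs write. The notebook may well use a library such as python-dotenv instead; this is only a minimal stand-in.

```python
import os

def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def load_env(path: str = ".env") -> dict[str, str]:
    """Read the file and export its values without overriding existing variables."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values

# Hypothetical file contents, for illustration only.
sample = 'ELASTIC_URL="https://example.invalid"\n# comment line\nELASTIC_API_KEY=abc123\n'
config = parse_env(sample)
```

Using `setdefault` means a value already present in the shell wins over the file, which keeps ad hoc overrides possible during a demo.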

Step 2 · Embeddings

One model, one flag — native in Elasticsearch

Elastic inference endpoints call Jina's jina-embeddings-v3 model directly. The standard and late-chunking endpoints differ by a single parameter.

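# The late-chunking endpoint. Assumes `es` is an authenticated Elasticsearch
# client. The jinaai service also expects an api_key in service_settings,
# which the slide omits.
es.inference.put(
    task_type="text_embedding",
    inference_id="jina-embeddings-v3-late",
    body={
        "service": "jinaai",
        "service_settings": {"model_id": "jina-embeddings-v3"},
        "task_settings": {
            "late_chunking": True,  # the one flag that differs from the standard endpoint
            "input_type": "search",
        },
    },
)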

Two Elastic inference endpoints — standard and late_chunking:True — both backed by Jina embeddings v3

Step 2 · Inference endpoints: jina-embeddings-v3 · late_chunking flag
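To make the "one flag" claim concrete, the sketch below builds both endpoint bodies from one helper and checks what differs between them. The id `jina-embeddings-v3-standard` is hypothetical (the source names only the late endpoint), and the standard endpoint might equally omit `late_chunking` rather than set it to `False`.

```python
def jina_endpoint_config(late_chunking: bool) -> dict:
    """Build the inference endpoint body; only late_chunking varies."""
    return {
        "service": "jinaai",
        "service_settings": {"model_id": "jina-embeddings-v3"},
        "task_settings": {"late_chunking": late_chunking, "input_type": "search"},
    }

# Hypothetical standard id; the late id is the one shown in the source.
endpoints = {
    "jina-embeddings-v3-standard": jina_endpoint_config(late_chunking=False),
    "jina-embeddings-v3-late": jina_endpoint_config(late_chunking=True),
}

def diff_keys(a: dict, b: dict, prefix: str = "") -> list[str]:
    """List dotted paths whose values differ between two nested dicts."""
    paths = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            paths.extend(diff_keys(left, right, prefix=f"{path}."))
        elif left != right:
            paths.append(path)
    return paths

changed = diff_keys(*endpoints.values())
```

Each body could then be registered with a call like the one shown above, one per inference id.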

Step 3 · The data

Two corpora, chosen to disagree

Corpus A — Product FAQ · 20 entries

Self-contained Q&A pairs. Neighbors are unrelated. Late chunking should hurt.

Corpus B — SEC 10-K Filings · 18 sections, 3 companies

Dense cross-references & defined terms ("the Company", "the Credit Facility"). Late chunking should help.

Four indices, identical mappings — only the embedding endpoint differs. Filing late-vectors are grouped per company so context never bleeds across filings.

Product FAQ and SEC 10-K documents loaded into faq_standard, faq_late, filing_standard, filing_late indices

Step 3 · Index the corpora: 4 Elastic indices · semantic_text + dense_vector
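The per-company grouping might look like the sketch below. The company names, section ids, and record shape are placeholders. It also assumes that the late-chunking endpoint treats the inputs of one embedding request as a single shared context, so that one request per company keeps filings apart.

```python
from collections import defaultdict

# Placeholder records; real sections come from the 10-K corpus.
sections = [
    {"company": "AlphaCo", "section_id": "alpha-risk", "text": "Risk factors text"},
    {"company": "BetaCo", "section_id": "beta-risk", "text": "Risk factors text"},
    {"company": "AlphaCo", "section_id": "alpha-debt", "text": "Credit facility text"},
]

def group_by_company(records: list[dict]) -> dict[str, list[dict]]:
    """Bucket sections by company, preserving their original order."""
    groups = defaultdict(list)
    for record in records:
        groups[record["company"]].append(record)
    return dict(groups)

batches = group_by_company(sections)

# One embedding request per company: inputs from different filings never share a batch.
requests_to_send = [
    {"company": company, "inputs": [s["text"] for s in group]}
    for company, group in batches.items()
]
```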

Step 4 · Measure

Objective scoring with _rank_eval

Elasticsearch's built-in _rank_eval API scores both strategies against a labeled judgment list — no external tooling.

MRR — how high the first correct answer ranks.
NDCG — quality of the whole top-5 ordering.
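For readers who want the two metrics spelled out, here is a small reference sketch. It is not Elasticsearch's code. The NDCG version uses the exponential gain `(2^rating - 1) / log2(rank + 1)`, which is assumed here to match the built-in metric; treat the function names and the sample ids as illustrative.

```python
import math

def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """1 / rank of the first relevant hit within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average reciprocal rank over (ranking, relevant ids) pairs."""
    return sum(reciprocal_rank(ranked, rel, k) for ranked, rel in runs) / len(runs)

def dcg(ratings: list[int]) -> float:
    """Discounted cumulative gain with exponential gain."""
    return sum((2 ** r - 1) / math.log2(rank + 1) for rank, r in enumerate(ratings, start=1))

def ndcg(ranked_ids: list[str], ratings_by_id: dict[str, int], k: int = 5) -> float:
    """DCG of the returned order divided by DCG of the ideal order."""
    gains = [ratings_by_id.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(ratings_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0

# Two toy queries: the right answer ranks first for one, second for the other.
runs = [(["a", "b", "c"], {"a"}), (["x", "y", "z"], {"y"})]
mrr = mean_reciprocal_rank(runs)            # (1 + 1/2) / 2
second_place = ndcg(["b", "a"], {"a": 1})   # relevant doc at rank 2
first_place = ndcg(["a", "b"], {"a": 1})    # relevant doc at rank 1
```

MRR only cares about the first correct hit, so a single misplaced answer halves that query's score; NDCG discounts more gently and rewards getting the rest of the top 5 right as well.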

Standard indices are searched with a semantic query. For late indices, the query text is embedded first and the resulting vector is used in a knn search. Both are evaluated at k = 5.

Step 4 · Retrieval & measurement: MRR · NDCG · k=5
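A _rank_eval request body for each strategy could be assembled roughly as follows. The field names `content` and `embedding`, the judgment record shape, the document id, and `fake_embed` are all hypothetical; in the real pipeline the query vector would come from the late-chunking inference endpoint. Nothing here touches a cluster; it only builds the dictionaries.

```python
from typing import Callable

def semantic_query(text: str) -> dict:
    # "content" is a hypothetical semantic_text field name.
    return {"semantic": {"field": "content", "query": text}}

def make_knn_query(embed: Callable[[str], list[float]]) -> Callable[[str], dict]:
    """Return a query builder that embeds the text, then searches by vector."""
    def build(text: str) -> dict:
        # "embedding" is a hypothetical dense_vector field name.
        return {"knn": {"field": "embedding", "query_vector": embed(text)}}
    return build

def rank_eval_body(judgments: list[dict], index: str,
                   build_query: Callable[[str], dict], metric: dict) -> dict:
    """Assemble the requests/ratings/metric structure that _rank_eval expects."""
    return {
        "requests": [
            {
                "id": j["id"],
                "request": {"query": build_query(j["query"])},
                "ratings": [
                    {"_index": index, "_id": doc_id, "rating": rating}
                    for doc_id, rating in j["ratings"].items()
                ],
            }
            for j in judgments
        ],
        "metric": metric,
    }

MRR_AT_5 = {"mean_reciprocal_rank": {"k": 5, "relevant_rating_threshold": 1}}
NDCG_AT_5 = {"dcg": {"k": 5, "normalize": True}}

# Hypothetical judgment list entry.
judgments = [{"id": "q1", "query": "How long is the warranty?", "ratings": {"faq-07": 1}}]

def fake_embed(text: str) -> list[float]:
    """Placeholder for a call to the late-chunking inference endpoint."""
    return [0.1, 0.2, 0.3]

standard_body = rank_eval_body(judgments, "faq_standard", semantic_query, MRR_AT_5)
late_body = rank_eval_body(judgments, "faq_late", make_knn_query(fake_embed), NDCG_AT_5)
```

Keeping the judgment list identical across both bodies is what makes the comparison fair: only the index and the query builder change.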

Result · Corpus A

Product FAQ — late chunking hurts

Metric	Standard	Late	Delta	Winner
MRR	1.0000	0.5857	−0.4143	Standard
NDCG	0.9868	0.6161	−0.3707	Standard

▼ ~40 pts Standard embedding is near-perfect; late chunking drags it down.

Why?

FAQ entries are atomic. When the whole document is embedded together, unrelated neighbors bleed in — a battery question contaminates a warranty question.

The clean, isolated signal that makes standard embedding perfect here gets diluted into noise.

Corpus A · Product FAQ: Self-contained content → standard wins
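The dilution effect can be illustrated with made-up vectors. This is a toy, not a measurement: the three one-hot "topic" vectors and the 50/50 `blend` weight are assumptions chosen only to show the direction of the effect on cosine similarity.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def blend(chunk: list[float], neighbors: list[list[float]], weight: float) -> list[float]:
    """Mix a chunk vector with the mean of its neighbors (toy shared context)."""
    context = [sum(dim) / len(neighbors) for dim in zip(*neighbors)]
    return [(1 - weight) * c + weight * n for c, n in zip(chunk, context)]

# Made-up topic vectors for three unrelated FAQ entries.
warranty = [1.0, 0.0, 0.0]
battery = [0.0, 1.0, 0.0]
shipping = [0.0, 0.0, 1.0]
query = warranty  # a question about the warranty

right_isolated = cosine(query, warranty)
wrong_isolated = cosine(query, battery)

right_blended = cosine(query, blend(warranty, [battery, shipping], weight=0.5))
wrong_blended = cosine(query, blend(battery, [warranty, shipping], weight=0.5))

# Gap between the correct and an incorrect entry, before and after blending.
margin_isolated = right_isolated - wrong_isolated
margin_blended = right_blended - wrong_blended
```

In this toy the correct entry scores lower once neighbors are mixed in and the unrelated entry scores higher, so the margin that separates them shrinks.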

Result · Corpus B

SEC 10-K filings — late chunking helps

Metric	Standard	Late	Delta	Winner
MRR	0.4028	0.7917	+0.3889	Late
NDCG	0.5539	0.8462	+0.2923	Late

▲ ~40 pts MRR nearly doubles — late chunking transforms a weak result into a strong one.

Why?

10-K sections aren't self-contained. Defined terms are set once and referenced throughout. In isolation, the meaning is lost.

Late chunking lets each section's vector resolve those cross-references against the full filing.

Corpus B · SEC 10-K filings: Cross-referenced content → late wins
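The deltas in both result tables are simply late minus standard. The short check below recomputes them from the reported scores; the dictionary layout and helper name are just one convenient way to arrange the numbers.

```python
# (standard, late) scores as reported in the two result tables.
results = {
    "Product FAQ": {"MRR": (1.0000, 0.5857), "NDCG": (0.9868, 0.6161)},
    "SEC 10-K": {"MRR": (0.4028, 0.7917), "NDCG": (0.5539, 0.8462)},
}

def delta(standard: float, late: float) -> float:
    """Late minus standard, rounded to the tables' four decimals."""
    return round(late - standard, 4)

deltas = {
    corpus: {metric: delta(*scores) for metric, scores in metrics.items()}
    for corpus, metrics in results.items()
}
winners = {
    corpus: {metric: "Late" if d > 0 else "Standard" for metric, d in per_metric.items()}
    for corpus, per_metric in deltas.items()
}

# "MRR nearly doubles" on the filings corpus: ratio of late to standard.
filing_standard_mrr, filing_late_mrr = results["SEC 10-K"]["MRR"]
mrr_ratio = filing_late_mrr / filing_standard_mrr
```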

The takeaway

One question decides the strategy

Does a chunk make complete sense on its own?

Yes — self-contained

Each chunk stands alone.

→ Use standard chunking

Product catalogs & FAQs
News articles, reviews
Independent records

No — context-dependent

Chunks lean on what surrounds them.

→ Use late chunking

Financial filings & contracts
Technical manuals
Narrative / cross-referenced prose

The decision rule: Match the strategy to the content

Why it matters for you

This runs in Elasticsearch — today

1. No new infrastructure. Both strategies are native Elastic inference endpoints: flip one flag, not a new platform.
2. Mix strategies per index. Route FAQ content to standard and filings to late, within one Elasticsearch deployment.
3. Prove it on your own data. _rank_eval gives you objective MRR / NDCG before you commit a strategy to production.
4. Open & reproducible. Terraform plus a notebook: stand it up, measure, and tear it down in minutes.

Why Elastic: Late chunking is a setting, not a migration
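Per-index routing could be as small as a lookup. The content-type labels and the `<type>_<strategy>` naming below are assumptions modelled on the demo's index names (faq_standard, filing_late); unknown content types return `None` as a reminder to measure before choosing.

```python
from typing import Optional

# Hypothetical content-type labels, following the decision rule above.
SELF_CONTAINED = {"faq", "product_catalog", "news", "review"}
CONTEXT_DEPENDENT = {"filing", "contract", "manual"}

def choose_strategy(content_type: str) -> Optional[str]:
    """Return "standard", "late", or None when the type is not yet classified."""
    if content_type in SELF_CONTAINED:
        return "standard"
    if content_type in CONTEXT_DEPENDENT:
        return "late"
    return None  # unknown: evaluate both strategies first

def target_index(content_type: str) -> Optional[str]:
    """Map a content type to an index name such as faq_standard or filing_late."""
    strategy = choose_strategy(content_type)
    return f"{content_type}_{strategy}" if strategy else None

routes = {kind: target_index(kind) for kind in ("faq", "filing", "chat_log")}
```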

Let's measure it on your corpus

Bring a sample of your documents. In a short working session we'll stand up Elastic Serverless, embed both ways, and let _rank_eval tell us which strategy wins for your content.

Run the demo: github.com/joeywhelan/late-chunking

Elastic inference endpoints: elastic.co · semantic_text & dense_vector

Late chunking origin: jina.ai · 2024 research blog
