Many Modalities,One Embedding Space

Jina v5 Omni on Elastic — multimodal and multilingual retrieval in a single index

The challenge

Search Stacks Multiply With Every Modality

Modern catalogs and customer signals arrive in text, photos, voice, and video.

Shoppers type, upload photos, record voice notes, and watch product video.
Classic stacks need a separate model and pipeline per modality — and another for cross-language coverage.
Integration, latency, embedding drift, GPU cost, and operational toil compound as modalities accumulate.
By the time you're shipping, the model landscape has moved again.

What if one embedding space handled all of it — inside your existing Elasticsearch query path?

Capability

Introducing Jina v5 Omni

A single model that embeds text, images, audio, and video into one vector space — across roughly 100 languages.

Text Image Audio Video ~100 languages 1024-dim (small)

Available today via the Elastic Inference Service (EIS)

Managed endpoint inside your Elastic project — no model hosting, no GPU provisioning on your side. Inference IDs: .jina-embeddings-v5-omni-small and .jina-embeddings-v5-omni-nano

Why it matters

One Embedding Space, One Query Path

The traditional stack

Text encoder + image encoder + speech model + video pipeline
Separate indices and ranking strategies per modality
Separate translation tier for non-English queries
Drift across models when any one is upgraded

With v5 Omni on Elastic

One model, one inference endpoint, four modalities
Standard semantic_text fields, standard query DSL
Cross-lingual coverage out of the box
Reuses your existing aggregations, filters, RBAC, snapshots

Demo architecture

Notebook-Driven, Fully Managed

Architecture: Jupyter notebook driving Terraform and Elasticsearch on Elastic Cloud Serverless

Jupyter drives Terraform → Elastic Cloud Serverless. EIS hosts the model. Ingest and query flow through standard Elasticsearch APIs.

The dataset

Small Synthetic Ecommerce Catalog

Six products (three headphones, three sneakers) and six reviews. Each product carries text, image, and optional video. Each review carries text, audio, and an optional evidence image.

Auralite X1 — Black

BassCore Pro Max

StrideFlex — Black

StrideFlex — White

Indexed across two indices: products and reviews.

Index design

One Inference ID, Many Fields

Each modality gets its own semantic_text field, all wired to the same EIS-managed inference endpoint. The mapping is strict; media bytes are sent as base64 data URIs.

"properties": {
  "description":          { "type": "text", "copy_to": "description_semantic" },
  "description_semantic": { "type": "semantic_text",
                            "inference_id": ".jina-embeddings-v5-omni-small" },
  "image_semantic":       { "type": "semantic_text",
                            "inference_id": ".jina-embeddings-v5-omni-small" },
  "video_semantic":       { "type": "semantic_text",
                            "inference_id": ".jina-embeddings-v5-omni-small" }
}

No bespoke ingest pipeline. EIS handles the embedding at index time and at query time.

Scenario 1

Text Query → All Review Modalities

One text query fans out to text_semantic, image_semantic, and audio_semantic.
Same embedding ranks each field independently — no transcription, no captioning.
Demonstrates the shared embedding space directly.

Text query: battery swelling and overheating

--- Review text ---
  0.878 r1 (hp_001): battery started swelling…
--- Review image ---
  0.609 r1 (hp_001): battery started swelling…
--- Review audio ---
  0.619 r2 (hp_001): amazing sound quality…

Scenario 1: text query fanning out to text, image, and audio review fields

Scenario 2

Audio Complaint → Review Text

Customer voice clip sent as a base64 data URI directly to text_semantic.
Model embeds the spoken complaint into the shared space — no Whisper sidecar, no transcription step.
Top results are all headphone reviews — the model correctly clusters the audio with related text.

Audio query: audio_complaint.mp3

  0.706 Auralite X1 (hp_001) — r2: sound quality…
  0.665 BassCore Pro Max (hp_003) — r3
  0.632 Auralite X1 (hp_001) — r1: battery defect

Scenario 2: audio complaint queried against the review text_semantic field

Scenario 3

Multilingual Text → English Reviews

Queries in Spanish, French, German match English review text directly — no translation tier.
One embedding space spans roughly 100 languages.
Cross-lingual coverage without paying for a separate translation pipeline.

Spanish: auriculares con problema de batería que se hincha
  0.869 r1 (hp_001): battery started swelling…

French: la couleur des chaussures s'est décolorée…
  0.851 r4 (sn_001): color faded after a few runs

German: großartiger Klang und tiefer Bass
  0.846 r2 (hp_001): deep bass, sound quality

Scenario 3: multilingual queries against English review text

Operations

What Your Platform Team Inherits

Elastic Inference Service hosts the model. No GPU procurement, no inference container ops, no model rollouts on your side.
Single semantic_text query pattern for every modality — your existing search code learns one new field type.
Standard Elastic stack still applies: filters, aggregations, vector + BM25 hybrid, RBAC, snapshots, ILM, Kibana observability.
Terraform-managed projects — the same code path from POC to staging to production.
Multi-cloud, serverless-first — sized to demand, paid by usage, no clusters to babysit.

Business outcome

Why Customers Choose This Path

Time-to-market

Stand up a multimodal search prototype in days, not quarters. The same notebook in this demo is the starting point.

Lower TCO

One embedding stack and one inference contract replace separate per-modality model pipelines.

Wider market reach

~100-language coverage without translation infrastructure — open up regional markets without rebuilding search.

Future-proofing

Model evolution managed by Elastic and Jina. Swap the inference ID, keep the index and query code.

Next steps

Resources & Q&A

Everything in this deck is reproducible. The full demo runs end-to-end from one notebook.

Demo repository

github.com/joeywhelan/omnishop

Companion article

Many Modalities, One Embedding Space — full walkthrough

Elastic Inference Service

elastic.co/docs/explore-analyze/elastic-inference/eis

Elastic Cloud Serverless

elastic.co/cloud/serverless

Let's talk about the modalities in your search workload.