Video semantic search is unlocking new value across industries. The demand for video-first experiences is reshaping how organizations deliver content, and customers expect fast, accurate access to specific moments within video. For example, sports broadcasters need to surface the exact moment a player scored to deliver highlight clips to fans instantly. Studios want to find every scene featuring a particular actor across thousands of hours of archived content to create custom trailers and promotional content. News organizations need to retrieve footage by mood, location, or event to publish breaking stories faster than competitors. The goal is the same: deliver video content to end users quickly, capture the moment, and monetize the experience.
Video is naturally more complex than other modalities like text or image because it combines multiple unstructured signals: the visual scene unfolding on screen, the ambient audio and sound effects, the spoken dialogue, the temporal information, and the structured metadata describing the asset. A user searching for "a tense car chase with sirens" is asking about a visual event and an audio event at the same time. A user searching for a specific athlete by name may be looking for someone who appears prominently on screen but is never spoken aloud.
The dominant approach today grounds all video signals into text, whether through transcription, manual tagging, or automated captioning, and then applies text embeddings for search. While this works for dialogue-heavy content, converting video to text inevitably loses critical information. Temporal understanding disappears, and transcription errors emerge from visual and audio quality issues. What if you had a model that could process all modalities and directly map them into a single searchable representation without losing detail? Amazon Nova Multimodal Embeddings is a unified embedding model that natively processes text, documents, images, video, and audio into a shared semantic vector space. It delivers leading retrieval accuracy and cost efficiency.
In this post, we show you how to build a video semantic search solution on Amazon Bedrock using Nova Multimodal Embeddings that intelligently understands user intent and retrieves accurate video results across all signal types simultaneously. We also share a reference implementation you can deploy and explore with your own content.
Figure 1: Example screenshot from the final search solution
Solution overview
We built our solution on Nova Multimodal Embeddings combined with an intelligent hybrid search architecture that fuses semantic and lexical signals across all video modalities. Lexical search matches exact keywords and phrases, while semantic search understands meaning and context. We will explain our choice of this hybrid approach and its performance benefits in later sections.
Figure 2: End-to-end solution architecture
The architecture consists of two stages: an ingestion pipeline (steps 1-6) that processes video into searchable embeddings, and a search pipeline (steps 7-10) that routes user queries intelligently across these representations and merges results into a ranked list. Here are details for each of the steps:
- Upload – Videos uploaded via browser are stored in Amazon Simple Storage Service (Amazon S3), triggering the Orchestrator AWS Lambda to update Amazon DynamoDB status and start the AWS Step Functions pipeline
- Shot segmentation – AWS Fargate uses FFmpeg scene detection to split video into semantically coherent segments
- Parallel processing – Three concurrent branches process each segment:
- Embeddings: Nova Multimodal Embeddings generates 1024-dimensional vectors for visual and audio, stored in Amazon S3 Vectors
- Transcription: Amazon Transcribe converts speech to text, aligned to segments. Amazon Nova Multimodal Embeddings generates text embeddings stored in Amazon S3 Vectors
- Celebrity detection: Amazon Rekognition identifies known individuals, mapped to segments by timestamp
- Caption & genre generation – Amazon Nova 2 Lite synthesizes segment-level captions and genre labels from visual content and transcripts
- Merge – AWS Lambda assembles all metadata (captions, transcripts, celebrities, genre) and retrieves embeddings from Amazon S3 Vectors
- Index – Full segment documents with metadata and vectors are bulk-indexed into Amazon OpenSearch Service
- Authentication – Users authenticate via Amazon Cognito and access the front end through Amazon CloudFront
- Query processing – Amazon API Gateway routes requests to the Search Lambda, which executes two parallel operations: intent analysis and query embedding
- Intent analysis – Amazon Bedrock (using Anthropic Claude Haiku) assigns relevance weights (0.0-1.0) across visual, audio, transcription, and metadata modalities
- Query embedding – Nova Multimodal Embeddings embeds the query three times for visual, audio, and transcription similarity search
This flexible architecture addresses four key design decisions that most video search systems overlook: maintaining temporal context, handling multimodal queries, scaling across massive content libraries, and optimizing retrieval accuracy. A complete reference implementation is available on GitHub, and we encourage you to follow along with the walkthrough below to see how each decision contributes to accurate, scalable search across all signal types.
Segmentation for context continuity
Before generating any embeddings, you need to divide your video into searchable units, and the boundaries you draw have a direct impact on search accuracy. Each segment becomes the atomic unit of retrieval. If a segment is too short, it loses the surrounding context that gives a moment its meaning. If it is too long, it fuses multiple topics or scenes together, diluting relevance and making it harder for the search system to surface the right moment. For simplicity, you can start with fixed-length chunks. Nova Multimodal Embeddings supports up to 30 seconds per embedding, giving you flexibility to capture full scenes. However, be aware that fixed boundaries may arbitrarily truncate a scene mid-action or split a sentence mid-thought, disrupting the semantic meaning that makes a moment retrievable, as shown in the following figure.
Figure 3: Video segmentation strategies
The goal is semantic continuity: each segment should represent a coherent unit of meaning rather than an arbitrary slice of time. Fixed 10-second blocks are easy to produce, but they ignore the natural structure of the content. A scene change mid-segment splits a visual idea across two chunks, degrading both retrieval precision and embedding quality.
To solve this, we use FFmpeg's scene detection to identify where the visual content actually changes. FFmpeg is an open source multimedia framework widely used for video processing, format conversion, and analysis. The _detect_scenes function that follows runs ffprobe (FFmpeg's companion tool for media inspection) against the video and returns a list of timestamps, each marking a scene boundary:
import subprocess

SCENE_THRESHOLD = 0.3  # scene-change sensitivity; tune for your content

def _detect_scenes(video_path):
    result = subprocess.run(
        ['ffprobe', '-v', 'quiet', '-show_entries', 'frame=pts_time', '-of', 'csv=p=0',
         '-f', 'lavfi', f"movie={video_path},select='gt(scene,{SCENE_THRESHOLD})'"],
        capture_output=True, text=True
    )
    # One timestamp per detected scene change, in seconds
    return [float(t) for t in result.stdout.split() if t]
The output is a simple list of timestamps like 12.345, 28.901, 45.678, each marking a natural boundary where the scene shifts.
With these boundaries in hand, the segmentation algorithm snaps each cut to the nearest scene change within an acceptable window, targeting around 10 seconds with a minimum of 5 seconds and a maximum of 15 seconds from the current start. If no scene changes fall in that range, it falls back to a hard cut at the target duration. The result is a set of segments that feel natural: 8.3s, 11.1s, 9.8s, 12.4s, 7.6s, each aligned to a real scene boundary rather than a fixed ticker.
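The snapping logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact reference implementation; the constants and function name are ours, derived from the durations given in the text:

```python
# Assumed window constants from the text: target ~10s, min 5s, max 15s
MIN_SEC, TARGET_SEC, MAX_SEC = 5.0, 10.0, 15.0

def build_segments(scene_times, video_duration):
    """Snap each cut to the scene change nearest the ~10s target,
    searching within [start + MIN_SEC, start + MAX_SEC]."""
    segments, start = [], 0.0
    while start < video_duration:
        window = [t for t in scene_times
                  if start + MIN_SEC <= t <= start + MAX_SEC]
        if window:
            # Prefer the scene change closest to the target duration
            end = min(window, key=lambda t: abs(t - (start + TARGET_SEC)))
        else:
            # No scene change in range: hard cut at the target duration
            end = min(start + TARGET_SEC, video_duration)
        segments.append((start, end))
        start = end
    return segments
```

Feeding in detected scene times of 8.3, 19.4, and 29.2 seconds for a 35-second video yields segments aligned to those boundaries, with a hard cut only at the end where no scene change exists.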
This simple shot-based segmentation makes sure segment boundaries align with natural visual transitions rather than cutting arbitrarily. The target segment duration should be calibrated for your content type and use case: action-heavy content with frequent cuts may benefit from visual segmentation like this, while documentary or interview content with longer takes may work better with longer, topic-based segmentation. For more advanced segmentation techniques, including audio-based topic segmentation and combined visual and audio approaches, we recommend reading Media2Cloud on AWS Guidance: Scene and Ad-Break Detection and Contextual Understanding for Advertising Using Generative AI.
Generate separate embeddings for visual, audio, and transcript signals
With segments defined, the choice of embedding model is where the largest quality gap opens between approaches. The dominant approach today grounds all video signals into text before generating embeddings, but as we established earlier, video carries far more meaning than any transcript or caption can express. Visual motion, ambient sound, on-screen text, and entity context are either lost entirely or approximated through imprecise descriptions.
Nova Multimodal Embeddings changes this fundamentally because it is a video-native model that can generate embeddings in two modes. The combined mode fuses visual and audio signals into a unified representation, capturing the most important signals together. This approach benefits storage cost and retrieval latency by requiring only a single embedding per segment. Alternatively, the AUDIO_VIDEO_SEPARATE mode generates distinct visual and audio embeddings. This approach provides maximum fidelity in modality-specific embeddings and gives you finer control over when to search visual content versus audio content.
In our implementation, we added a third speech embedding derived from Amazon Transcribe. This embedding is created by aligning full sentence transcripts to the embedding segment timestamps, before and after, preserving the semantic integrity of spoken language and ensuring that a complete thought isn't split across two embeddings.
Figure 4: Visual, audio, and speech embedding generation per video segment
Together, these three embeddings cover the full signal space of a video segment. The visual embedding captures what the camera sees: objects, scenes, actions, colors, and spatial composition. The audio embedding captures what the microphone hears: music, sound effects, ambient noise, and the acoustic texture of a scene. The transcript embedding captures what people say, representing the semantic meaning of spoken dialogue and narration. Collapsing all three signals into a single combined embedding compresses distinct modalities into one vector. This blurs the boundaries between what is seen, heard, and spoken, and loses the fine-grained detail that makes each signal useful on its own. Keeping them separate gives you precise control to dial each modality up or down based on query intent, allowing the search pipeline to match against the modality most likely to contain the answer.
Combine metadata and embeddings for hybrid search
Even with three independent embeddings covering visual, audio, and spoken content, there is still a class of queries the system can't answer well. Embeddings are designed to capture semantic similarity. They excel at finding a "tense crowd moment" or a "sun setting over water" because these are concepts with rich visual and audio meaning. But when a user searches for a specific name, product model number, geolocation, or a particular date, embeddings will likely fail. These are discrete entities with little semantic signal on their own. This is where hybrid search comes in. Rather than relying on embeddings alone, the system runs two parallel retrieval paths as shown in the following figure: a semantic path that matches against your visual, audio, and transcript embeddings to capture conceptual similarity, and a lexical path that performs exact keyword and entity matching against structured metadata.
Figure 5: Hybrid search pipeline combining semantic and lexical retrieval
How much metadata do you need? The answer depends on your content type, organization, and use case, and capturing everything upfront is impractical. For illustration purposes, we selected a few categories of metadata that represent common types found in media and entertainment content.
First, we selected video title and datetime to represent technical metadata extracted directly from the content catalog or file metadata. Then we added segment captions, genre, and celebrity recognition to represent contextual metadata, generated using Amazon Nova 2 Lite and Amazon Rekognition. Captions are generated from the video and transcript of each segment, giving the model both visual and spoken context. Genre is predicted from the full video transcript across all segments, which is cheaper and more reliable than re-sending all video clips. Celebrity identification is handled by Amazon Rekognition, which recognizes known public figures appearing on screen without requiring custom training.
Example prompts used for caption generation and genre classification are shown in the following examples:
# Caption generation
Describe this video clip in 3-5 sentences. Include:
- What is happening, who is visible, actions, setting, and atmosphere
- Any text on screen: titles, subtitles, signs, logos, watermarks, or credits
- If the screen is mostly black or blank, state "Black frame" or "Blank screen"
Transcription: {segment_transcript}
Return ONLY the descriptive caption, nothing else.
# Genre classification
Based on all the video segments described below, classify the overall video
into exactly ONE genre from this list: Sports, News, Entertainment,
Documentary, Education, Music, Gaming, Cooking, Travel, Technology,
Business, Lifestyle, Sci-Fi, Thriller, Other
Segment descriptions:
{all_captions}
Return ONLY the genre name, nothing else.
The concept extends naturally to other metadata types. Technical metadata could include resolution or file size, while contextual metadata might include location, mood, or brand. The right balance depends on your search use case. Additionally, applying metadata filters during retrieval can further improve search scalability and accuracy by narrowing the search space before semantic matching.
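As an illustration of such pre-filtering, the following sketch shows what a kNN query body with a metadata filter could look like. The field names ("genre", "visual_vector") follow the example segment document in this post, but the filter placement is an assumption: exact filtered-kNN syntax varies by OpenSearch version and vector engine.

```python
# Sketch: an OpenSearch query body that restricts kNN matching to
# segments whose metadata already matches, shrinking the search space.
# The 1024-d query vector here is a placeholder.
def filtered_knn_query(query_vector, genre, k=10):
    return {
        "size": k,
        "query": {
            "knn": {
                "visual_vector": {
                    "vector": query_vector,
                    "k": k,
                    # Metadata filter applied during the vector search
                    "filter": {"term": {"genre": genre}},
                }
            }
        },
    }

body = filtered_knn_query([0.0] * 1024, "Documentary")
```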
Optimize search relevance with intent-aware query routing
Now you have three embeddings plus metadata: four searchable dimensions. But how do you know when to use which for a given query? Intent is everything. To solve this, we built an intelligent intent analysis router that uses the Haiku model to analyze each incoming query and assign a weight to each modality channel: visual, audio, transcript, and metadata. See the example search query in the following figure.
“Kevin taking a phone call next to a vintage car”
Figure 6: Example query with weights intelligently assigned based on search intent
The Haiku model is prompted to return a JSON object with weights that sum to 1.0, along with a brief reasoning trace explaining the assignment. See the following prompt:
Analyze this video search query and assign weights (0.0–1.0) for four modalities.
Weights must sum to 1.0.
Return ONLY valid JSON in this exact format:
{"visual": 0.0, "audio": 0.0, "transcription": 0.0, "metadata": 0.0, "reasoning": "…"}
Guidelines:
- visual: appearance, colors, objects, actions, scenes
- audio: sounds, music, noise, non-speech audio
- transcription: spoken words, dialogue, narration
- metadata: person names, genre, captions, factual attributes
Examples:
- "red car driving" → visual=0.9, metadata=0.1
- "person saying hello" → transcription=0.5, visual=0.2, audio=0.2, metadata=0.1
- "dog barking loudly" → audio=0.6, visual=0.3, metadata=0.1
The weights directly control which sub-queries execute. Any modality below a 5% weight threshold is skipped entirely, eliminating unnecessary embedding API calls and reducing search latency without sacrificing accuracy. The remaining channels execute in parallel, each searching its own index independently. Results from all active channels are then scored using a weighted arithmetic mean. BM25 scores (a lexical relevance measure based on term frequency and document length) and cosine similarity scores (a geometric measure of how closely two embedding vectors point in the same direction) live on very different scales. To handle this, each sub-query's scores are first normalized to a 0-1 range, then combined using the router's intent weights:
final_score = w₁ × norm_bm25 + w₂ × norm_visual + w₃ × norm_audio + w₄ × norm_transcription
We chose the weighted arithmetic mean as our reranking technique because it directly incorporates query intent through the router's weights. Unlike Reciprocal Rank Fusion (RRF), which treats all active channels equally regardless of intent, the weighted mean amplifies the channels the router deems most relevant for a given query. In our testing, this produced more accurate results for our search tasks.
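A minimal sketch of this fusion step, assuming each channel returns a dictionary of raw scores keyed by document ID. The function names and data shapes are illustrative, not the reference implementation:

```python
def min_max_normalize(scores):
    """Map one channel's raw per-document scores to the 0-1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(channel_scores, intent_weights, threshold=0.05):
    """Weighted arithmetic mean over active channels.

    channel_scores: {channel: {doc_id: raw_score}}
    intent_weights: {channel: weight} from the intent router
    """
    fused = {}
    for channel, weight in intent_weights.items():
        if weight < threshold:  # skip low-intent channels entirely
            continue
        normalized = min_max_normalize(channel_scores.get(channel, {}))
        for doc, score in normalized.items():
            fused[doc] = fused.get(doc, 0.0) + weight * score
    # Highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Because each channel is normalized before weighting, a BM25 score of 14.2 and a cosine similarity of 0.83 contribute on the same 0-1 scale, and the router's weights alone decide their relative influence.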
Choose the right storage strategy for vectors and metadata
The final design decision is where and how to store it all. Each video segment produces up to three embeddings and a set of metadata fields, and how you store them determines both your search performance and your cost at scale. We split this across two services with complementary roles: Amazon S3 Vectors for vector storage, and Amazon OpenSearch Service for hybrid search.
S3 Vectors stores three vector indices per project, one for each embedding type:
- nova-visual-{project_id} # visual embeddings
- nova-audio-{project_id} # audio embeddings
- nova-transcription-{project_id} # transcript embeddings
OpenSearch holds one index per project, where each document represents a single video segment containing both text fields for BM25 search and vector fields for k-nearest neighbors (kNN) search:
{
  "_id": "f953ceba_seg_0012",
  "start_sec": 118.45,
  "end_sec": 128.72,
  "caption": "A presenter walks through a rice paddy in rural Jakarta, discussing how rice cultivation has shaped local civilization for thousands of years.",
  "people": ["presenter_name"],
  "genre": "Documentary",
  "visual_vector": [0.023, -0.118, 0.045, …],
  "audio_vector": [0.045, 0.091, -0.033, …],
  "transcription_vector": [-0.067, 0.134, 0.012, …]
}
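For reference, an index mapping matching this segment document might look like the following sketch. The knn_vector dimension follows the 1024-dimensional Nova Multimodal Embeddings output mentioned earlier; the remaining settings are assumptions rather than the deployed configuration:

```python
# Assumed OpenSearch mapping for the per-segment index described above
segment_mapping = {
    "settings": {"index": {"knn": True}},  # enable the k-NN plugin
    "mappings": {
        "properties": {
            "start_sec": {"type": "float"},
            "end_sec": {"type": "float"},
            "caption": {"type": "text"},    # analyzed for BM25 search
            "people": {"type": "keyword"},  # exact entity matching
            "genre": {"type": "keyword"},
            "visual_vector": {"type": "knn_vector", "dimension": 1024},
            "audio_vector": {"type": "knn_vector", "dimension": 1024},
            "transcription_vector": {"type": "knn_vector", "dimension": 1024},
        }
    },
}
```

Text fields like caption are analyzed for lexical relevance, while keyword fields like people and genre support the exact-match filters used by the lexical path.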
We chose S3 Vectors for its cost-to-performance benefits. Amazon S3 Vectors reduces the cost of storing and querying vectors by up to 90% compared to other specialized solutions. If search latency is not critical for your use case, S3 Vectors is a strong default choice. If you need the lowest possible latency, we recommend keeping vectors in memory with the OpenSearch Hierarchical Navigable Small World (HNSW) engine.
Finally, it is worth calling out that some use cases require searching within longer, semantically dense video segments such as a full interview, a multi-minute documentary scene, or an extended product demonstration. Most multimodal embedding models, including Nova Multimodal Embeddings, have a maximum input duration of 30 seconds, which means a 3-minute clip can't be embedded as a single unit. Attempting to do so would either fail or force chunking that loses the broader context.
The nested vector support in OpenSearch addresses this by allowing a single document to contain multiple sub-segment embeddings:
{
  "_id": "f953ceba_scene_003",
  "start_sec": 118.45,
  "end_sec": 298.10,
  "sub_segments": [
    { "start_sec": 118.45, "end_sec": 128.72, "visual_vector": […] },
    { "start_sec": 128.72, "end_sec": 139.10, "visual_vector": […] },
    { "start_sec": 139.10, "end_sec": 150.30, "visual_vector": […] }
  ]
}
At query time, OpenSearch scores the document based on the best-matching sub-segment rather than a single averaged representation, so a longer scene can match a specific visual moment within it while still being returned as one coherent result.
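A nested kNN query of this kind could look like the following sketch. The field path follows the scene document above, and score_mode "max" expresses the best-sub-segment scoring just described; exact syntax depends on your OpenSearch version, so treat this as an assumption to verify against the documentation:

```python
# Sketch: nested kNN query scoring each scene by its best sub-segment
def nested_knn_query(query_vector, k=10):
    return {
        "size": k,
        "query": {
            "nested": {
                "path": "sub_segments",
                "score_mode": "max",  # best-matching sub-segment wins
                "query": {
                    "knn": {
                        "sub_segments.visual_vector": {
                            "vector": query_vector,
                            "k": k,
                        }
                    }
                },
            }
        },
    }

q = nested_knn_query([0.1] * 1024, k=5)
```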
Performance results: How the optimized approach outperforms the baseline
To validate our design decisions, we benchmarked the optimized hybrid search against the Nova Multimodal Embeddings baseline AUDIO_VIDEO_COMBINED mode using 10 internal long-form videos (5-20 minutes) evaluated across 20 queries spanning visual, audio, transcript, and metadata-focused searches. The baseline uses a single unified vector per 10-second segment with one index and one kNN query. Our optimized approach generates separate visual, audio, and transcript embeddings, enriches segments with structured metadata, and applies intent-aware routing that dynamically weights modality channels. The following figure shows results across four standard retrieval metrics:
Figure 7: Performance comparison across retrieval metrics for hybrid search with Nova MME vs. baseline
The following table captures the key metrics:

| | Recall@5 | Recall@10 | MRR | NDCG@10 |
|---|---|---|---|---|
| Hybrid search w/ Nova Multimodal Embeddings | 90% | 95% | 90% | 88% |
| Baseline | 51% | 64% | 48% | 54% |
Key metrics explained:
- Recall@5: Of all relevant segments, what fraction appears in the top 5 results? This indicates the coverage of the search results.
- Recall@10: The same coverage measure, computed over the top 10 results.
- MRR (Mean Reciprocal Rank): 1/rank of the first relevant result, averaged across queries. This measures how quickly you find something relevant.
- NDCG@10: Normalized Discounted Cumulative Gain rewards relevant results ranked higher and penalizes those ranked lower. It is a standard ranking quality metric.
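For clarity, these metrics can be computed for a single query as in the following sketch. This is a simplified binary-relevance version under our own naming; the benchmark's actual evaluation code may differ:

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant IDs that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result (0.0 if none is returned)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discount each hit by log2(position + 1)."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Averaging reciprocal_rank over all 20 benchmark queries gives MRR, and averaging the other two gives the per-query-set Recall@k and NDCG@10 figures reported above.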
The results show substantial improvements across all metrics. The optimized hybrid search achieved 90+% Recall@5 and Recall@10 versus 51% and 64% for the baseline (roughly a 40-percentage-point lift in coverage). MRR jumped from 48% to 90%, and NDCG@10 rose from 54% to 88%. These 30-40 percentage point gains validate our core architectural decisions: semantic segmentation preserves content continuity, separate embeddings provide precise search control, metadata enrichment captures factual entities, and intent-aware routing makes sure the right signals drive each query. By treating each modality independently while intelligently combining them based on query intent, the system adapts to diverse search patterns and delivers consistently relevant results as your video archive scales.
Clean up
To avoid incurring future charges, delete the resources used in this solution by removing the AWS CloudFormation stack. For detailed instructions, refer to the GitHub repository.
Conclusion
In this post, we showed how to build a video semantic search solution on AWS using Nova Multimodal Embeddings, covering four key design decisions: segmentation for semantic continuity, multimodal embeddings that capture visual, audio, and speech signals independently, metadata that fills the precision gap for entity-specific queries, and a data structure that organizes everything for efficient retrieval at scale. Together with an intelligent intent analysis router and weighted reranking, these decisions transform a fragmented set of signals into a unified, accurate search experience that understands video. Further optimizations can tune search accuracy even more, including model customization for the intent routing layer. Read Part 2 to go deeper on these techniques. For a production-ready implementation of this video search and metadata management approach at scale, see the Guidance for a Media Lake on AWS.
About the authors
Amit Kalawat
Amit Kalawat is a Principal Solutions Architect at Amazon Web Services based out of New York. He works with enterprise customers as they transform their business and journey to the cloud.
James Wu
James Wu is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping enterprises design and execute AI transformation strategies. Specializing in generative AI, agentic systems, and media supply chain automation, he is a featured conference speaker and technical author. Prior to AWS, he was an architect, developer, and technology leader for over 10 years, with experience spanning the engineering and marketing industries.
Bimal Gajjar
Bimal Gajjar is a Senior Solutions Architect at AWS, where he partners with Global Accounts to design, adopt, and deploy scalable cloud storage and data solutions. With over 25 years of experience working with leading OEMs, including HPE, Dell EMC, and Pure Storage, Bimal combines deep technical expertise with strategic business insight, drawn from end-to-end involvement in pre-sales architecture and global service delivery.

