Optimizing models for video semantic search requires balancing accuracy, cost, and latency. Faster, smaller models lack routing intelligence, whereas larger, more accurate models add significant latency overhead. In Part 1 of this series, we showed how to build a multimodal video semantic search system on AWS with intelligent intent routing using the Anthropic Claude Haiku model in Amazon Bedrock. While the Haiku model delivers strong accuracy for user search intent, it increases end-to-end search time to 2-4 seconds, contributing 75% of the overall latency.
Figure 1: An example end-to-end query latency breakdown
Now consider what happens as the routing logic grows more complex. Enterprise metadata can be far more complex than the five attributes in our example (title, caption, people, genre, and timestamp). Customers might evaluate camera angles, mood and sentiment, licensing and rights windows, and more domain-specific taxonomies. More nuanced logic means a more demanding prompt, and a more demanding prompt leads to more expensive and slower responses. This is where model customization comes in. Rather than choosing between a model that's fast but too simple or one that's accurate but too expensive or too slow, we can achieve all three by training a small model to perform the task accurately at much lower latency and cost.
In this post, we show you how to use Model Distillation, a model customization technique on Amazon Bedrock, to transfer routing intelligence from a large teacher model (Amazon Nova Premier) into a much smaller student model (Amazon Nova Micro). This approach cuts inference cost by over 95% and reduces latency by 50% while maintaining the nuanced routing quality that the task demands.
Solution overview
We will walk through the full distillation pipeline end to end in a Jupyter notebook. At a high level, the notebook contains the following steps:
- Prepare training data — Generate 10,000 synthetic labeled examples using Nova Premier and upload the dataset to Amazon Simple Storage Service (Amazon S3) in the Bedrock distillation format
- Run distillation training job — Configure the job with teacher and student model identifiers and submit it via Amazon Bedrock
- Deploy the distilled model — Deploy the custom model using on-demand inference for flexible, pay-per-use access
- Evaluate the distilled model — Compare routing quality against the base Nova Micro and the original Claude Haiku baseline using Amazon Bedrock Model Evaluation
The complete notebook, training data generation script, and evaluation utilities are available in the GitHub repository.
Prepare training data
One of the key reasons we chose model distillation over other customization methods like supervised fine-tuning (SFT) is that it doesn't require a fully labeled dataset. With SFT, every training example needs a human-generated response as ground truth. With distillation, you only need prompts. Amazon Bedrock automatically invokes the teacher model to generate high-quality responses. It applies data synthesis and augmentation techniques behind the scenes to produce a diverse training dataset of up to 15,000 prompt-response pairs.
That said, you can optionally provide a labeled dataset if you want more control over the training signal. Each record in the JSONL file follows the bedrock-conversation-2024 schema, where the user role (the input prompt) is required and the assistant role (the desired response) is optional. See the following example, and reference Prepare your training datasets for distillation for more detail:
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [{ "text": "Return JSON with visual, audio, transcription, metadata weights (sum=1.0) and reasoning for the given video search query." }],
  "messages": [
    {
      "role": "user",
      "content": [{ "text": "Olivia talking about growing up in poverty" }]
    },
    {
      "role": "assistant",
      "content": [{ "text": "{\"visual\": 0.2, \"audio\": 0.1, \"transcription\": 0.6, \"metadata\": 0.1, \"reasoning\": \"The query focuses on spoken content ('talking about'), making transcription most important. Visual and audio elements are secondary since they support the context, while metadata is minimal.\"}" }]
    }
  ]
}
For this post, we prepared 10,000 synthetic labeled examples using Nova Premier, the largest and most capable model in the Nova family. The data was generated with a balanced distribution across visual, audio, transcription, and metadata signal queries. The examples cover the full range of expected search inputs, represent different difficulty levels, include edge cases and variations, and prevent overfitting to narrow query patterns. The following chart shows the weight distribution across the four modality channels.
Figure 2: The weight distribution across the 10,000 training examples
If you need additional examples or want to adapt the query distribution to your own content domain, the provided generate_training_data.py script can be used to synthetically generate more training data using Nova Premier.
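To illustrate the record format programmatically, the following is a minimal sketch of how one labeled record could be assembled for the JSONL file. The helper name and the example weights are our own illustration, not part of the provided script.

```python
import json

SYSTEM_PROMPT = (
    "Return JSON with visual, audio, transcription, metadata weights "
    "(sum=1.0) and reasoning for the given video search query."
)

def build_distillation_record(query, weights, reasoning):
    """Assemble one record in the bedrock-conversation-2024 schema.

    The assistant turn is optional for distillation; include it only when
    supplying your own ground-truth label.
    """
    label = dict(weights)
    label["reasoning"] = reasoning
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": SYSTEM_PROMPT}],
        "messages": [
            {"role": "user", "content": [{"text": query}]},
            {"role": "assistant", "content": [{"text": json.dumps(label)}]},
        ],
    }

record = build_distillation_record(
    "Olivia talking about growing up in poverty",
    {"visual": 0.2, "audio": 0.1, "transcription": 0.6, "metadata": 0.1},
    "The query focuses on spoken content, making transcription most important.",
)
# One JSON object per line makes up the JSONL file uploaded to S3.
jsonl_line = json.dumps(record)
```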
Run distillation training job
With the training data uploaded to Amazon S3, the next step is to submit the distillation job. Model distillation works by using your prompts to first generate responses from the teacher model. It then uses these prompt-response pairs to fine-tune the student model. In this project, the teacher is Amazon Nova Premier and the student is Amazon Nova Micro, a fast, cost-efficient model optimized for high-throughput inference. The teacher's routing decisions become the training signal that shapes the student's behavior.
Amazon Bedrock manages the entire training orchestration and infrastructure automatically. There is no cluster provisioning, no hyperparameter tuning, and no teacher-to-student model pipeline setup required. You specify the teacher model, the student model, the S3 path to your training data, and an AWS Identity and Access Management (IAM) role with the required permissions. Bedrock handles the rest. The following is an example code snippet to trigger the distillation training job:
import boto3
from datetime import datetime

bedrock_client = boto3.client(service_name="bedrock")

teacher_model = "us.amazon.nova-premier-v1:0"
student_model = "amazon.nova-micro-v1:0:128k"

job_name = f"video-search-distillation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
model_name = "nova-micro-video-router-v1"

response = bedrock_client.create_model_customization_job(
    jobName=job_name,
    customModelName=model_name,
    roleArn=distillation_role_arn,
    baseModelIdentifier=student_model,
    customizationType="DISTILLATION",
    trainingDataConfig={"s3Uri": training_s3_uri},
    outputDataConfig={"s3Uri": output_s3_uri},
    customizationConfig={
        "distillationConfig": {
            "teacherModelConfig": {
                "teacherModelIdentifier": teacher_model,
                "maxResponseLengthForInference": 1000
            }
        }
    }
)
job_arn = response["jobArn"]
The job runs asynchronously. You can monitor progress in the Amazon Bedrock console under Foundation models > Custom models, or programmatically:
status = bedrock_client.get_model_customization_job(
    jobIdentifier=job_arn)["status"]
print(f"Job status: {status}")  # e.g. InProgress, Completed, or Failed
Training time varies depending on the dataset size and the student model selected. For 10,000 labeled examples with Nova Micro, expect the job to complete within a few hours.
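Since the job can run for hours, it is convenient to wrap the status check in a polling loop. The following is a minimal sketch (the helper name is ours; the terminal status names follow the GetModelCustomizationJob API, and the client is passed in explicitly so the logic can be exercised without AWS access):

```python
import time

# Terminal states reported by the Bedrock GetModelCustomizationJob API
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_distillation_job(bedrock_client, job_arn, poll_seconds=60,
                              sleep=time.sleep):
    """Poll a model customization job until it reaches a terminal status."""
    while True:
        status = bedrock_client.get_model_customization_job(
            jobIdentifier=job_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

In the notebook you would call `wait_for_distillation_job(bedrock_client, job_arn)` after submitting the job.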
Deploy the distilled model
Once the distillation job is complete, the custom model is available in your Amazon Bedrock account and ready to deploy. Amazon Bedrock offers two deployment options for custom models: Provisioned Throughput for predictable, high-volume workloads, and On-Demand Inference for flexible, pay-per-use access with no upfront commitment.
For most teams getting started, on-demand inference is the recommended path. There is no endpoint to provision, no hourly commitment, and no minimum usage requirement. The following is the deployment code:
import uuid

# The custom model ARN is available from the completed customization job
custom_model_arn = bedrock_client.get_model_customization_job(
    jobIdentifier=job_arn)["outputModelArn"]

deployment_name = f"nova-micro-video-router-{datetime.now().strftime('%Y-%m-%d')}"

response = bedrock_client.create_custom_model_deployment(
    modelDeploymentName=deployment_name,
    modelArn=custom_model_arn,
    description="Distilled Nova Micro for video search modality weight prediction (4 weights)",
    tags=[
        {"key": "UseCase", "value": "VideoSearch"},
        {"key": "Version", "value": "v2-4weights"},
    ],
    clientRequestToken=f"deployment-{uuid.uuid4()}",
)
deployment_arn = response["modelDeploymentArn"]
print(f"Deployment ARN: {deployment_arn}")
Once the status shows InService, you can invoke the distilled model exactly as you would any other base model using the standard InvokeModel or Converse API. You pay only for the tokens you consume at Nova Micro inference rates: $0.000035 per 1,000 input tokens and $0.000140 per 1,000 output tokens.
import boto3
import json

bedrock_runtime = boto3.client(service_name="bedrock-runtime")

# On-demand custom model deployments are invoked via the deployment ARN
query = "sunset over mountains"  # example user search query

response = bedrock_runtime.converse(
    modelId=deployment_arn,
    messages=[
        {
            "role": "user",
            "content": [{"text": query}]
        }
    ]
)
routing_weights = json.loads(
    response["output"]["message"]["content"][0]["text"]
)
print(routing_weights)
# Example output: {"visual": 0.8, "audio": 0.0, "transcription": 0.0, "metadata": 0.2, ...}
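Downstream, these weights drive how the search pipeline from Part 1 combines per-modality relevance scores into one ranking. The following is a minimal sketch of that fusion step; the function name and sample scores are illustrative, assuming each modality channel returns a normalized score per video.

```python
def fuse_scores(modality_scores, weights):
    """Combine per-modality relevance scores into a single ranking score.

    modality_scores: {video_id: {"visual": score, "audio": score, ...}}
    weights: routing weights predicted by the distilled model (sum to 1.0).
    """
    return {
        video_id: sum(weights.get(m, 0.0) * s for m, s in scores.items())
        for video_id, scores in modality_scores.items()
    }

weights = {"visual": 0.7, "audio": 0.1, "transcription": 0.1, "metadata": 0.1}
scores = {
    "vid-1": {"visual": 0.9, "audio": 0.2, "transcription": 0.1, "metadata": 0.5},
    "vid-2": {"visual": 0.3, "audio": 0.8, "transcription": 0.7, "metadata": 0.4},
}
# For a visual-heavy query, vid-1 ranks first despite vid-2's stronger audio
ranked = sorted(fuse_scores(scores, weights).items(), key=lambda kv: -kv[1])
```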
Evaluate the distilled model
Before evaluating against the original router, it's worth validating that distillation improved the base model's ability to follow the routing task. The following comparison shows the same prompt run through base Nova Micro and the distilled Nova Micro side by side.
Query: "CEO discussing quarterly earnings"
Distilled Nova Micro:
{"visual": 0.2, "audio": 0.3, "transcription": 0.4, "metadata": 0.1, "reasoning": "The query focuses on spoken content (transcription) about earnings, but visual cues (CEO's appearance) and audio (tone/clarity) are also important…"}
Base Nova Micro:
Here's a JSON representation of the information you requested for a video search query about a CEO discussing quarterly earnings:
```json{ "video": { "visual": 0.3, "audio": 0.3, "transcription": 0.2, "metadata": 0.1, "reasoning": "The visual component includes the CEO's pres….

Query: "sunset over mountains"
Distilled Nova Micro:
{"visual": 0.8, "audio": 0.0, "transcription": 0.0, "metadata": 0.2, "reasoning": "The query focuses on a visual scene (sunset over mountains), with no audio or transcription elements. Metadata might include location or time-related tags."}
Base Nova Micro:
Here's a JSON representation for a video search query "sunset over mountains" that includes visual, audio, transcription, metadata weights (sum=1.0), and reasoning:
```json{ "query": "sunset over mountains", "results": [ { "video_id": "123456", "visual": 0.4, "audio": 0.3 ….
The base model struggles with both instructions and output format consistency. It produces free-text responses, incomplete JSON, and non-numeric weight values. The distilled model consistently returns well-formed JSON with four numeric weights that sum to 1.0, matching the schema required by the routing pipeline.
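In a production router, such malformed outputs should be caught before they reach the fusion step. The following is a minimal validator sketch (the function name and tolerance are our own) for the four-weight schema described above:

```python
import json

REQUIRED_KEYS = ("visual", "audio", "transcription", "metadata")

def parse_routing_weights(text, tolerance=0.01):
    """Parse a model response and validate the four-weight routing schema.

    Returns the weights dict, or None if the output is not well-formed JSON
    with four numeric weights summing to approximately 1.0.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    weights = {k: data.get(k) for k in REQUIRED_KEYS}
    if any(not isinstance(v, (int, float)) for v in weights.values()):
        return None
    if abs(sum(weights.values()) - 1.0) > tolerance:
        return None
    return weights
```

A caller can fall back to default weights (for example, a uniform 0.25 split) whenever the validator returns None.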
To compare against the original Claude Haiku router, we evaluate both models against a held-out set of 100 labeled examples generated by Nova Premier. We use Amazon Bedrock Model Evaluation to run the comparison in a structured, managed workflow. To assess routing quality beyond standard metrics, we defined a custom OverallQuality rubric (see the following code block) that instructs Claude Sonnet to score each prediction on two dimensions: weight accuracy against ground truth and reasoning quality. Each dimension maps to a concrete 5-point threshold, so the rubric penalizes both numerical drift and generic boilerplate reasoning.
"rating_scale": [
  {"definition": "Weights within 0.05 of reference. Reasoning is specific and consistent.",
   "value": {"floatValue": 5.0}},
  {"definition": "Weights within 0.10 of reference. Reasoning is clear and mostly consistent.",
   "value": {"floatValue": 4.0}},
  {"definition": "Dominant modality matches. Avg error < 0.15. Reasoning is present but generic.",
   "value": {"floatValue": 3.0}},
  {"definition": "Dominant modality wrong OR avg error > 0.15. Reasoning vague or inconsistent.",
   "value": {"floatValue": 2.0}},
  {"definition": "Unparseable JSON, missing keys, or error > 0.30. No useful reasoning.",
   "value": {"floatValue": 1.0}}
]
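The numeric half of this rubric can be sanity-checked offline before launching the managed evaluation job. The following sketch (function name and example values are ours) computes the two quantities the thresholds reference:

```python
def weight_metrics(predicted, reference):
    """Compute the two numeric signals the rubric thresholds reference:
    average absolute weight error, and dominant-modality agreement."""
    keys = ("visual", "audio", "transcription", "metadata")
    avg_error = sum(abs(predicted[k] - reference[k]) for k in keys) / len(keys)
    dominant_match = (max(keys, key=lambda k: predicted[k])
                      == max(keys, key=lambda k: reference[k]))
    return avg_error, dominant_match

pred = {"visual": 0.2, "audio": 0.3, "transcription": 0.4, "metadata": 0.1}
ref = {"visual": 0.2, "audio": 0.2, "transcription": 0.5, "metadata": 0.1}
avg_error, dominant_match = weight_metrics(pred, ref)
# avg_error is 0.05 and transcription dominates in both, which the rubric
# would place in the 4-5 band (assuming the reasoning text also holds up)
```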
The distilled Nova Micro model achieved a large language model (LLM)-as-judge score of 4.0 out of 5, near-identical routing quality to Claude 4.5 Haiku at roughly half the latency (833 ms vs. 1,741 ms). The cost advantage is equally significant. Switching to the distilled Nova Micro model reduces inference costs by over 95% on both input and output tokens, with no upfront commitments under on-demand pricing. Note: LLM-as-judge evaluation is non-deterministic. Scores may vary slightly across runs.
Figure 3: Model performance comparison (Distilled Nova Micro vs. Claude 4.5 Haiku)
The following is a table summary of the side-by-side results:
| Metric | Distilled Nova Micro | Claude 4.5 Haiku |
| --- | --- | --- |
| LLM-as-judge score | 4.0 / 5 | 4.0 / 5 |
| Mean latency | 833 ms | 1,741 ms |
| Input token cost | $0.035 / 1M | $0.80–$1.00 / 1M |
| Output token cost | $0.14 / 1M | $4.00–$5.00 / 1M |
| Output format | Consistent JSON | Inconsistent |
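To put the per-token rates in workload terms, the following is a rough sketch. We assume ~500 input and ~150 output tokens per routing call (an illustrative assumption, not a measured value), with per-1M-token rates of $0.035/$0.14 for the distilled Nova Micro (equivalent to the per-1K figures quoted earlier) and $1.00/$5.00 for Claude Haiku (top of the quoted range).

```python
# Per-1M-token rates (USD); Haiku taken at the top of its quoted range
DISTILLED = {"input": 0.035, "output": 0.14}
HAIKU = {"input": 1.00, "output": 5.00}

def cost_per_million_queries(rates, input_tokens=500, output_tokens=150):
    """Cost of 1M routing calls: N tokens/query over 1M queries equals
    N million tokens, priced at the per-1M-token rate."""
    return input_tokens * rates["input"] + output_tokens * rates["output"]

distilled_cost = cost_per_million_queries(DISTILLED)  # $38.50 per 1M queries
haiku_cost = cost_per_million_queries(HAIKU)          # $1,250 per 1M queries
savings = 1 - distilled_cost / haiku_cost             # ~97% reduction
```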
Clean up
To avoid ongoing charges, run the cleanup section of the notebook to remove any provisioned resources, including deployed model endpoints and any data stored in Amazon S3.
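As a sketch of what that cleanup involves for the resources created in this post, the following uses the Bedrock DeleteCustomModelDeployment and DeleteCustomModel APIs (the client is passed in explicitly, and S3 objects are removed separately through your usual S3 tooling):

```python
def cleanup_distillation_resources(bedrock_client, deployment_arn, custom_model_name):
    """Remove the on-demand deployment first, then the custom model itself.

    Order matters: a custom model cannot be deleted while a deployment
    still references it.
    """
    bedrock_client.delete_custom_model_deployment(
        customModelDeploymentIdentifier=deployment_arn)
    bedrock_client.delete_custom_model(modelIdentifier=custom_model_name)
```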
Conclusion
This post is the second part of a two-part series. Building on Part 1, this post focuses on applying model distillation to optimize the intent routing layer built in the video semantic search solution. The techniques discussed help address real production tradeoffs, such as balancing routing intelligence with latency and cost at scale while maintaining search accuracy. By distilling Amazon Nova Premier's routing behavior into Amazon Nova Micro using Amazon Bedrock Model Distillation, we reduced inference cost by over 95% and cut preprocessing latency in half while preserving the nuanced routing quality that the task demands. If you are running multimodal video search at scale, model distillation is a practical path to production-grade cost efficiency without sacrificing search accuracy. To explore the full implementation, visit the GitHub repository and try out the solution yourself.
About the authors
Amit Kalawat
Amit Kalawat is a Principal Solutions Architect at Amazon Web Services based out of New York. He works with enterprise customers as they transform their business and journey to the cloud.
James Wu
James Wu is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping enterprises design and execute AI transformation strategies. Specializing in generative AI, agentic systems, and media supply chain automation, he is a featured conference speaker and technical author. Prior to AWS, he was an architect, developer, and technology leader for over 10 years, with experience spanning the engineering and marketing industries.
Bimal Gajjar
Bimal Gajjar is a Senior Solutions Architect at AWS, where he partners with Global Accounts to design, adopt, and deploy scalable cloud storage and data solutions. With over 25 years of experience working with leading OEMs, including HPE, Dell EMC, and Pure Storage, Bimal combines deep technical expertise with strategic business insight, drawn from end-to-end involvement in pre-sales architecture and global service delivery.

