Optimizing models for video semantic search requires balancing accuracy, cost, and latency. Faster, smaller models lack routing intelligence, whereas larger, more accurate models add significant latency overhead. In Part 1 of this series, we showed how to build a multimodal video semantic search system on AWS with intelligent intent routing using the Anthropic Claude Haiku model in Amazon Bedrock. While the Haiku model delivers strong accuracy for user search intent, it increases end-to-end search time to 2-4 seconds, contributing 75% of the overall latency.
Figure 1: An example end-to-end query latency breakdown
Now consider what happens as the routing logic grows more complex. Enterprise metadata can be far more complex than the five attributes in our example (title, caption, people, genre, and timestamp). Customers might evaluate camera angles, mood and sentiment, licensing and rights windows, and more domain-specific taxonomies. More nuanced logic means a more demanding prompt, and a more demanding prompt leads to more expensive and slower responses. This is where model customization comes in. Rather than choosing between a model that's fast but too simple or one that's accurate but too expensive or too slow, we can achieve all three by training a small model to perform the task accurately at much lower latency and cost.
In this post, we show you how to use Model Distillation, a model customization technique on Amazon Bedrock, to transfer routing intelligence from a large teacher model (Amazon Nova Premier) into a much smaller student model (Amazon Nova Micro). This approach cuts inference cost by over 95% and reduces latency by 50% while maintaining the nuanced routing quality that the task demands.
Solution overview
We will walk through the full distillation pipeline end to end in a Jupyter notebook. At a high level, the notebook contains the following steps:
- Prepare training data — Generate 10,000 synthetic labeled examples using Nova Premier and upload the dataset to Amazon Simple Storage Service (Amazon S3) in the Bedrock distillation format
- Run distillation training job — Configure the job with teacher and student model identifiers and submit it via Amazon Bedrock
- Deploy the distilled model — Deploy the custom model using on-demand inference for flexible, pay-per-use access
- Evaluate the distilled model — Compare routing quality against the base Nova Micro and the original Claude Haiku baseline using Amazon Bedrock Model Evaluation
The complete notebook, training data generation script, and evaluation utilities are available in the GitHub repository.
Prepare training data
One of the key reasons we chose model distillation over other customization methods like supervised fine-tuning (SFT) is that it doesn't require a fully labeled dataset. With SFT, every training example needs a human-generated response as ground truth. With distillation, you only need prompts. Amazon Bedrock automatically invokes the teacher model to generate high-quality responses. It applies data synthesis and augmentation techniques behind the scenes to produce a diverse training dataset of up to 15,000 prompt-response pairs.
That said, you can optionally provide a labeled dataset if you want more control over the training signal. Each record in the JSONL file follows the bedrock-conversation-2024 schema, where the user role (the input prompt) is required and the assistant role (the desired response) is optional. See the following example, and reference Prepare your training datasets for distillation for more detail:
{
  "schemaVersion": "bedrock-conversation-2024",
  "system": [{ "text": "Return JSON with visual, audio, transcription, metadata weights (sum=1.0) and reasoning for the given video search query." }],
  "messages": [
    {
      "role": "user",
      "content": [{ "text": "Olivia talking about growing up in poverty" }]
    },
    {
      "role": "assistant",
      "content": [{ "text": "{\"visual\": 0.2, \"audio\": 0.1, \"transcription\": 0.6, \"metadata\": 0.1, \"reasoning\": \"The query focuses on spoken content ('talking about'), making transcription most important. Visual and audio elements are secondary since they support the context, while metadata is minimal.\"}" }]
    }
  ]
}
For this post, we prepared 10,000 synthetic labeled examples using Nova Premier, the largest and most capable model in the Nova family. The data was generated with a balanced distribution across visual, audio, transcription, and metadata signal queries. The examples cover the full range of expected search inputs, represent different difficulty levels, include edge cases and variations, and prevent overfitting to narrow query patterns. The following chart shows the weight distribution across the four modality channels.
Figure 2: The weight distribution across the 10,000 training examples
If you need additional examples or want to adapt the query distribution to your own content domain, the provided generate_training_data.py script can be used to synthetically generate more training data using Nova Premier.
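To illustrate the record format programmatically, the following is a minimal sketch of how one labeled record could be assembled for the JSONL file. The helper name and the example weights are our own illustration, not part of the provided script.

```python
import json

SYSTEM_PROMPT = (
    "Return JSON with visual, audio, transcription, metadata weights "
    "(sum=1.0) and reasoning for the given video search query."
)

def build_distillation_record(query, weights, reasoning):
    """Assemble one record in the bedrock-conversation-2024 schema.

    The assistant turn is optional for distillation; include it only when
    supplying your own ground-truth label.
    """
    label = dict(weights)
    label["reasoning"] = reasoning
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": SYSTEM_PROMPT}],
        "messages": [
            {"role": "user", "content": [{"text": query}]},
            {"role": "assistant", "content": [{"text": json.dumps(label)}]},
        ],
    }

record = build_distillation_record(
    "Olivia talking about growing up in poverty",
    {"visual": 0.2, "audio": 0.1, "transcription": 0.6, "metadata": 0.1},
    "The query focuses on spoken content, making transcription most important.",
)
# One JSON object per line makes up the JSONL file uploaded to S3.
jsonl_line = json.dumps(record)
```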
Run distillation training job
With the training data uploaded to Amazon S3, the next step is to submit the distillation job. Model distillation works by using your prompts to first generate responses from the teacher model. It then uses these prompt-response pairs to fine-tune the student model. In this project, the teacher is Amazon Nova Premier and the student is Amazon Nova Micro, a fast, cost-efficient model optimized for high-throughput inference. The teacher's routing decisions become the training signal that shapes the student's behavior.
Amazon Bedrock manages the entire training orchestration and infrastructure automatically. There is no cluster provisioning, no hyperparameter tuning, and no teacher-to-student model pipeline setup required. You specify the teacher model, the student model, the S3 path to your training data, and an AWS Identity and Access Management (IAM) role with the required permissions. Bedrock handles the rest. The following is an example code snippet to trigger the distillation training job:
import boto3
from datetime import datetime

bedrock_client = boto3.client(service_name="bedrock")

teacher_model = "us.amazon.nova-premier-v1:0"
student_model = "amazon.nova-micro-v1:0:128k"

job_name = f"video-search-distillation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
model_name = "nova-micro-video-router-v1"

response = bedrock_client.create_model_customization_job(
    jobName=job_name,
    customModelName=model_name,
    roleArn=distillation_role_arn,
    baseModelIdentifier=student_model,
    customizationType="DISTILLATION",
    trainingDataConfig={"s3Uri": training_s3_uri},
    outputDataConfig={"s3Uri": output_s3_uri},
    customizationConfig={
        "distillationConfig": {
            "teacherModelConfig": {
                "teacherModelIdentifier": teacher_model,
                "maxResponseLengthForInference": 1000
            }
        }
    }
)
job_arn = response["jobArn"]
The job runs asynchronously. You can monitor progress in the Amazon Bedrock console under Foundation models > Custom models, or programmatically:
status = bedrock_client.get_model_customization_job(
    jobIdentifier=job_arn)["status"]
print(f"Job status: {status}")  # e.g. InProgress, Completed, or Failed
Training time varies depending on the dataset size and the student model selected. For 10,000 labeled examples with Nova Micro, expect the job to complete within a few hours.
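Since the job can run for hours, it is convenient to wrap the status check in a polling loop. The following is a minimal sketch (the helper name is ours; the terminal status names follow the GetModelCustomizationJob API, and the client is passed in explicitly so the logic can be exercised without AWS access):

```python
import time

# Terminal states reported by the Bedrock GetModelCustomizationJob API
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_distillation_job(bedrock_client, job_arn, poll_seconds=60,
                              sleep=time.sleep):
    """Poll a model customization job until it reaches a terminal status."""
    while True:
        status = bedrock_client.get_model_customization_job(
            jobIdentifier=job_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

In the notebook you would call `wait_for_distillation_job(bedrock_client, job_arn)` after submitting the job.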
Deploy the distilled model
Once the distillation job is complete, the custom model is available in your Amazon Bedrock account and ready to deploy. Amazon Bedrock offers two deployment options for custom models: Provisioned Throughput for predictable, high-volume workloads, and On-Demand Inference for flexible, pay-per-use access with no upfront commitment.
For most teams getting started, on-demand inference is the recommended path. There is no endpoint to provision, no hourly commitment, and no minimum usage requirement. The following is the deployment code:
import uuid

# The custom model ARN is available from the completed customization job
custom_model_arn = bedrock_client.get_model_customization_job(
    jobIdentifier=job_arn)["outputModelArn"]

deployment_name = f"nova-micro-video-router-{datetime.now().strftime('%Y-%m-%d')}"

response = bedrock_client.create_custom_model_deployment(
    modelDeploymentName=deployment_name,
    modelArn=custom_model_arn,
    description="Distilled Nova Micro for video search modality weight prediction (4 weights)",
    tags=[
        {"key": "UseCase", "value": "VideoSearch"},
        {"key": "Version", "value": "v2-4weights"},
    ],
    clientRequestToken=f"deployment-{uuid.uuid4()}",
)
deployment_arn = response["modelDeploymentArn"]
print(f"Deployment ARN: {deployment_arn}")
Once the status shows InService, you can invoke the distilled model exactly as you would any other base model using the standard InvokeModel or Converse API. You pay only for the tokens you consume at Nova Micro inference rates: $0.000035 per 1,000 input tokens and $0.000140 per 1,000 output tokens.
import boto3
import json

bedrock_runtime = boto3.client(service_name="bedrock-runtime")

# On-demand custom model deployments are invoked via the deployment ARN
query = "sunset over mountains"  # example user search query

response = bedrock_runtime.converse(
    modelId=deployment_arn,
    messages=[
        {
            "role": "user",
            "content": [{"text": query}]
        }
    ]
)
routing_weights = json.loads(
    response["output"]["message"]["content"][0]["text"]
)
print(routing_weights)
# Example output: {"visual": 0.8, "audio": 0.0, "transcription": 0.0, "metadata": 0.2, ...}
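Downstream, these weights drive how the search pipeline from Part 1 combines per-modality relevance scores into one ranking. The following is a minimal sketch of that fusion step; the function name and sample scores are illustrative, assuming each modality channel returns a normalized score per video.

```python
def fuse_scores(modality_scores, weights):
    """Combine per-modality relevance scores into a single ranking score.

    modality_scores: {video_id: {"visual": score, "audio": score, ...}}
    weights: routing weights predicted by the distilled model (sum to 1.0).
    """
    return {
        video_id: sum(weights.get(m, 0.0) * s for m, s in scores.items())
        for video_id, scores in modality_scores.items()
    }

weights = {"visual": 0.7, "audio": 0.1, "transcription": 0.1, "metadata": 0.1}
scores = {
    "vid-1": {"visual": 0.9, "audio": 0.2, "transcription": 0.1, "metadata": 0.5},
    "vid-2": {"visual": 0.3, "audio": 0.8, "transcription": 0.7, "metadata": 0.4},
}
# For a visual-heavy query, vid-1 ranks first despite vid-2's stronger audio
ranked = sorted(fuse_scores(scores, weights).items(), key=lambda kv: -kv[1])
```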
Evaluate the distilled model
Before evaluating against the original router, it's worth validating that distillation improved the base model's ability to follow the routing task. The following comparison shows the same prompt run through base Nova Micro and the distilled Nova Micro side by side.
Query: "CEO discussing quarterly earnings"
Distilled Nova Micro:
{"visual": 0.2, "audio": 0.3, "transcription": 0.4, "metadata": 0.1, "reasoning": "The query focuses on spoken content (transcription) about earnings, but visual cues (CEO's appearance) and audio (tone/clarity) are also important…"}
Base Nova Micro:
Here's a JSON representation of the information you requested for a video search query about a CEO discussing quarterly earnings:
```json{ "video": { "visual": 0.3, "audio": 0.3, "transcription": 0.2, "metadata": 0.1, "reasoning": "The visual component includes the CEO's pres….

Query: "sunset over mountains"
Distilled Nova Micro:
{"visual": 0.8, "audio": 0.0, "transcription": 0.0, "metadata": 0.2, "reasoning": "The query focuses on a visual scene (sunset over mountains), with no audio or transcription elements. Metadata might include location or time-related tags."}
Base Nova Micro:
Here's a JSON representation for a video search query "sunset over mountains" that includes visual, audio, transcription, metadata weights (sum=1.0), and reasoning:
```json{ "query": "sunset over mountains", "results": [ { "video_id": "123456", "visual": 0.4, "audio": 0.3 ….
The base model struggles with both instructions and output format consistency. It produces free-text responses, incomplete JSON, and non-numeric weight values. The distilled model consistently returns well-formed JSON with four numeric weights that sum to 1.0, matching the schema required by the routing pipeline.
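In a production router, such malformed outputs should be caught before they reach the fusion step. The following is a minimal validator sketch (the function name and tolerance are our own) for the four-weight schema described above:

```python
import json

REQUIRED_KEYS = ("visual", "audio", "transcription", "metadata")

def parse_routing_weights(text, tolerance=0.01):
    """Parse a model response and validate the four-weight routing schema.

    Returns the weights dict, or None if the output is not well-formed JSON
    with four numeric weights summing to approximately 1.0.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    weights = {k: data.get(k) for k in REQUIRED_KEYS}
    if any(not isinstance(v, (int, float)) for v in weights.values()):
        return None
    if abs(sum(weights.values()) - 1.0) > tolerance:
        return None
    return weights
```

A caller can fall back to default weights (for example, a uniform 0.25 split) whenever the validator returns None.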
To compare against the original Claude Haiku router, we evaluate both models against a held-out set of 100 labeled examples generated by Nova Premier. We use Amazon Bedrock Model Evaluation to run the comparison in a structured, managed workflow. To assess routing quality beyond standard metrics, we defined a custom OverallQuality rubric (see the following code block) that instructs Claude Sonnet to score each prediction on two dimensions: weight accuracy against ground truth and reasoning quality. Each dimension maps to a concrete 5-point threshold, so the rubric penalizes both numerical drift and generic boilerplate reasoning.
"rating_scale": [
  {"definition": "Weights within 0.05 of reference. Reasoning is specific and consistent.",
   "value": {"floatValue": 5.0}},
  {"definition": "Weights within 0.10 of reference. Reasoning is clear and mostly consistent.",
   "value": {"floatValue": 4.0}},
  {"definition": "Dominant modality matches. Avg error < 0.15. Reasoning is present but generic.",
   "value": {"floatValue": 3.0}},
  {"definition": "Dominant modality wrong OR avg error > 0.15. Reasoning vague or inconsistent.",
   "value": {"floatValue": 2.0}},
  {"definition": "Unparseable JSON, missing keys, or error > 0.30. No useful reasoning.",
   "value": {"floatValue": 1.0}}
]
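The numeric half of this rubric can be sanity-checked offline before launching the managed evaluation job. The following sketch (function name and example values are ours) computes the two quantities the thresholds reference:

```python
def weight_metrics(predicted, reference):
    """Compute the two numeric signals the rubric thresholds reference:
    average absolute weight error, and dominant-modality agreement."""
    keys = ("visual", "audio", "transcription", "metadata")
    avg_error = sum(abs(predicted[k] - reference[k]) for k in keys) / len(keys)
    dominant_match = (max(keys, key=lambda k: predicted[k])
                      == max(keys, key=lambda k: reference[k]))
    return avg_error, dominant_match

pred = {"visual": 0.2, "audio": 0.3, "transcription": 0.4, "metadata": 0.1}
ref = {"visual": 0.2, "audio": 0.2, "transcription": 0.5, "metadata": 0.1}
avg_error, dominant_match = weight_metrics(pred, ref)
# avg_error is 0.05 and transcription dominates in both, which the rubric
# would place in the 4-5 band (assuming the reasoning text also holds up)
```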
The distilled Nova Micro model achieved a large language model (LLM)-as-judge score of 4.0 out of 5, near-identical routing quality to Claude 4.5 Haiku at roughly half the latency (833 ms vs. 1,741 ms). The cost advantage is equally significant. Switching to the distilled Nova Micro model reduces inference costs by over 95% on both input and output tokens, with no upfront commitments under on-demand pricing. Note: LLM-as-judge evaluation is non-deterministic. Scores may vary slightly across runs.
Figure 3: Model performance comparison (Distilled Nova Micro vs. Claude 4.5 Haiku)
The following is a table summary of the side-by-side results:
| Metric | Distilled Nova Micro | Claude 4.5 Haiku |
| --- | --- | --- |
| LLM-as-judge score | 4.0 / 5 | 4.0 / 5 |
| Mean latency | 833 ms | 1,741 ms |
| Input token cost | $0.035 / 1M | $0.80–$1.00 / 1M |
| Output token cost | $0.14 / 1M | $4.00–$5.00 / 1M |
| Output format | Consistent JSON | Inconsistent |
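To put the per-token rates in workload terms, the following is a rough sketch. We assume ~500 input and ~150 output tokens per routing call (an illustrative assumption, not a measured value), with per-1M-token rates of $0.035/$0.14 for the distilled Nova Micro (equivalent to the per-1K figures quoted earlier) and $1.00/$5.00 for Claude Haiku (top of the quoted range).

```python
# Per-1M-token rates (USD); Haiku taken at the top of its quoted range
DISTILLED = {"input": 0.035, "output": 0.14}
HAIKU = {"input": 1.00, "output": 5.00}

def cost_per_million_queries(rates, input_tokens=500, output_tokens=150):
    """Cost of 1M routing calls: N tokens/query over 1M queries equals
    N million tokens, priced at the per-1M-token rate."""
    return input_tokens * rates["input"] + output_tokens * rates["output"]

distilled_cost = cost_per_million_queries(DISTILLED)  # $38.50 per 1M queries
haiku_cost = cost_per_million_queries(HAIKU)          # $1,250 per 1M queries
savings = 1 - distilled_cost / haiku_cost             # ~97% reduction
```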
Clean up
To avoid ongoing charges, run the cleanup section of the notebook to remove any provisioned resources, including deployed model endpoints and any data stored in Amazon S3.
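As a sketch of what that cleanup involves for the resources created in this post, the following uses the Bedrock DeleteCustomModelDeployment and DeleteCustomModel APIs (the client is passed in explicitly, and S3 objects are removed separately through your usual S3 tooling):

```python
def cleanup_distillation_resources(bedrock_client, deployment_arn, custom_model_name):
    """Remove the on-demand deployment first, then the custom model itself.

    Order matters: a custom model cannot be deleted while a deployment
    still references it.
    """
    bedrock_client.delete_custom_model_deployment(
        customModelDeploymentIdentifier=deployment_arn)
    bedrock_client.delete_custom_model(modelIdentifier=custom_model_name)
```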
Conclusion
This post is the second part of a two-part series. Building on Part 1, this post focuses on applying model distillation to optimize the intent routing layer built in the video semantic search solution. The techniques discussed help address real production tradeoffs, such as balancing routing intelligence with latency and cost at scale while maintaining search accuracy. By distilling Amazon Nova Premier's routing behavior into Amazon Nova Micro using Amazon Bedrock Model Distillation, we reduced inference cost by over 95% and cut preprocessing latency in half while preserving the nuanced routing quality that the task demands. If you are running multimodal video search at scale, model distillation is a practical path to production-grade cost efficiency without sacrificing search accuracy. To explore the full implementation, visit the GitHub repository and try out the solution yourself.
About the authors
Amit Kalawat
Amit Kalawat is a Principal Solutions Architect at Amazon Web Services based out of New York. He works with enterprise customers as they transform their business and journey to the cloud.
James Wu
James Wu is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping enterprises design and execute AI transformation strategies. Specializing in generative AI, agentic systems, and media supply chain automation, he is a featured conference speaker and technical author. Prior to AWS, he was an architect, developer, and technology leader for over 10 years, with experience spanning the engineering and marketing industries.
Bimal Gajjar
Bimal Gajjar is a Senior Solutions Architect at AWS, where he partners with Global Accounts to design, adopt, and deploy scalable cloud storage and data solutions. With over 25 years of experience working with leading OEMs, including HPE, Dell EMC, and Pure Storage, Bimal combines deep technical expertise with strategic business insight, drawn from end-to-end involvement in pre-sales architecture and global service delivery.

