Today, we're excited to announce the day-zero availability of NVIDIA Nemotron 3 Nano Omni on Amazon SageMaker JumpStart. This multimodal model from NVIDIA combines video, audio, image, and text understanding into a single, efficient architecture, enabling enterprise customers to build intelligent applications that can see, hear, and reason across modalities in a single inference pass.
In this post, we walk through the model architecture and key capabilities of Nemotron 3 Nano Omni, explore the enterprise use cases it unlocks, and show you how to deploy and run inference using Amazon SageMaker JumpStart.
Overview of NVIDIA Nemotron 3 Nano Omni
NVIDIA Nemotron 3 Nano Omni is an open, multimodal large language model with 30 billion total parameters and 3 billion active parameters (30B A3B). It's built on a Mamba2 Transformer hybrid Mixture of Experts (MoE) architecture, combining three core components:
- Nemotron 3 Nano LLM as the language backbone
- CRADIO v4-H as the vision encoder for image and video understanding
- Parakeet as the speech encoder for audio transcription and comprehension
This unified architecture processes video, audio, images, and text as input and generates text as output. It supports a 131K token context length, chain-of-thought reasoning, tool calling, JSON output, and word-level timestamps for transcription tasks. The model is available in FP8 precision on SageMaker JumpStart, delivering an optimal balance of accuracy and efficiency for enterprise workloads. It's licensed under the NVIDIA Open Model Agreement for commercial use.

Enterprise agent workflows are inherently multimodal. Agents must interpret screens, documents, audio, video, and text, often within the same reasoning loop. Today, most agentic systems stitch together separate models for vision, speech, and language. This approach increases latency through repeated inference passes, complicates orchestration and error handling, fragments context across modalities, and amplifies cost and failure modes over time.
Nemotron 3 Nano Omni solves this by functioning as the multimodal perception and context sub-agent in a system of agents. It gives the agent system eyes and ears: reading screens, interpreting documents, transcribing speech, and analyzing video, all while maintaining a converged multimodal context across reasoning loops.

Nano Omni understands screens, documents, audio, and video in a single reasoning loop. This replaces fragmented model stacks and significantly simplifies agent workflow design. For anyone building agentic architectures, it collapses inference hops, orchestration logic, and cross-model synchronization overhead into a single model call.

The model accepts the following input types (a sketch of a combined multimodal request follows the table):
| Input Type | Supported Formats | Constraints |
|------------|-------------------|-------------|
| Video | mp4 | Up to 2 minutes, up to 256 frames |
| Audio | wav, mp3 | Up to 1 hour, 8 kHz+ sampling rate |
| Image | JPEG, PNG (RGB) | Standard resolution |
| Text | String | Up to 131K token context |
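Because the model maintains a single multimodal context, one request can carry several modalities at once. The following is a minimal sketch of a combined request payload mixing an image and an audio clip in one message. It assumes the endpoint accepts multiple content parts per message using the same image_url and audio_url part types shown in the inference examples later in this post, and that screenshot_b64 and voicemail_b64 are base64-encoded files produced with helpers like the ones shown below:

payload = {
    "messages": [{
        "role": "user",
        "content": [
            # Hypothetical inputs: a dashboard screenshot plus a voicemail clip
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{voicemail_b64}"}},
            {"type": "text",
             "text": "Does the voicemail describe the incident shown on this dashboard?"},
        ],
    }],
    "max_tokens": 1024,
    "temperature": 0.2,
}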
Enterprise use cases
The multimodal capabilities of Nemotron 3 Nano Omni make it a powerful, versatile model choice for enterprise use cases.
Computer use agents
Nemotron 3 Nano Omni powers the perception loop for agents navigating graphical user interfaces. It reads screens, understands UI state over time, and validates outcomes, while execution agents handle the actions. This collapses vision and reasoning into a single loop, eliminating the need for split perception pipelines. Practical applications include incident management dashboards, agentic search, browser automation, and email workflow agents.
Document intelligence
The model interprets documents, charts, tables, screenshots, and mixed-media inputs, enabling agents to reason coherently across visual structure and text content. This is crucial for enterprise analysis and compliance workflows involving contracts, statements of work, financial documents, and scientific literature.
Audio and video understanding agents
For customer service, research, and monitoring workflows, Nemotron 3 Nano Omni maintains continuous audio and video context. It ties together what was said, shown, and documented into a single reasoning stream instead of disconnected summaries. This enables applications such as meeting recording analysis, media and entertainment asset management, drive-thru order verification, and customer service video review (for example, verifying package delivery at a given address through OCR).
Getting started with SageMaker JumpStart
You can deploy Nemotron 3 Nano Omni through Amazon SageMaker JumpStart in a few steps. SageMaker JumpStart provides one-click deployment of foundation models with optimized inference containers, removing the need to manage infrastructure, configure serving frameworks, or handle model artifact downloads.
Prerequisites
Before you begin, make sure you have:
- An AWS account with access to Amazon SageMaker
- An IAM role with permissions to create and invoke SageMaker endpoints
- Sufficient service quota for the GPU instance type you plan to use
Deploy using SageMaker Studio
- Open Amazon SageMaker Studio
- In the left navigation pane, choose JumpStart
- Search for Nemotron 3 Nano Omni
- Select the model card and choose Deploy
- Configure your instance type and deployment settings
- Choose Deploy to create the endpoint
Deploy using the SageMaker Python SDK
You can also deploy programmatically using the SageMaker Python SDK:

from sagemaker.jumpstart.model import JumpStartModel

# Supply your SageMaker execution role ARN here
model = JumpStartModel(
    model_id="huggingface-vlm-nvidia-nemotron3-nano-omni-30ba3b-reasoning-fp8",
    role="",
)

predictor = model.deploy(
    accept_eula=True,
)
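By default, JumpStart selects a supported instance type for the model. If you need to control the instance type or endpoint name, deploy accepts both as arguments. The values below are assumptions for illustration; check the model card in JumpStart for the instance types the model actually supports:

predictor = model.deploy(
    accept_eula=True,
    instance_type="ml.g6e.12xlarge",  # assumed example; verify supported types first
    endpoint_name="nemotron3-nano-omni",  # hypothetical endpoint name
)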
Run inference: Image understanding
Once the model is deployed, you can send multimodal requests to the endpoint. The following example shows how to send an image understanding request:

import base64

def encode_image(image_path):
    # Read the image file and return it as a base64-encoded string
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("example.jpg")

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 1024,
    "temperature": 0.2,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
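If you need to call the endpoint from outside the SageMaker Python SDK (for example, from an application backend), you can use the low-level SageMaker runtime client instead. The following sketch reuses the payload above and reads the endpoint name from the predictor:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

result = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # or the endpoint name from the console
    ContentType="application/json",
    Body=json.dumps(payload),
)
response = json.loads(result["Body"].read())
print(response["choices"][0]["message"]["content"])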
Run inference: Video understanding with reasoning
The following example sends a meeting recording for summarization, using the larger token budget and sampling settings recommended for thinking mode (videos are limited to 2 minutes and 256 frames, as noted in the input table above):

import base64

def encode_video(video_path):
    # Read the video file and return it as a base64-encoded string
    with open(video_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

video_b64 = encode_video("meeting_recording.mp4")

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "video_url",
             "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
            {"type": "text",
             "text": "Summarize the key discussion points."},
        ],
    }],
    "max_tokens": 20480,
    "temperature": 0.6,
    "top_p": 0.95,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
Run inference: Audio transcription
The following example transcribes a recorded customer call and extracts action items, using the instruct-mode settings recommended for ASR:

import base64

def encode_audio(audio_path):
    # Read the audio file and return it as a base64-encoded string
    with open(audio_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

audio_b64 = encode_audio("customer_call.wav")

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text",
             "text": "Transcribe this audio and identify key action items."},
        ],
    }],
    "max_tokens": 1024,
    "temperature": 0.2,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
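The model also supports word-level timestamps for transcription tasks. The exact request mechanism can depend on the serving container, so as a simple, prompt-based sketch you can ask for timestamps directly, reusing the payload above:

# Replace the text part of the audio payload (index 1; the audio part is index 0)
payload["messages"][0]["content"][1] = {
    "type": "text",
    "text": "Transcribe this audio with word-level timestamps.",
}
response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])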
Recommended inference parameters
The following table contains the recommended hyperparameter values for Omni inference requests. The values change depending on the inference mode.
| Mode | Temperature | top_p | max_tokens | Use Case |
|------|-------------|-------|------------|----------|
| Thinking | 0.6 | 0.95 | 20480 | Complex reasoning |
| Instruct | 0.2 | N/A | 1024 | General tasks, ASR |
For tasks that involve reasoning and complex understanding, we recommend enabling thinking mode. For transcription and straightforward tasks, instruct mode (with thinking disabled) provides faster responses.
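In practice, it can help to keep the two recommended configurations side by side and merge one into each request. How thinking mode is toggled depends on the model's chat template and serving container (some Nemotron releases use a system-prompt switch such as /no_think), so that part of the sketch below is an assumption; check the model card for the exact mechanism:

# Sampling settings from the table above
THINKING = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 20480}
INSTRUCT = {"temperature": 0.2, "max_tokens": 1024}

payload = {
    "messages": [
        {"role": "system", "content": "/no_think"},  # assumed toggle; verify on the model card
        {"role": "user", "content": "Summarize our return policy in one sentence."},
    ],
    **INSTRUCT,  # or **THINKING for complex reasoning
}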
Clean up
To avoid incurring unnecessary charges, delete the SageMaker endpoint and the associated model resource when you are done:

predictor.delete_model()
predictor.delete_endpoint()
Conclusion
NVIDIA Nemotron 3 Nano Omni brings a new level of multimodal intelligence to Amazon SageMaker JumpStart. By unifying video, audio, image, and text understanding into a single efficient model, it simplifies the development of enterprise agentic applications while delivering leading accuracy and up to 9x higher throughput compared to other open omni models.
Whether you're building computer use agents that navigate GUIs, document intelligence pipelines for compliance workflows, or audio and video analysis systems for customer service, Nemotron 3 Nano Omni provides the perception layer your agents need in a single model call.
Get started today by deploying Nemotron 3 Nano Omni from Amazon SageMaker JumpStart. For more information about the model, visit the NVIDIA Nemotron model page on Hugging Face.
About the authors
Dan Ferguson is a Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.
Malav Shastri is a Software Development Engineer at AWS, where he works on the Amazon SageMaker JumpStart and Amazon Bedrock teams. His role focuses on enabling customers to use state-of-the-art open source and proprietary foundation models and traditional machine learning algorithms. Malav holds a Master's degree in Computer Science.
Vivek Gangasani is a Worldwide Lead for Solutions Architecture, SageMaker Inference. He leads solutions architecture, technical go-to-market (GTM), and outbound product strategy for SageMaker Inference. He also helps enterprises and startups deploy and optimize generative AI models and build AI workflows with SageMaker and GPUs. Currently, he is focused on developing strategies and content for optimizing inference performance and use cases such as agentic workflows and RAG. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

