Today, we're excited to announce the day-zero availability of NVIDIA Nemotron 3 Nano Omni on Amazon SageMaker JumpStart. This multimodal model from NVIDIA combines video, audio, image, and text understanding into a single, efficient architecture, enabling enterprise customers to build intelligent applications that can see, hear, and reason across modalities in a single inference pass.
In this post, we walk through the model architecture and key capabilities of Nemotron 3 Nano Omni, explore the enterprise use cases it unlocks, and show you how to deploy and run inference using Amazon SageMaker JumpStart.
Overview of NVIDIA Nemotron 3 Nano Omni
NVIDIA Nemotron 3 Nano Omni is an open, multimodal large language model with 30 billion total parameters and 3 billion active parameters (30B A3B). It's built on a Mamba2 Transformer hybrid Mixture of Experts (MoE) architecture, combining three core components:
- Nemotron 3 Nano LLM as the language backbone
- CRADIO v4-H as the vision encoder for image and video understanding
- Parakeet as the speech encoder for audio transcription and comprehension
This unified architecture processes video, audio, images, and text as input and generates text as output. It supports a 131K token context length, chain-of-thought reasoning, tool calling, JSON output, and word-level timestamps for transcription tasks. The model is available in FP8 precision on SageMaker JumpStart, delivering an optimal balance of accuracy and efficiency for enterprise workloads. It's licensed under the NVIDIA Open Model Agreement for commercial use.

Enterprise agent workflows are inherently multimodal. Agents must interpret screens, documents, audio, video, and text, often within the same reasoning loop. Today, most agentic systems stitch together separate models for vision, speech, and language. This approach increases latency through repeated inference passes, complicates orchestration and error handling, fragments context across modalities, and amplifies cost and failure modes over time.
Nemotron 3 Nano Omni solves this by functioning as the multimodal perception and context sub-agent in a system of agents. It gives the agent system eyes and ears: reading screens, interpreting documents, transcribing speech, and analyzing video, all while maintaining a converged multimodal context across reasoning loops.

Nano Omni understands screens, documents, audio, and video in a single reasoning loop. This replaces fragmented model stacks and significantly simplifies agent workflow design. For anyone building agentic architectures, it collapses inference hops, orchestration logic, and cross-model synchronization overhead into a single model call.

The model accepts the following input types (a sketch of a combined multimodal request follows the table):
| Input Type | Supported Formats | Constraints |
|------------|-------------------|-------------|
| Video | mp4 | Up to 2 minutes, up to 256 frames |
| Audio | wav, mp3 | Up to 1 hour, 8 kHz+ sampling rate |
| Image | JPEG, PNG (RGB) | Standard resolution |
| Text | String | Up to 131K token context |
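Because the model maintains a single multimodal context, one request can carry several modalities at once. The following is a minimal sketch of a combined request payload mixing an image and an audio clip in one message. It assumes the endpoint accepts multiple content parts per message using the same image_url and audio_url part types shown in the inference examples later in this post, and that screenshot_b64 and voicemail_b64 are base64-encoded files produced with helpers like the ones shown below:

payload = {
    "messages": [{
        "role": "user",
        "content": [
            # Hypothetical inputs: a dashboard screenshot plus a voicemail clip
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{voicemail_b64}"}},
            {"type": "text",
             "text": "Does the voicemail describe the incident shown on this dashboard?"},
        ],
    }],
    "max_tokens": 1024,
    "temperature": 0.2,
}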
Enterprise use cases
The multimodal capabilities of Nemotron 3 Nano Omni make it a powerful, versatile model choice for enterprise use cases.
Computer use agents
Nemotron 3 Nano Omni powers the perception loop for agents navigating graphical user interfaces. It reads screens, understands UI state over time, and validates outcomes, while execution agents handle the actions. This collapses vision and reasoning into a single loop, eliminating the need for split perception pipelines. Practical applications include incident management dashboards, agentic search, browser automation, and email workflow agents.
Document intelligence
The model interprets documents, charts, tables, screenshots, and mixed-media inputs, enabling agents to reason coherently across visual structure and text content. This is crucial for enterprise analysis and compliance workflows involving contracts, statements of work, financial documents, and scientific literature.
Audio and video understanding agents
For customer service, research, and monitoring workflows, Nemotron 3 Nano Omni maintains continuous audio and video context. It ties together what was said, shown, and documented into a single reasoning stream instead of disconnected summaries. This enables applications such as meeting recording analysis, media and entertainment asset management, drive-thru order verification, and customer service video review (for example, verifying package delivery at a given address through OCR).
Getting started with SageMaker JumpStart
You can deploy Nemotron 3 Nano Omni through Amazon SageMaker JumpStart in a few steps. SageMaker JumpStart provides one-click deployment of foundation models with optimized inference containers, removing the need to manage infrastructure, configure serving frameworks, or handle model artifact downloads.
Prerequisites
Before you begin, make sure you have:
- An AWS account with access to Amazon SageMaker
- An IAM role with permissions to create and invoke SageMaker endpoints
- Sufficient service quota for the GPU instance type you plan to use
Deploy using SageMaker Studio
- Open Amazon SageMaker Studio
- In the left navigation pane, choose JumpStart
- Search for Nemotron 3 Nano Omni
- Select the model card and choose Deploy
- Configure your instance type and deployment settings
- Choose Deploy to create the endpoint
Deploy using the SageMaker Python SDK
You can also deploy programmatically using the SageMaker Python SDK:

from sagemaker.jumpstart.model import JumpStartModel

# Supply your SageMaker execution role ARN here
model = JumpStartModel(
    model_id="huggingface-vlm-nvidia-nemotron3-nano-omni-30ba3b-reasoning-fp8",
    role="",
)

predictor = model.deploy(
    accept_eula=True,
)
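By default, JumpStart selects a supported instance type for the model. If you need to control the instance type or endpoint name, deploy accepts both as arguments. The values below are assumptions for illustration; check the model card in JumpStart for the instance types the model actually supports:

predictor = model.deploy(
    accept_eula=True,
    instance_type="ml.g6e.12xlarge",  # assumed example; verify supported types first
    endpoint_name="nemotron3-nano-omni",  # hypothetical endpoint name
)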
Run inference: Image understanding
Once the model is deployed, you can send multimodal requests to the endpoint. The following example shows how to send an image understanding request:

import base64

def encode_image(image_path):
    # Read the image file and return it as a base64-encoded string
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("example.jpg")

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 1024,
    "temperature": 0.2,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
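If you need to call the endpoint from outside the SageMaker Python SDK (for example, from an application backend), you can use the low-level SageMaker runtime client instead. The following sketch reuses the payload above and reads the endpoint name from the predictor:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

result = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # or the endpoint name from the console
    ContentType="application/json",
    Body=json.dumps(payload),
)
response = json.loads(result["Body"].read())
print(response["choices"][0]["message"]["content"])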
Run inference: Video understanding with reasoning
The following example sends a meeting recording for summarization, using the larger token budget and sampling settings recommended for thinking mode (videos are limited to 2 minutes and 256 frames, as noted in the input table above):

import base64

def encode_video(video_path):
    # Read the video file and return it as a base64-encoded string
    with open(video_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

video_b64 = encode_video("meeting_recording.mp4")

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "video_url",
             "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
            {"type": "text",
             "text": "Summarize the key discussion points."},
        ],
    }],
    "max_tokens": 20480,
    "temperature": 0.6,
    "top_p": 0.95,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
Run inference: Audio transcription
The following example transcribes a recorded customer call and extracts action items, using the instruct-mode settings recommended for ASR:

import base64

def encode_audio(audio_path):
    # Read the audio file and return it as a base64-encoded string
    with open(audio_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

audio_b64 = encode_audio("customer_call.wav")

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
            {"type": "text",
             "text": "Transcribe this audio and identify key action items."},
        ],
    }],
    "max_tokens": 1024,
    "temperature": 0.2,
}

response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])
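The model also supports word-level timestamps for transcription tasks. The exact request mechanism can depend on the serving container, so as a simple, prompt-based sketch you can ask for timestamps directly, reusing the payload above:

# Replace the text part of the audio payload (index 1; the audio part is index 0)
payload["messages"][0]["content"][1] = {
    "type": "text",
    "text": "Transcribe this audio with word-level timestamps.",
}
response = predictor.predict(payload)
print(response["choices"][0]["message"]["content"])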
Recommended inference parameters
The following table contains the recommended hyperparameter values for Omni inference requests. The values change depending on the inference mode.
| Mode | Temperature | top_p | max_tokens | Use Case |
|------|-------------|-------|------------|----------|
| Thinking | 0.6 | 0.95 | 20480 | Complex reasoning |
| Instruct | 0.2 | N/A | 1024 | General tasks, ASR |
For tasks that involve reasoning and complex understanding, we recommend enabling thinking mode. For transcription and straightforward tasks, instruct mode (with thinking disabled) provides faster responses.
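In practice, it can help to keep the two recommended configurations side by side and merge one into each request. How thinking mode is toggled depends on the model's chat template and serving container (some Nemotron releases use a system-prompt switch such as /no_think), so that part of the sketch below is an assumption; check the model card for the exact mechanism:

# Sampling settings from the table above
THINKING = {"temperature": 0.6, "top_p": 0.95, "max_tokens": 20480}
INSTRUCT = {"temperature": 0.2, "max_tokens": 1024}

payload = {
    "messages": [
        {"role": "system", "content": "/no_think"},  # assumed toggle; verify on the model card
        {"role": "user", "content": "Summarize our return policy in one sentence."},
    ],
    **INSTRUCT,  # or **THINKING for complex reasoning
}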
Clean up
To avoid incurring unnecessary charges, delete the SageMaker endpoint and the associated model resource when you are done:

predictor.delete_model()
predictor.delete_endpoint()
Conclusion
NVIDIA Nemotron 3 Nano Omni brings a new level of multimodal intelligence to Amazon SageMaker JumpStart. By unifying video, audio, image, and text understanding into a single efficient model, it simplifies the development of enterprise agentic applications while delivering leading accuracy and up to 9x higher throughput compared to other open omni models.
Whether you're building computer use agents that navigate GUIs, document intelligence pipelines for compliance workflows, or audio and video analysis systems for customer service, Nemotron 3 Nano Omni provides the perception layer your agents need in a single model call.
Get started today by deploying Nemotron 3 Nano Omni from Amazon SageMaker JumpStart. For more information about the model, visit the NVIDIA Nemotron model page on Hugging Face.
About the authors
Dan Ferguson is a Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.
Malav Shastri is a Software Development Engineer at AWS, where he works on the Amazon SageMaker JumpStart and Amazon Bedrock teams. His role focuses on enabling customers to use state-of-the-art open source and proprietary foundation models and traditional machine learning algorithms. Malav holds a Master's degree in Computer Science.
Vivek Gangasani is a Worldwide Lead for Solutions Architecture, SageMaker Inference. He leads solutions architecture, technical go-to-market (GTM), and outbound product strategy for SageMaker Inference. He also helps enterprises and startups deploy and optimize generative AI models and build AI workflows with SageMaker and GPUs. Currently, he is focused on developing strategies and content for optimizing inference performance and use cases such as agentic workflows and RAG. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

