Many organizations are archiving large media libraries, analyzing contact center recordings, preparing training data for AI, or processing on-demand video for subtitles. When data volumes grow significantly, managed automatic speech recognition (ASR) service costs can quickly become the primary constraint on scalability.
To address this cost-scalability challenge, we use the NVIDIA Parakeet-TDT-0.6B-v3 model, deployed via AWS Batch on GPU-accelerated instances. Parakeet-TDT's Token-and-Duration Transducer architecture simultaneously predicts text tokens and their duration to intelligently skip silence and redundant processing. This helps achieve inference speeds orders of magnitude faster than real time. By paying only for brief bursts of compute rather than the full length of your audio, you can transcribe at scale for fractions of a cent per hour of audio, based on the benchmarks described in this post.
In this post, we walk through building a scalable, event-driven transcription pipeline that automatically processes audio files uploaded to Amazon Simple Storage Service (Amazon S3), and show you how to use Amazon EC2 Spot Instances and buffered streaming inference to further reduce costs.
Model capabilities
Parakeet-TDT-0.6B-v3, released in August 2025, is an open-source multilingual ASR model that delivers high accuracy across 25 European languages with automatic language detection and flexible licensing under CC-BY-4.0. According to NVIDIA's published metrics, the model maintains a 6.34% word error rate (WER) in clean conditions and 11.66% WER at 0 dB SNR, and supports audio up to three hours using local attention mode.
The 25 supported languages include Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, and Ukrainian. This can help alleviate the need for separate models or language-specific configuration when serving international European markets.
For deployment on AWS, the model requires GPU-enabled instances with a minimum of 4 GB VRAM, though 8 GB provides better performance. G6 instances (NVIDIA L4 GPUs) provide the best cost-to-performance ratio for inference workloads based on our tests. The model also performs well on G5 (A10G), G4dn (T4), and, for maximum throughput, P5 (H100) or P4 (A100) instances.
Solution architecture
The process begins when you upload an audio file to an S3 bucket. This triggers an Amazon EventBridge rule that submits a job to AWS Batch. AWS Batch provisions GPU-accelerated compute resources, and the provisioned instances pull our container image with a pre-cached model from Amazon Elastic Container Registry (Amazon ECR). The inference script downloads and processes the file, then uploads the timestamped JSON transcript to an output S3 bucket. The architecture scales to zero when idle, so costs are incurred only during active compute.
For a deep dive into the general architectural components, refer to our earlier post, Whisper audio transcription powered by AWS Batch and AWS Inferentia.
Figure 1. Event-driven audio transcription pipeline with Amazon EventBridge and AWS Batch
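To make the event-to-job handoff concrete, the following sketch shows the mapping the EventBridge rule target performs: an S3 "Object Created" event becomes an AWS Batch SubmitJob call whose container receives the bucket and key. The job queue and job definition names here are illustrative placeholders, not the names created by the CloudFormation template.

```python
# Illustrative sketch of the EventBridge-to-Batch mapping. In the deployed
# solution, EventBridge submits the job directly; this pure function shows
# the equivalent SubmitJob parameters. Queue/definition names are placeholders.

def batch_job_request(event: dict,
                      job_queue: str = "transcription-queue",
                      job_definition: str = "parakeet-transcribe") -> dict:
    """Build SubmitJob parameters from an S3-via-EventBridge event."""
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]
    return {
        # Job names may only contain letters, numbers, hyphens, and underscores
        "jobName": "transcribe-" + key.replace("/", "-").replace(".", "-"),
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "INPUT_KEY", "value": key},
            ]
        },
    }

# Equivalent API call, if you were submitting from your own code:
# boto3.client("batch").submit_job(**batch_job_request(event))
```

Passing the object location through container environment variables keeps the job definition static while letting each invocation process a different file.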
Prerequisites
- Create an AWS account if you don't already have one and sign in. Create a user using AWS IAM Identity Center with full administrator permissions as described in Add users.
- Install the AWS Command Line Interface (AWS CLI) on your local development machine and create a profile for the admin user as described in Set up the AWS CLI.
- Install Docker on your local machine.
- Clone the GitHub repository to your local machine.
Building the container image
The repository includes a Dockerfile that builds a streamlined container image optimized for inference performance. The image uses Amazon Linux 2023 as a base, installs Python 3.12, and pre-caches the Parakeet-TDT-0.6B-v3 model during the build to avoid download latency at runtime:
FROM public.ecr.aws/amazonlinux/amazonlinux:2023
WORKDIR /app
# Install system dependencies, Python 3.12, and ffmpeg
RUN dnf update -y && \
    dnf install -y gcc-c++ python3.12-devel tar xz && \
    ln -sf /usr/bin/python3.12 /usr/local/bin/python3 && \
    python3 -m ensurepip && \
    python3 -m pip install --no-cache-dir --upgrade pip && \
    dnf clean all && rm -rf /var/cache/dnf
# Install Python dependencies and pre-cache the model
COPY ./requirements.txt requirements.txt
RUN pip install -U --no-cache-dir -r requirements.txt && \
    rm -rf ~/.cache/pip /tmp/pip* && \
    python3 -m compileall -q /usr/local/lib/python3.12/site-packages
COPY ./parakeet_transcribe.py parakeet_transcribe.py
# Cache model during build to eliminate runtime download
RUN python3 -c "from nemo.collections.asr.models import ASRModel; \
    ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v3')"
CMD ["python3", "parakeet_transcribe.py"]
Pushing to Amazon ECR
The repository includes an updateImage.sh script that handles environment detection (CodeBuild or EC2), builds the container image, creates an ECR repository if needed, enables vulnerability scanning, and pushes the image. Run it with:
./updateImage.sh
Deploying the solution
The solution uses an AWS CloudFormation template (deployment.yaml) to provision the infrastructure. The buildArch.sh script automates the deployment by detecting your AWS Region, collecting VPC, subnet, and security group information, and deploying the CloudFormation stack:
./buildArch.sh
Under the hood, this runs:
aws cloudformation deploy --stack-name batch-gpu-audio-transcription \
    --template-file ./deployment.yaml \
    --capabilities CAPABILITY_IAM \
    --region ${AWS_REGION} \
    --parameter-overrides VPCId=${VPC_ID} SubnetIds="${SUBNET_IDS}" \
        SGIds="${SecurityGroup_IDS}" RTIds="${RouteTable_IDS}"
The CloudFormation template creates the AWS Batch compute environment with G6 and G5 GPU instances, a job queue, a job definition referencing your ECR image, and input and output S3 buckets with EventBridge notifications enabled. It also creates an EventBridge rule that triggers a Batch job on S3 upload, an Amazon CloudWatch agent configuration for GPU/CPU/memory monitoring, and IAM roles with least-privilege policies. AWS Batch allows selection of Amazon Linux 2023 GPU images by specifying ImageType: ECS_AL2023_NVIDIA in the compute environment configuration.
Alternatively, you can deploy directly from the AWS CloudFormation console using the launch link provided in the repository README.
Configuring Spot Instances
Amazon EC2 Spot Instances can help further reduce costs by running your workloads on unused EC2 capacity at a discount of up to 90%, depending on your instance type. To enable Spot Instances, we modify the compute environment in deployment.yaml:
DefaultComputeEnv:
  Type: AWS::Batch::ComputeEnvironment
  Properties:
    Type: MANAGED
    State: ENABLED
    ComputeResources:
      AllocationStrategy: SPOT_PRICE_CAPACITY_OPTIMIZED
      Type: SPOT
      BidPercentage: 100
      InstanceTypes:
        - "g6.xlarge"
        - "g6.2xlarge"
        - "g5.xlarge"
      MinvCpus: !Ref DefaultCEMinvCpus
      MaxvCpus: !Ref DefaultCEMaxvCpus
      # ... remaining configuration unchanged
You can enable this by setting --parameter-overrides UseSpotInstances=Yes when running aws cloudformation deploy. The SPOT_PRICE_CAPACITY_OPTIMIZED allocation strategy selects Spot Instance pools that are both the least likely to be interrupted and have the lowest possible price. Diversifying instance types (g6.xlarge, g6.2xlarge, g5.xlarge) can improve Spot availability. Setting MinvCpus: 0 makes sure the environment scales to zero when idle, so you avoid incurring costs between workloads. Since ASR jobs are stateless and idempotent, they're well suited for Spot. If an instance is reclaimed, AWS Batch automatically retries the job (configured with up to 2 retry attempts in the job definition).
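The retry behavior can be expressed in the job definition's retry strategy. The following YAML is a sketch following the AWS::Batch::JobDefinition schema, not the exact contents of deployment.yaml: it retries when the container's status reason indicates the host instance was reclaimed, and gives up on ordinary application failures.

```yaml
# Sketch of a Spot-aware retry strategy for an AWS::Batch::JobDefinition.
# The exact values in deployment.yaml may differ.
RetryStrategy:
  Attempts: 2
  EvaluateOnExit:
    - OnStatusReason: "Host EC2*"   # Spot reclamation / host termination
      Action: RETRY
    - OnReason: "*"                 # anything else: fail fast, don't retry
      Action: EXIT
```

Matching on the "Host EC2*" status reason distinguishes infrastructure interruptions (worth retrying) from deterministic failures such as a corrupt input file (not worth retrying).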
Managing memory for long audio
The Parakeet-TDT model's memory consumption scales linearly with audio duration. The Fast Conformer encoder must generate and store feature representations for the entire audio signal, creating a direct dependency where doubling audio length roughly doubles VRAM usage. According to the model card, with full attention the model can process up to 24 minutes of audio given 80 GB of VRAM.
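The linear relationship gives a rough rule of thumb. The sketch below anchors it to the model-card data point (about 24 minutes per 80 GB with full attention); it is a back-of-envelope illustration only, since real usage also depends on precision, batch size, and framework overhead.

```python
# Back-of-envelope estimate of full-attention audio capacity from VRAM,
# assuming the linear scaling described above and anchoring to the
# model-card figure (~24 minutes per 80 GB). Illustrative only.

MINUTES_PER_GB = 24 / 80  # 0.3 minutes of audio per GB of VRAM

def max_full_attention_minutes(vram_gb: float) -> float:
    return MINUTES_PER_GB * vram_gb

# Under this assumption, the 24 GB L4 in a g6.xlarge would handle only
# around 7 minutes with full attention, which is why local attention or
# buffered streaming is needed for long recordings on standard GPUs.
```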
NVIDIA addresses this with a local attention mode that supports up to 3 hours of audio on an 80 GB A100:
# Enable local attention for long audio
asr_model.change_attention_model("rel_pos_local_attn", [128, 128])
asr_model.change_subsampling_conv_chunking_factor(1)  # auto select
asr_model.transcribe(["input_audio.wav"])
This can come with a slight accuracy hit, so we recommend testing on your use case.
Buffered streaming inference
For audio that exceeds 3 hours, or to process long audio cost-effectively on standard hardware like a g6.xlarge, we use buffered streaming inference. Adapted from NVIDIA NeMo's streaming inference example, this approach processes audio in overlapping chunks rather than loading the entire context into memory.
We configure 20-second chunks with 5-second left context and 3-second right context to maintain transcription quality at chunk boundaries (note that accuracy may degrade when changing these parameters, so experiment to find the optimal configuration; decreasing chunk_secs increases processing time):
# Streaming inference loop
while left_sample < audio_batch.shape[1]:
    # Add samples to the buffer
    chunk_length = min(right_sample, audio_batch.shape[1]) - left_sample
    # [Logic to manage buffer and flags omitted for brevity]
    buffer.add_audio_batch_(...)
    # Encode using the full buffer [left-chunk-right]
    encoder_output, encoder_output_len = asr_model(
        input_signal=buffer.samples,
        input_signal_length=buffer.context_size_batch.total(),
    )
    # Decode only the chunk frames (constant memory usage)
    chunk_batched_hyps, _, state = decoding_computer(...)
    # Advance the sliding window
    left_sample = right_sample
    right_sample = min(right_sample + context_samples.chunk, audio_batch.shape[1])
Processing audio at fixed chunk sizes decouples VRAM usage from total audio length, allowing a single g6.xlarge instance to process a 10-hour file with the same memory footprint as a 10-minute one.
Figure 2. Buffered streaming inference processes audio in overlapping chunks with constant memory usage.
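The sliding-window bookkeeping can be sketched as a pure function. This is a simplified illustration of the chunking arithmetic, not the NeMo implementation; the 16 kHz sample rate and the 20 s / 5 s / 3 s context sizes mirror the configuration described above.

```python
# Simplified sketch of buffered-streaming window arithmetic: each step
# encodes [left context | chunk | right context] but decodes only the
# chunk's frames. Not the NeMo implementation.

SAMPLE_RATE = 16_000  # Hz, the model's expected input rate

def chunk_windows(total_samples: int,
                  chunk_secs: float = 20.0,
                  left_secs: float = 5.0,
                  right_secs: float = 3.0):
    """Yield (buffer_start, chunk_start, chunk_end, buffer_end) sample indices."""
    chunk = int(chunk_secs * SAMPLE_RATE)
    left = int(left_secs * SAMPLE_RATE)
    right = int(right_secs * SAMPLE_RATE)
    start = 0
    while start < total_samples:
        end = min(start + chunk, total_samples)
        yield (max(0, start - left),                 # buffer start (with left context)
               start, end,                           # the chunk actually decoded
               min(total_samples, end + right))      # buffer end (with right context)
        start = end
```

Each buffer spans at most 28 seconds of audio regardless of file length, which is what keeps VRAM usage constant.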
To deploy with buffered streaming enabled, set the EnableStreaming=Yes parameter:
aws cloudformation deploy \
    --stack-name batch-gpu-audio-transcription \
    --template-file ./deployment.yaml \
    --capabilities CAPABILITY_IAM \
    --parameter-overrides EnableStreaming=Yes \
        VPCId=your-vpc-id SubnetIds=your-subnet-ids SGIds=your-sg-ids RTIds=your-rt-ids
Testing and monitoring
To validate the solution at scale, we ran an experiment with 1,000 identical 50-minute audio files from a NASA preflight crew news conference, distributed across 100 g6.xlarge instances processing 10 files each.
Figure 3. Batch jobs running concurrently on 100 g6.xlarge instances.
The deployment includes an Amazon CloudWatch agent configuration that collects GPU utilization, power draw, VRAM usage, CPU utilization, memory consumption, and disk usage at 10-second intervals. These metrics appear under the CWAgent namespace, enabling you to build dashboards for real-time monitoring.
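As a starting point for a dashboard or ad hoc query, the sketch below builds the parameters for a CloudWatch statistics call against the agent's GPU utilization metric. The metric name follows the agent's nvidia_smi_* naming convention, but verify the exact names and dimensions emitted by your agent configuration before relying on them.

```python
# Sketch of a CloudWatch query for agent-collected GPU utilization.
# The metric name and dimensions are assumptions based on the CloudWatch
# agent's nvidia_smi_* convention; check what your deployment emits.
from datetime import datetime, timedelta, timezone

def gpu_utilization_query(instance_id: str, minutes: int = 60) -> dict:
    """Build get_metric_statistics parameters for average GPU utilization."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "CWAgent",
        "MetricName": "nvidia_smi_utilization_gpu",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,                # one datapoint per minute
        "Statistics": ["Average"],
    }

# Usage (requires credentials and a running instance):
# boto3.client("cloudwatch").get_metric_statistics(**gpu_utilization_query("i-0123456789abcdef0"))
```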
Performance and cost analysis
To validate the efficiency of the architecture, we benchmarked the system using several long-form audio files.
The Parakeet-TDT-0.6B-v3 model achieved a raw inference speed of 0.24 seconds per minute of audio. However, a complete pipeline also includes overhead for loading the model into memory, loading audio, preprocessing the input, and post-processing the output. Because of this overhead, the best cost efficiency is achieved with long-form audio, which maximizes the fraction of job time spent on actual processing.
Benchmark results (g6.xlarge):
- Audio Duration: 3 hours 25 minutes (205 minutes)
- Total Job Duration: 100 sec
- Effective Processing Speed: 0.49 seconds per minute of audio
Cost breakdown
Based on pricing in the us-east-1 Region for the g6.xlarge instance, we can estimate the cost per minute of audio processing.
| Pricing Model | Hourly Cost (g6.xlarge)* | Cost per Minute of Audio |
| --- | --- | --- |
| On-Demand | ~$0.805 | **$0.00011** |
| Spot Instances | ~$0.374 | **$0.00005** |

*Prices are estimates based on us-east-1 rates at the time of writing. Spot prices vary by Availability Zone and are subject to change.
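The table's figures can be reproduced directly from the benchmark numbers. The sketch below does the arithmetic: a 205-minute file processed in a 100-second job, at the hourly rates quoted above.

```python
# Reproducing the benchmark figures from the post's inputs:
# 205 minutes of audio processed in a 100-second job on g6.xlarge,
# at ~$0.805/hour On-Demand and ~$0.374/hour Spot.

AUDIO_MINUTES = 205
JOB_SECONDS = 100

def seconds_per_audio_minute() -> float:
    """Effective processing speed: job wall-clock per minute of audio."""
    return JOB_SECONDS / AUDIO_MINUTES

def cost_per_audio_minute(hourly_rate: float) -> float:
    """Instance cost for the job, amortized per minute of audio."""
    job_cost = hourly_rate * (JOB_SECONDS / 3600)
    return job_cost / AUDIO_MINUTES
```

This yields the 0.49 seconds per minute of audio reported above, and roughly $0.00011 (On-Demand) and $0.00005 (Spot) per audio minute.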
This comparison highlights the economic advantage of the self-hosted approach for high-volume workloads, delivering value for large-scale transcription compared to managed API services.
Cleanup
To avoid incurring future charges, delete the resources created by this solution:
- Empty all S3 buckets (input, output, and logs).
- Delete the CloudFormation stack:
aws cloudformation delete-stack --stack-name batch-gpu-audio-transcription
- Optionally, remove the ECR repository and container images.
For detailed cleanup instructions, refer to the cleanup section of the repository README.
Conclusion
In this post, we demonstrated how to build an audio transcription pipeline that processes audio at scale for fractions of a cent per hour. By combining NVIDIA's Parakeet-TDT-0.6B-v3 model with AWS Batch and EC2 Spot Instances, you can transcribe across 25 European languages with automatic language detection and help reduce costs compared to other solutions. The buffered streaming inference technique extends this capability to audio of any length on standard hardware, and the event-driven architecture scales automatically from zero to handle variable workloads.
To get started, explore the sample code in the GitHub repository.
About the authors
Gleb Geinke
Gleb Geinke is a Deep Learning Architect at the AWS Generative AI Innovation Center. Gleb collaborates directly with enterprise customers to design and scale transformational generative AI solutions for complex business challenges.
Justin Leto
Justin Leto is a Global Principal Solutions Architect with the Private Equity team at AWS. Justin is the author of Data Engineering with Generative and Agentic AI on AWS, published by Apress.
Yusong Wang
Yusong Wang is a Principal High-Performance Computing (HPC) Specialist Solutions Architect at AWS with over 20 years of experience spanning national research institutes and large financial enterprises.
Brian Maguire
Brian Maguire is a Principal Solutions Architect at Amazon Web Services, focused on helping customers build their ideas in the cloud. Brian is the co-author of Scalable Data Streaming with Amazon Kinesis.

