Tomofun, the Taiwan-headquartered pet-tech startup behind the Furbo Pet Camera, is redefining how pet owners interact with their pets remotely. Furbo combines smart cameras with AI to detect behaviors such as barking, running, or unusual activity, and alerts owners in real time. At the core of this capability are computer vision and vision-language models that interpret pet actions from the video streams.
Initially, Furbo's inference workloads were hosted on GPU-based Amazon Elastic Compute Cloud (Amazon EC2) instances. While GPUs provided high throughput, they were also costly because of the always-on inference needed to support real-time pet activity alerts at scale. To reduce costs while maintaining accuracy, Tomofun turned to EC2 Inf2 instances powered by AWS Inferentia2, Amazon's purpose-built AI chip. In this post, we walk through the following sections in detail.
Challenge: Reducing GPU inference cost for real-time vision-language models at scale
Advanced vision-language models like Bootstrapping Language-Image Pre-training (BLIP), detailed in the original paper, were hosted on GPU instances, which proved less cost-effective for always-on, real-time inference workloads at scale. The challenge was twofold: Tomofun needed to keep costs in check for nearly continuous pet behavior monitoring across hundreds of thousands of devices, while also maintaining model fidelity and throughput. Tomofun needed to do this without rewriting large portions of the BLIP code base already optimized for PyTorch.
Solution overview
Before diving into the architecture, the following diagram provides a high-level view of how the system processes pet behavior detection at scale across AWS services.
- Webcam interaction – Furbo's API sits at the center of Tomofun's pet-behavior detection service, orchestrating image streams from customers' pet cameras to inference endpoints in AWS. The diagram shows the architecture of Elastic Load Balancing (ELB) and an Amazon EC2 Auto Scaling group implemented with EC2 Inf2 instances, providing scaling as the inference volume grows in real time. When a camera captures a frame, the data is routed through Amazon CloudFront and an ELB to the first layer of the EC2 Auto Scaling group that hosts the pet-behavior detection API servers. After the API layer processes each request, it forwards the image to a second-layer Auto Scaling group dedicated to running model inference.
- Model inference – After processing, the images are forwarded to a second-layer EC2 Auto Scaling group containing inference instances. Within this group, containers host the BLIP model, which can run on Inferentia2-based EC2 Inf2 instances. The BLIP model components compiled with the Neuron SDK are loaded into containers on Inf2 instances. In the early implementation, Furbo's API routed inference calls only to GPU containers, but it can now also direct requests to Inf2-based containers without changing the upstream API or downstream alert logic. This architecture lets Tomofun direct inference requests to, and switch between, GPU and Inferentia2 backends in real time. This maintains high availability and gives them the flexibility to scale cost-efficient inference while preserving the same API surface for Furbo users.
- Metrics collection – Amazon CloudWatch monitors key operational metrics across the inference fleet, including latency, throughput, and error rates. These signals provide the observability needed to detect performance degradation early and make sure that service-level objectives are met as traffic patterns shift throughout the day.
- Scaling with demand – The ELB dispatches requests to the available instances across the Auto Scaling group, which manages the size of the instance pool based on the incoming request count as the CloudWatch metric. This metric-driven approach works because the throughput benchmarks for each instance type have already been established through stress testing, so scaling decisions can be driven directly by the volume of image requests. The result is an architecture that scales cost-efficient inference capacity in real time, maintaining high availability as demand grows (a minimal sketch of such a scaling policy follows this list).
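A request-count-driven policy of this kind can be expressed as a target-tracking policy on the Auto Scaling group. The following is a minimal sketch assuming an Application Load Balancer and the ALBRequestCountPerTarget predefined metric; the group name, resource label, and target value are placeholders that would be replaced with the values established during stress testing.

import boto3

# Minimal sketch of a request-count-driven target-tracking policy.
# The group name, resource label, and target value below are placeholders.
autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="furbo-inference-inf2-asg",  # hypothetical Auto Scaling group
    PolicyName="scale-on-request-count",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # Format: <alb-arn-suffix>/<target-group-arn-suffix>
            "ResourceLabel": "app/example-alb/0123456789abcdef/targetgroup/example-tg/0123456789abcdef",
        },
        # Requests per instance, derived from the per-instance throughput measured in stress tests
        "TargetValue": 200.0,
    },
)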
Enhancing BLIP on Inferentia2
Before diving into the model details, the following diagram provides a high-level overview of the BLIP architecture and how its core components interact.
Source: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022 https://arxiv.org/pdf/2201.12086
BLIP consists of three components: the Image Encoder, the Text Encoder, and the Text Decoder, as shown in the figure. To run on Inferentia2, models can be broken into components and wrapped to fit the expected input and output shapes. Tomofun applied this method to BLIP, creating lightweight wrappers for each of the three components of the BLIP model so the original architecture remained unchanged. Each component was compiled independently with torch_neuronx and then combined into the inference pipeline, allowing inputs to flow sequentially. This modular approach maintained compatibility with Inferentia2 without altering BLIP's pretrained logic.
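As context for how the compiled pieces can flow sequentially, the following is a simplified, hypothetical sketch of a visual question answering call. The component interfaces, the processor object, and the decoder call are assumptions for illustration only, not Tomofun's pipeline code; in practice the decoder side runs an autoregressive generation loop.

import torch

def answer_question(vision_encoder, text_encoder, text_decoder, processor, image, question):
    # 1. Vision encoder: image -> patch embeddings (e.g., shape [1, 577, 768])
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    image_embeds = vision_encoder(pixel_values)[0]
    image_atts = torch.ones(image_embeds.shape[:-1], dtype=torch.long)

    # 2. Text encoder: encode the question conditioned on the image embeddings
    question_inputs = processor(text=question, return_tensors="pt")
    question_states = text_encoder(
        question_inputs.input_ids,
        question_inputs.attention_mask,
        image_embeds,
        image_atts,
    )[0]

    # 3. Text decoder: generate the answer from the fused question states
    #    (interface assumed; real generation is autoregressive)
    answer_ids = text_decoder(question_states, question_inputs.attention_mask)
    return processor.decode(answer_ids[0], skip_special_tokens=True)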
Original model code
The first step is to isolate the original BLIP Text Encoder so it can be compiled without modifying its internal logic. The TextEncoder class is a thin wrapper around the original submodule (model.text_encoder.model) that standardizes the forward output by returning only the first tensor. This makes the component simple to trace and compile with Neuron while preserving the original architecture.
class TextEncoder(torch.nn.Module):
    """Thin wrapper around the original Text Encoder submodule that returns only the first tensor."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask):
        output = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            return_dict=False,
        )
        return output[0]
During the compilation phase, the original model (model.text_encoder.model) is passed directly into torch_neuronx.trace() and compiled into a Neuron-optimized TorchScript artifact, without modifying the pretrained BLIP logic.
Wrapper code
A wrapper is required because the torch_neuronx.trace() API expects a tuple of tensors as input and output. To avoid rewriting the model, lightweight wrappers act as an adapter layer that reformats inputs and outputs while leaving the original architecture unchanged. This approach minimizes code changes and allows the compiled components to integrate seamlessly into the existing inference pipeline.
class TextEncoderWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = TextEncoder(model)

    @classmethod
    def from_model(cls, model):
        # Attach an already compiled (torch.jit.load'ed) module directly
        wrapper = cls(model)
        wrapper.model = model
        return wrapper

    def forward(self, input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask, return_dict):
        # return_dict is accepted to match the original call signature but is not used
        output = self.model(input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask)
        return (output,)
The wrapper is used only at deployment to load the compiled model and format its I/O so that it fits the existing BLIP pipeline.
- Compile: use the original model (model.text_encoder.model)
- Deploy: use TextEncoderWrapper to run the compiled model
This keeps the original code unchanged while making the compiled model simple to plug into production.
Model compilation for Inferentia2
In the following code snippet, model.text_encoder.model represents the unmodified Text Encoder submodule, which is compiled into a Neuron-optimized TorchScript format.
def trace_model(model, directory, compiler_args=f"--auto-cast-type fp16 --logfile {LOG_DIR}/log-neuron-cc.txt"):
    # LOG_DIR points to the compiler log directory (defined elsewhere)
    if os.path.isfile(directory):
        print(f"Provided path ({directory}) should be a directory, not a file")
        return
    os.makedirs(directory, exist_ok=True)
    os.makedirs(LOG_DIR, exist_ok=True)

    # Skip the trace if the model has already been traced
    if not os.path.isfile(os.path.join(directory, 'text_encoder.pt')):
        print("Tracing text_encoder")
        # Step 1: Provide pseudo input data with the expected shapes and dtypes
        inputs = (
            torch.ones((1, 8), dtype=torch.int64),
            torch.ones((1, 8), dtype=torch.int64),
            torch.ones((1, 577, 768), dtype=torch.float32),
            torch.ones((1, 577), dtype=torch.int64),
        )
        # Step 2: Use torch_neuronx.trace() to compile the model for Inferentia
        encoder = torch_neuronx.trace(model.text_encoder.model,
                                      inputs,
                                      compiler_args=compiler_args)
        # Step 3: Save the compiled model as a TorchScript artifact
        torch.jit.save(encoder, os.path.join(directory, 'text_encoder.pt'))
    else:
        print('Skipping text_encoder.pt')
To compile BLIP components for Inferentia2, Tomofun defined a trace function that automates the conversion of GPU-trained PyTorch models into Inferentia-optimized artifacts. The process begins by preparing pseudo input tensors that represent the expected shapes and data types of the model's inputs, which guide the tracing process. After the inputs are defined, the function calls torch_neuronx.trace() to compile the BLIP submodel for Inferentia execution, producing a Neuron-optimized version of the original code. Finally, the compiled artifact is saved with torch.jit.save, making it ready for deployment on Inf2 instances. This three-step flow (providing pseudo input data, compiling with Neuron, and saving the artifact) means Tomofun can migrate BLIP's Text Decoder and other components without altering the original model code.
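The same tracing pattern extends to the other submodules. As an illustration only, the following sketch adds a second trace for the vision encoder inside the same function; the submodule attribute (model.vision_model) and the 384x384 input resolution are assumptions that would need to match the actual BLIP checkpoint.

    # Hypothetical extension of trace_model(): trace the vision encoder the same way.
    # model.vision_model and the 384x384 resolution are assumptions, not Tomofun's code.
    if not os.path.isfile(os.path.join(directory, 'vision_encoder.pt')):
        print("Tracing vision_encoder")
        # Pseudo input: one RGB image at the resolution expected by the checkpoint
        vision_inputs = (torch.ones((1, 3, 384, 384), dtype=torch.float32),)
        vision_encoder = torch_neuronx.trace(model.vision_model,
                                             vision_inputs,
                                             compiler_args=compiler_args)
        torch.jit.save(vision_encoder, os.path.join(directory, 'vision_encoder.pt'))
    else:
        print('Skipping vision_encoder.pt')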
Model deployment on Inferentia2
In the deployment phase, the compiled submodules are loaded through wrapper classes to assemble the final BLIP inference pipeline. This separation creates a clear workflow where the original model components are used directly for Neuron optimization during compilation, while the wrapper classes handle input and output formatting during inference to ensure compatibility with Inferentia2. The deployment phase code is as follows:
models.text_encoder = TextEncoderWrapper.from_model(
    torch.jit.load(os.path.join(directory, 'text_encoder.pt')))
This design preserved the original BLIP architecture without modification while meeting the Neuron SDK's I/O interface requirements through lightweight wrapper classes. It also enabled a modular, component-level workflow for both compilation and deployment, allowing each BLIP submodule to be compiled and managed independently. As a result, model.text_encoder.model is used during the compilation phase for direct Neuron optimization, while the wrapper classes handle input and output formatting during inference to ensure smooth execution on Inferentia2.
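To make the I/O contract concrete, the following is a small sketch of calling the wrapped, compiled Text Encoder at inference time; the tensor values and variable names are assumptions, with shapes mirroring the compile-time pseudo inputs. The wrapper accepts the same arguments as the original submodule (including return_dict, which it ignores) and returns a tuple.

import torch

# Dummy tensors with the same shapes as the compile-time pseudo inputs (assumed values)
input_ids = torch.ones((1, 8), dtype=torch.int64)
attention_mask = torch.ones((1, 8), dtype=torch.int64)
image_embeds = torch.ones((1, 577, 768), dtype=torch.float32)
image_atts = torch.ones((1, 577), dtype=torch.int64)

# The wrapper returns a one-element tuple, matching the traced module's I/O contract
(question_states,) = models.text_encoder(
    input_ids, attention_mask, image_embeds, image_atts, return_dict=False
)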
Stress testing
To validate performance at scale, Tomofun conducted stress tests simulating real-world Furbo camera workloads. Each video stream triggered behavior detection queries such as "Is the dog barking?", "Is the dog playing?", or "Is the dog chewing furniture?". These tests confirmed that Inf2 instances (one Inferentia2 chip, 32 GB memory) could sustain the required throughput while maintaining low latency. In addition to accuracy, the tests highlighted that the Inf2 deployment could handle simultaneous requests across hundreds of thousands of devices, making it well suited for Furbo's always-on global customer base. Importantly, the comparison baseline was GPU-based instances with an on-demand pricing model, which reflected the cost Tomofun was paying before the migration to Inf2. By migrating from these GPU on-demand deployments to inf2.xlarge instances with Inferentia2, Tomofun achieved an 83% cost reduction without compromising performance.
The chart illustrates how inference latency changes as server and client concurrency increase. The X-axis labels represent combinations of server threads and client threads (#server threads – #client threads), simulating performance under different load conditions. When only a few server threads are available, adding more client threads causes latency to rise quickly. Increasing the number of server threads helps absorb this load and keeps latency lower. At higher concurrency levels, latency increases and gains level off, indicating saturation. This experiment shows that teams should use load testing to identify the right balance between client concurrency and server capacity, and then limit concurrency to that range to achieve the right latency-cost tradeoff in production.
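A sweep of this kind can be approximated with a simple threaded client. The following is a minimal sketch under assumed endpoint, payload, and concurrency values; it shows the shape of the experiment rather than an actual test harness.

import concurrent.futures
import time

import requests

ENDPOINT = "https://api.example.com/v1/pet-behavior"  # placeholder URL

def one_request():
    # Placeholder payload; the real service sends camera frames and behavior questions
    payload = {"image_url": "s3://example-bucket/frame.jpg", "question": "Is the dog barking?"}
    start = time.perf_counter()
    requests.post(ENDPOINT, json=payload, timeout=30)
    return time.perf_counter() - start

# Sweep client concurrency and report mean latency for each level
for client_threads in (1, 2, 4, 8, 16):
    with concurrent.futures.ThreadPoolExecutor(max_workers=client_threads) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(200)))
    print(f"{client_threads} client threads: mean latency {sum(latencies) / len(latencies):.3f}s")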
Conclusion
By migrating BLIP inference to AWS Inferentia-based EC2 Inf2 instances, Tomofun reduced their Furbo application deployment costs by 83%. The transition from GPU to Inferentia2 was seamless, because the migration required only lightweight wrapper classes and left BLIP's core logic untouched. Testing confirmed that using Inferentia2 not only reduced deployment costs but also maintained high throughput for real-time inference at scale. Tomofun plans to migrate more workloads to Inferentia2, because it supports workloads beyond vision-language models, such as audio event detection for barking recognition and potential future integration with large language models to enhance pet-owner interactions. In addition, adoption of AWS Deep Learning Containers (DLCs) is on the roadmap as a next step, using prebuilt, optimized container images to simplify dependency management and streamline inference workflows.
To learn how to implement similar optimizations, explore the AWS Neuron Documentation and examples. You can also visit the Furbo website to explore Furbo's AI-powered features and see how the Furbo ecosystem keeps your pets safe.
About the authors
Chen-Hsin Ding is a Staff Machine Learning Engineer at Tomofun, with over 10 years of software development experience. He leads generative AI projects and works closely with backend teams to design practical AI system architectures, focusing on bringing MLOps best practices into the AI team and delivering production-ready LLM and RAG applications. Outside of work, Chen-Hsin enjoys brewing coffee and listening to movie soundtracks and jazz on his hi-fi speakers.
Ray Wang is a Senior Solutions Architect at AWS. With 15 years of experience in the IT industry, Ray is dedicated to building modern solutions on the cloud, especially in NoSQL, big data, machine learning, and generative AI. As a hungry go-getter, he passed all 12 AWS certifications to make his technical field not only deep but also wide. He likes to read and watch sci-fi movies in his spare time.
Howard Su is a Solutions Architect at AWS. With extensive experience in software development and system operations, he has served in various roles including RD, QA, and SRE. Howard has been responsible for the architectural design of numerous large-scale systems and has led several cloud migrations. Following years of deep technical accumulation, he is now dedicated to advocating for DevOps by leveraging generative AI to build self-healing, "AI-Native" infrastructures, transitioning the SDLC from traditional orchestration to a truly intelligent, predictive ecosystem.

