This post is co-written with Hammad Mian and Joonas Kukkonen from Bark.com.
When scaling video content creation, many companies face the challenge of maintaining quality while reducing production time. This post demonstrates how Bark.com and AWS collaborated to solve this problem, showing you a replicable approach for AI-powered content generation. Bark.com used Amazon SageMaker and Amazon Bedrock to transform their marketing content pipeline from weeks to hours.
Bark connects thousands of people every week with professional services, from landscaping to domiciliary care, across a number of categories. When Bark's marketing team identified an opportunity to expand into mid-funnel social media advertising, they faced a scaling problem: effective social campaigns require high volumes of personalized creative content for rapid A/B testing, but their manual production workflow took weeks per campaign and couldn't support multiple customer segment variations.
If you're facing similar content scaling challenges, this architecture pattern can be a useful starting point. Working with the AWS Generative AI Innovation Center, Bark developed an AI-powered content generation solution that demonstrated a substantial reduction in production time in experimental trials while improving content quality scores. The collaboration targeted four objectives:
- Production time – Reduce from weeks to hours
- Personalization scale – Support multiple customer micro-segments per campaign
- Brand consistency – Maintain voice and visual identity across generated content
- Quality standards – Match professionally produced advertisements
In this post, we walk you through the technical architecture we built, the key design decisions that contributed to success, and the measurable results achieved, giving you a blueprint for implementing similar solutions.
Solution overview
Bark collaborated with the AWS Generative AI Innovation Center to develop a solution that could address these content scaling challenges. The team designed a system using AWS services and tailored AI models. The following diagram illustrates the solution architecture.
The solution architecture consists of the following integrated layers:
- Data and storage layer – Amazon Simple Storage Service (Amazon S3) stores assets including training data, generated video segments, reference images, and final outputs. Model artifacts and custom inference containers are stored in Amazon Elastic Container Registry (Amazon ECR).
- Processing layer – AWS Lambda orchestrates the multi-stage pipeline, with AWS Step Functions managing the workflow state across the seven-step generation process. Amazon Bedrock with Anthropic's Claude 3.7 Sonnet handles text generation tasks, including customer segmentation, story generation, and quality evaluation.
- GPU compute layer – To serve Wan 2.1 Text2Video-14B reliably, we run a multi-GPU inference container that shards the model across eight GPUs on a single p4de.24xlarge SageMaker instance using tensor parallelism. TorchServe fronts the endpoint for request handling, and torchrun launches one worker process per GPU. We use Fully Sharded Data Parallel (FSDP) sharding, a technique for splitting model components across GPUs, for the text encoder and the diffusion transformer to stay within GPU memory limits without offloading weights to CPU. Because video diffusion is long-running, the endpoint is tuned with an extended inference timeout and a longer container startup health check window to accommodate model load time and help avoid premature restarts. Amazon Elastic Container Service (Amazon ECS) containers on GPU-enabled g5.2xlarge instances handle speech synthesis for narrator voice generation, scaling to zero during idle periods.
- User interface layer – A React frontend with Amazon Cognito authentication provides a video studio interface where marketing teams can review, edit, and approve generated content through natural language commands.
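The GPU compute layer's endpoint tuning can be sketched as follows. This is a minimal illustration, not Bark's production configuration: the variant name, timeout values, and config name are placeholder assumptions, and only the long startup health check and model download windows reflect the tuning described above.

```python
# Sketch: a SageMaker production variant tuned for a long-running
# video diffusion model. All concrete values are illustrative.

def endpoint_variant_config(model_name: str) -> dict:
    """Production variant tuned for slow model loads and long inference."""
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": "ml.p4de.24xlarge",
        "InitialInstanceCount": 1,
        # Give the container time to load the sharded 14B model before
        # health checks fail and the endpoint restarts prematurely.
        "ContainerStartupHealthCheckTimeoutInSeconds": 1800,
        "ModelDataDownloadTimeoutInSeconds": 1800,
    }

def create_endpoint_config(config_name: str, model_name: str) -> None:
    import boto3  # deferred so the helper above stays importable without AWS
    sm = boto3.client("sagemaker")
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[endpoint_variant_config(model_name)],
    )
```

The TorchServe-level response timeout is configured separately in the serving container so individual diffusion calls are not cut off mid-generation.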
Creative ideation pipeline
Now that you understand the overall architecture, let's examine how you can implement the creative ideation pipeline in your own environment. The pipeline transforms customer questionnaire data into production-ready storyboards through three stages.
Stage 1: Customer segment generation
The pipeline begins by analyzing Bark's customer questionnaire data using Amazon Bedrock with Anthropic's Claude 3.7 Sonnet. The large language model (LLM) processes survey responses to identify distinct customer personas with structured attributes including demographics, motivations, pain points, and decision-making factors. For example, in the domiciliary care category, the system identified segments such as:
- The Overwhelmed Family Caregiver – Adults in their 40s–50s balancing work responsibilities with caring for aging parents, prioritizing reliability and trust
- The Independence-Focused Senior – Elderly individuals seeking to maintain autonomy while acknowledging the need for occasional assistance
Each segment profile is reviewed in the UI through a human-in-the-loop process and serves as input to subsequent creative ideation, creating advertisements that resonate with identified audience characteristics.
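A minimal sketch of this segmentation step, assuming a Bedrock `converse` call that returns JSON, might look like the following. The prompt wording, model ID, and `parse_segments` helper are illustrative assumptions, not Bark's production code.

```python
import json

# Prompt wording and the attribute schema below are assumptions for
# illustration; production prompts will differ.
SEGMENT_PROMPT = (
    "Analyze these customer questionnaire responses and return a JSON list of "
    "personas, each with demographics, motivations, pain_points, and "
    "decision_factors fields:\n\n{responses}"
)

def parse_segments(model_text: str) -> list[dict]:
    """Keep only personas carrying all four structured attributes."""
    required = {"demographics", "motivations", "pain_points", "decision_factors"}
    return [s for s in json.loads(model_text) if required <= s.keys()]

def generate_segments(responses: str) -> list[dict]:
    import boto3  # deferred import: parse_segments stays testable without AWS
    bedrock = boto3.client("bedrock-runtime")
    reply = bedrock.converse(
        modelId="anthropic.claude-3-7-sonnet-20250219-v1:0",
        messages=[{"role": "user",
                   "content": [{"text": SEGMENT_PROMPT.format(responses=responses)}]}],
    )
    return parse_segments(reply["output"]["message"]["content"][0]["text"])
```

Filtering for the full attribute set keeps malformed model output from entering the downstream ideation stages.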
Stage 2: Creative brief generation
Given the business category and target segment, the system generates 4–6 creative concepts with varying degrees of abstraction, encouraging both literal and metaphorical approaches. We configure the model with high-temperature sampling (0.8–1) to encourage divergent thinking. The model employs chain-of-thought reasoning, explicitly evaluating concept relevance, engagement potential, and entertainment value before producing briefs. This produces diverse narrative approaches to the same commercial objective, such as straightforward testimonial formats or emotionally resonant metaphorical stories.
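The sampling configuration and chain-of-thought scaffold for this stage can be sketched as below. Only the 0.8–1 temperature range and the 4–6 concept count come from the pipeline; the exact parameter values and prompt wording are assumptions.

```python
# Illustrative inference configuration; values other than the temperature
# range and concept count are assumptions.
BRIEF_INFERENCE_CONFIG = {
    "temperature": 0.9,  # high-temperature sampling for divergent thinking
    "topP": 0.95,
    "maxTokens": 2000,
}

def brief_prompt(category: str, segment_name: str, n_concepts: int = 5) -> str:
    """Chain-of-thought scaffold: reason about each idea before writing briefs."""
    if not 4 <= n_concepts <= 6:
        raise ValueError("the pipeline requests 4-6 concepts per batch")
    return (
        f"You are a creative director for the {category} category.\n"
        f"Target segment: {segment_name}.\n"
        f"Brainstorm {n_concepts} ad concepts ranging from literal to "
        "metaphorical. For each concept, first reason step by step about its "
        "relevance, engagement potential, and entertainment value, then write "
        "the creative brief."
    )
```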
Stage 3: Storyboard refinement
The final stage transforms generic creative briefs into segment-specific storyboards. A stochastic feature sampling mechanism, which randomly determines which attributes to highlight, identifies which customer segment attributes to emphasize, maintaining diversity while addressing specific motivations and pain points. The system performs explicit brief-to-segment matching through prompted reasoning before producing the final storyboard with full audiovisual specifications, including scene descriptions, camera directions, narration text, and timing. Human review at this stage confirms brand alignment before production begins.
Maintaining visual consistency across scenes
A 30-second advertisement contains 4–6 distinct scenes, which are best generated individually. Without careful orchestration, AI models exhibit semantic drift: characters change appearance, backgrounds shift unexpectedly, and brand elements become inconsistent. Our solution implements a two-tier consistency framework.
Semantic consistency
You can transform creative briefs into video prompts through a three-stage process:
- Element extraction – An LLM analyzes the storyboard to identify atomic set elements (actors, props, objects, and locations) and flags those requiring consistency across scenes.
- Blueprint generation – For each recurring element, the system generates detailed specification blueprints, establishing canonical visual representations.
- Prompt transformation – High-level scene descriptions are transformed into detailed video generation prompts, incorporating both the original creative brief (for narrative adherence) and standardized set specifications (for visual consistency).
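The prompt transformation step above can be sketched as a small composition function. The field labels and matching heuristic are illustrative assumptions.

```python
def to_video_prompt(scene_description: str, brief: str,
                    blueprints: dict[str, str]) -> str:
    """Attach the canonical spec of every recurring element named in the
    scene, alongside the original brief for narrative adherence."""
    specs = [spec for name, spec in blueprints.items()
             if name.lower() in scene_description.lower()]
    parts = [f"Narrative context: {brief}", f"Scene: {scene_description}"]
    if specs:
        parts.append("Consistent elements: " + " | ".join(specs))
    return "\n".join(parts)
```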
Visual consistency
Although semantic consistency through detailed prompts significantly reduces drift, video generation models still exhibit interpretive latitude even under identical prompt specifications. To address this limitation, we implement a reference image extraction and propagation pipeline, as illustrated in the following diagram.
The pipeline consists of the following stages:
- Optimal frame identification – Amazon Nova Premier analyzes generated scenes to identify frames where target elements appear most clearly.
- Element segmentation – The open-source Segment Anything Model, deployed on Amazon ECS, isolates target elements from backgrounds.
- Reference propagation – Extracted reference images are fed to subsequent video generation calls using Wan 2.2's reference-to-video capabilities.
This dual-constraint approach, combining semantic specification through detailed prompts with visual specification through reference images, creates a robust consistency framework that we validated through systematic ablation studies.
The video generation pipeline
The pipeline orchestrates five modalities (text, image, video, audio, and overlay graphics) with strategic model selection based on scene requirements:
- Reference-to-video synthesis – Scenes requiring visual continuity use Wan 2.1 VACE-14B with extracted reference images
- Text-to-video generation – Scenes introducing new elements use Wan 2.1 Text2Video-14B
A Step Functions workflow sequences generation to verify that reference images are available before dependent scenes begin.
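The sequencing constraint amounts to a topological ordering over scene dependencies. A minimal sketch of that ordering logic (the scene IDs and data shape are illustrative, and in practice Step Functions handles this state):

```python
def generation_order(scenes: dict[str, set[str]]) -> list[str]:
    """Topologically order scenes so every scene's reference images,
    produced by the scenes it depends on, exist before generation starts."""
    order: list[str] = []
    done: set[str] = set()
    pending = dict(scenes)
    while pending:
        ready = sorted(s for s, deps in pending.items() if deps <= done)
        if not ready:
            raise ValueError("circular scene dependency")
        for s in ready:
            order.append(s)
            done.add(s)
            del pending[s]
    return order
```

Scenes with no dependency on extracted references can run in parallel within each ready batch.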
Speech synthesis and graphics
Speech synthesis uses Sesame AI Lab's Conversational Speech Model on GPU-enabled ECS instances (g5.2xlarge). Voice cloning requires a 10-second reference sample of Bark's brand narrator; the model extracts speaker embeddings that are used to condition subsequent generation. Amazon ECS scales to zero during idle periods, reducing costs outside active generation windows. In addition, text overlays and call-to-action graphics use template systems that maintain typographic consistency with Bark's brand guidelines. These elements are composited during final assembly.
Quality evaluation loop
An LLM-as-a-judge evaluation loop in Lambda assesses each scene across three dimensions:
- Narrative adherence – Accuracy to the storyboard description
- Visual quality – Absence of artifacts and inconsistencies
- Brand compliance – Alignment with brand guidelines
Scenes falling below configurable quality thresholds trigger automated regeneration while preserving visual reference elements. This iterative refinement continues until scenes meet quality standards or human review is requested.
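The threshold-and-regenerate loop can be sketched as follows. The threshold values, dimension keys, and attempt limit are illustrative assumptions; in production the thresholds are configurable.

```python
# Threshold values and dimension names are illustrative placeholders.
THRESHOLDS = {
    "narrative_adherence": 6.0,
    "visual_quality": 6.0,
    "brand_compliance": 7.0,
}

def failed_dimensions(scores: dict[str, float],
                      thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Dimensions that fell below their configured minimum."""
    return [dim for dim, minimum in thresholds.items()
            if scores.get(dim, 0.0) < minimum]

def refine(generate, judge, max_attempts: int = 3):
    """Regenerate a scene until every dimension passes or attempts run out;
    returning None signals that human review should be requested."""
    for attempt in range(1, max_attempts + 1):
        scene = generate()
        if not failed_dimensions(judge(scene)):
            return scene, attempt
    return None, max_attempts
```

Keeping the judge as a callable means the same loop works whether scores come from an LLM call or a cached evaluation during testing.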
Messaging-consistent landing pages
Using the generated videos and the customer segment as the base, we created an agentic system to generate personalized landing pages. We used a Strands agent to perform the following actions:
- Generate the TypeScript code for the page
- Take optimal screenshots from the video to include in the page (so the user knows it relates directly to the video ad they had seen)
- Amend the wording and design to align with the customer segment
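The screenshot step above can be implemented with plain `ffmpeg`. The sketch below builds the frame extraction command for a given timestamp; the file paths are placeholders, and running it assumes `ffmpeg` is installed.

```python
import subprocess

def screenshot_cmd(video_path: str, timestamp: float, out_path: str) -> list[str]:
    """ffmpeg command that grabs a single frame at `timestamp` seconds."""
    return ["ffmpeg", "-y", "-ss", f"{timestamp:.3f}", "-i", video_path,
            "-frames:v", "1", out_path]

def take_screenshot(video_path: str, timestamp: float, out_path: str) -> None:
    subprocess.run(screenshot_cmd(video_path, timestamp, out_path), check=True)
```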
Results
We evaluated AI-generated content against Bark's existing campaign library. The following table summarizes the results.
| Evaluation Dimension | AI-Generated Ads | Existing Campaign Library |
| --- | --- | --- |
| Story Structure & Coherence | 6.9 ± 0.49 | 6.4 ± 0.74 |
| Originality & Engagement | 6.5 ± 1.23 | 5.2 ± 1.22 |
| Visual & Spatial Consistency | 6.9 ± 0.74 | 6.6 ± 0.75 |

Scores are on a 10-point scale with 95% confidence intervals.
The results demonstrate that AI-generated content achieved higher narrative coherence scores, validating the hierarchical scene planning approach. The 25% improvement in originality scores suggests the creative ideation pipeline successfully balances novelty with commercial viability. The reference image propagation system delivered measurably higher character and setting consistency than manual production.
End-to-end, the pipeline generates a 15–30 second ad in roughly 12–15 minutes on ml.p4d.24xlarge SageMaker instances; this includes orchestration (reference extraction/segmentation), automated quality checks, and regeneration loops, not just a single model call. Multi-GPU sharding (8-way tensor parallel) keeps per-scene generation in the seconds-to-a-few-minutes range by fitting the 14B model fully in GPU memory and accelerating the heavy attention/denoising compute. Running it behind a SageMaker real-time endpoint keeps the model warm between requests and helps avoid latency from repeated model loads, and long inference timeouts and startup health checks reduce failures and retries for long-running diffusion calls.
Ablation study
To validate each architectural decision, we conducted systematic ablation studies. The following table summarizes the results.
| Configuration | Story Coherence | Engagement | Visual Consistency |
| --- | --- | --- | --- |
| Full system | 6.9 | 6.5 | 6.9 |
| Without reference image propagation | 7.5 | 4.8 | 6.7 |
| Without narrative element extraction | 7.6 | 4.5 | 6.4 |
| Without hierarchical scene planning | 7.0 | 4.5 | 6.5 |
The results reveal that removing reference image propagation significantly impacts engagement scores (from 6.5 to 4.8), indicating that consistent character representation supports more sophisticated narrative development. Disabling narrative element extraction caused the most severe engagement degradation while slightly improving structural scores, suggesting that structured narrative analysis supports creative risk-taking while maintaining coherent storylines.
What this means for your implementation
Based on our experience, the following are actionable guidelines for your own video generation projects:
- Human-in-the-loop is essential – Although the system automates the bulk of production time, human intervention at creative brief approval and final review confirms brand alignment.
- Reference image quality matters more than quantity – Our adaptive reference extraction system dynamically identifies optimal frames through multi-criteria analysis (visual clarity, lighting, element prominence). Poor reference images propagate errors throughout the video sequence.
- LLM-as-a-judge supports rapid iteration – Traditional video evaluation is expensive and slow. Using Anthropic's Claude to evaluate generated content against structured criteria supported rapid experimentation with different generation approaches.
- Design for compound consistency challenges – Single-character consistency has been deeply researched; the harder problem is maintaining consistency of compound elements, like furnished rooms, where multiple visual attributes must coexist. Plan your architecture around these complex cases.
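The multi-criteria frame analysis mentioned above can be sketched as a weighted score over pre-normalized criteria. The weights and criterion names are illustrative assumptions, not the production scoring function.

```python
def frame_score(clarity: float, lighting: float, prominence: float,
                weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted multi-criteria score; each criterion is assumed to be
    pre-normalized to [0, 1], and the weights are illustrative."""
    w_c, w_l, w_p = weights
    return w_c * clarity + w_l * lighting + w_p * prominence

def best_frame(frames: list[dict]) -> dict:
    """Choose the candidate frame with the highest combined score."""
    return max(frames, key=lambda f: frame_score(
        f["clarity"], f["lighting"], f["prominence"]))
```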
Clean up
If you replicate this solution in your own environment, remember to delete resources when you're finished experimenting to avoid ongoing costs:
- Delete SageMaker endpoints.
- Remove S3 buckets containing generated assets.
- Terminate Amazon ECS services and task definitions.
- Delete Lambda functions and Step Functions state machines.
Conclusion
This collaboration establishes a replicable pattern for AI-assisted creative production using AWS services. The core architectural insight, combining semantic consistency through hierarchical prompt planning with visual consistency through reference image propagation, addresses fundamental challenges in multi-scene video generation that extend beyond advertising into other domains requiring coherent, extended narratives. For Bark, the solution, currently under business evaluation, has the potential to support rapid experimentation with personalized social media campaigns, supporting their expansion into mid-funnel marketing channels.
To get started building a similar solution, consider the following next steps:
Acknowledgement
Special thanks to Giuseppe Mascellero and Nikolas Zavitsanos for their contribution.
About the authors
Zainab Afolabi
Zainab Afolabi is a Senior Data Scientist at the Generative AI Innovation Centre in London, where she leverages her extensive expertise to develop transformative AI solutions across diverse industries. She has over nine years of specialized experience in artificial intelligence and machine learning, as well as a passion for translating complex technical concepts into practical business applications.
Margherita Rosnati
Margherita Rosnati is an Applied Scientist in the Custom Model Optimization team at the AWS Generative AI Innovation Center. With a PhD in Machine Learning for Medical Imaging, she specialises in building tailored AI and ML solutions across imaging, video, and natural language processing for enterprise customers.
Laksh Puri
Laksh Puri is a Senior Generative AI Strategist at the AWS Generative AI Innovation Center, based in London. He works with large organizations across EMEA on their AI strategy, including advising executive leadership to define and deploy impactful generative AI solutions.
Hammad Mian
Hammad Mian is currently CMO at Bark.com and has over 20 years of commercial and marketing experience, focused on driving growth for consumer and technology businesses.
Joonas Kukkonen
Joonas Kukkonen is CTO at Bark.com. With a career spanning leadership roles at Bark.com, Busuu and Spotify, he has over 20 years of experience building online products for consumers and businesses.

