Production machine learning (ML) teams struggle to trace the full lineage of a model: the data and the code that trained it, the exact dataset version it consumed, and the experiment metrics that justified its deployment. Without this traceability, questions like “which data trained the model currently in production?” or “can we reproduce the model we deployed six months ago?” become multi-day investigations through scattered logs, notebooks, and Amazon Simple Storage Service (Amazon S3) buckets. This gap is especially acute in regulated industries, for example healthcare, financial services, and autonomous vehicles, where audit requirements demand that you link deployed models to their precise training data, and where individual records might need to be excluded from future training on request.
In this post, we show how to combine three tools to close this gap: DVC for data versioning, Amazon SageMaker AI for scalable compute, and the SageMaker AI MLflow App for experiment tracking and model registry. We walk through two deployable patterns, dataset-level lineage and record-level lineage, that you can run end-to-end in your own AWS account using the companion notebooks.
Solution overview
The architecture integrates DVC, SageMaker AI, and the SageMaker AI MLflow App into a single workflow where every model is traceable back to its exact training data.
Each tool plays a distinct role:
| Tool | Purpose | What it stores |
| --- | --- | --- |
| DVC | Data and artifact versioning | Lightweight .dvc metafiles in Git; actual data in Amazon S3 |
| Amazon SageMaker AI | Scalable compute for processing, training, and hosting | Processing/training job orchestration and model hosting |
| Amazon SageMaker AI MLflow App | Experiment tracking, model registry, lineage | Parameters, metrics, artifacts, registered models |
The data flows through four stages:
- A SageMaker AI Processing job preprocesses raw data and versions the processed dataset with DVC, pushing the data to S3 and the metadata to a Git repository.
- A SageMaker AI Training job clones the DVC repository at a specific Git tag, runs dvc pull to retrieve the exact versioned dataset, trains the model, and logs everything to MLflow.
- Every MLflow training run records the data_git_commit_id, the DVC commit hash that points to the exact dataset in Amazon S3.
- The trained model is registered in the MLflow Model Registry and can be deployed to a SageMaker AI endpoint.
This creates a complete traceability chain: Production Model → MLflow Run → DVC commit → exact dataset in Amazon S3.
Prerequisites
You must have the following prerequisites to follow along with this post:
- An AWS account with permissions for Amazon SageMaker (Processing, Training, MLflow Apps, Endpoints), Amazon S3, AWS CodeCommit, and AWS Identity and Access Management (IAM).
- Python 3.11 or Python 3.12.
- The SageMaker Python SDK v3.4.0 or later.
The companion repository includes a requirements.txt with all dependencies. If working outside SageMaker Studio, your IAM role must have a trust relationship allowing sagemaker.amazonaws.com to assume it.
Note on Git providers: The notebooks use AWS CodeCommit as the Git backend for DVC metadata. However, DVC works with other Git providers (GitHub, GitLab, Bitbucket). All you need to do is replace the git remote add origin URL and configure appropriate credentials, for example by storing tokens in AWS Secrets Manager and fetching them at runtime, or by using AWS CodeConnections. The key requirement is that your SageMaker AI execution role can access the Git repository or has permissions to use AWS CodeConnections.
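As a sketch of the Secrets Manager approach, the following shows one way to authenticate a Git remote at runtime. The secret name, repository URL, and helper names are hypothetical, not from the companion repository:

```python
import subprocess

def build_authenticated_remote(remote_url: str, token: str, user: str = "git") -> str:
    """Inject basic-auth credentials into an HTTPS Git remote URL."""
    scheme, rest = remote_url.split("://", 1)
    return f"{scheme}://{user}:{token}@{rest}"

def fetch_git_token(secret_name: str) -> str:
    """Read a Git provider token from AWS Secrets Manager at runtime."""
    import boto3  # imported here; requires AWS credentials when called
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_name)["SecretString"]

# Inside the job, point the DVC metadata repo at the authenticated remote:
# token = fetch_git_token("demo/github-token")                       # hypothetical secret
# remote = build_authenticated_remote(
#     "https://github.com/example/dvc-metadata.git", token)          # hypothetical repo
# subprocess.check_call(["git", "remote", "set-url", "origin", remote])
```
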
How DVC and SageMaker AI MLflow work together
The key insight behind this architecture is that DVC and MLflow each solve half of the lineage problem, and together they close the loop.
DVC (Data Version Control) is a free, open source tool that extends Git to handle large datasets and ML artifacts. Git alone can’t manage large binary files because repositories become bloated and slow, and services like GitHub block files over 100 MB. DVC addresses this by tracking lightweight .dvc metafiles (content-addressable pointers) in Git, while the actual data lives in remote storage such as Amazon S3. This gives you Git-like versioning semantics (branching, tagging, diffing) for datasets that can be gigabytes or terabytes in size, without bloating your repository.
Storage efficiency:
DVC uses content-addressable storage (MD5 hashes), so it stores only new or modified files rather than duplicating entire datasets. Files with identical contents are stored only once in the DVC cache, even when they appear under different names or across different dataset versions. For example, adding 1,000 new images to an existing dataset only uploads those new files to S3. The unchanged files aren’t re-uploaded. However, if a preprocessing step modifies existing files, the affected files get new hashes and are stored as new objects.
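The content-addressing idea is easy to see in miniature. The following is an illustrative sketch, not DVC’s actual code; the helper names are our own, and the two-character directory shard mirrors the general shape of DVC’s cache layout:

```python
import hashlib

def dvc_cache_key(content: bytes) -> str:
    """MD5 hex digest: the identity content-addressable storage assigns to file bytes."""
    return hashlib.md5(content).hexdigest()

def cache_path(key: str) -> str:
    """DVC-style cache location: first two hex characters as a directory shard."""
    return f".dvc/cache/{key[:2]}/{key[2:]}"

# Identical bytes hash to the same key, so a renamed or duplicated file adds
# nothing new to the cache; modified bytes get a brand-new key (a new S3 object).
same = dvc_cache_key(b"scan-0001") == dvc_cache_key(b"scan-0001")
changed = dvc_cache_key(b"scan-0001") != dvc_cache_key(b"scan-0001-edited")
```
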
Beyond data versioning, DVC also supports reproducible data pipelines, experiment management, and can serve as a data registry for sharing datasets across teams. In this architecture, we use DVC specifically for its data versioning capability. Each time you version a dataset with dvc add and commit the resulting .dvc file, you create a Git commit that maps to a specific dataset state. Tagging that commit gives you a stable reference you can return to with git checkout && dvc pull. For a deeper dive into DVC’s versioning capabilities, see the Versioning Data and Models guide.
SageMaker AI MLflow App is a fully managed AWS capability, offered within SageMaker AI Studio, for managing the end-to-end ML and generative AI lifecycle. Its core capabilities include experiment tracking (logging parameters, metrics, and artifacts for every training run), a model registry with versioning and lifecycle stage management, model evaluation, and deployment integrations. In this post’s architecture, we use MLflow for full experiment tracking alongside the DVC results, and for the model registry. By logging the DVC commit hash as a parameter (data_git_commit_id) on every training run, we create the bridge: models in the MLflow registry can be traced back to the exact Git tag, which maps to the exact dataset in S3.
While DVC can handle both data versioning and experiment tracking on its own, MLflow brings a more mature model registry with model versioning, aliases for lifecycle management, and deployment integrations. By using DVC for data versioning and MLflow for model lifecycle management, we get a clean separation of concerns: DVC owns the data-to-training lineage, MLflow owns the training-to-deployment lineage, and the Git commit hash ties them together.
Pattern one: Dataset-level lineage (foundational)
Before building the integration, it’s important to understand how DVC’s dataset versioning and MLflow’s run tracking complement each other to form a full lineage. The foundational notebook demonstrates the core pattern by simulating a typical scenario: starting with limited labeled data and expanding over time.
The workflow
The notebook runs two experiments using the CIFAR-10 image classification dataset:
- v1.0: Process and train with 5% of the data (~2,250 training images)
- v2.0: Process and train with 10% of the data (~4,500 training images)
For each version, the same two-step pipeline executes:
Step 1 (processing job): A SageMaker Processing job downloads CIFAR-10, samples the configured fraction, splits it into train/validation/test sets, saves the images in ImageFolder format, and versions the result with DVC. The processed dataset is pushed to S3 via dvc push, and the Git metadata (including a unique tag like v1.0-02-24-26_1430) is pushed to CodeCommit.
The processing job receives the DVC repository URL and MLflow tracking URI as environment variables:
processor_v1 = FrameworkProcessor(
    image_uri=processing_image,
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    env={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_REPO_NAME": dvc_repo_name,
        "MLFLOW_TRACKING_URI": mlflow_app_arn,
        "MLFLOW_EXPERIMENT_NAME": experiment_name,
        "PIPELINE_RUN_ID": pipeline_run_id_v1,
    }
)
processor_v1.run(
    code="preprocessing_foundational.py",
    source_dir="../source_dir",
    arguments=[
        "--data-fraction", str(data_fraction_v1),
        "--data-version", data_version_v1,
        "--val-split", "0.1"
    ],
    wait=True
)
Inside the processing script, after preprocessing, the dataset is versioned with DVC and the commit hash is logged to MLflow:
def version_with_dvc(repo_path, version_tag, pipeline_run_id):
    """Add data to DVC and push to the remote."""
    subprocess.check_call(["dvc", "add", "dataset"], cwd=repo_path)
    subprocess.check_call(["git", "add", "dataset.dvc", ".gitignore"], cwd=repo_path)
    subprocess.check_call(
        ["git", "commit", "-m", f"Add dataset version {version_tag}"],
        cwd=repo_path
    )
    subprocess.check_call(["git", "tag", pipeline_run_id], cwd=repo_path)
    subprocess.check_call(["dvc", "push"], cwd=repo_path)
    subprocess.check_call(["git", "push", "origin", "main"], cwd=repo_path)
    subprocess.check_call(["git", "push", "origin", pipeline_run_id], cwd=repo_path)
    commit_id = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_path
    ).decode().strip()
    return commit_id
Step 2 (training job): A SageMaker AI Training job clones the DVC repository at the exact tag from Step 1, runs dvc pull to download the versioned dataset, and fine-tunes a pretrained MobileNetV3-Small model. The training script logs the parameters (including the DVC commit hash), per-epoch metrics, and the trained model to MLflow. The model is automatically registered in the MLflow Model Registry.
The critical lineage bridge (logging the DVC commit hash to MLflow) happens in the training script:
# Fetch data: clone the DVC repo at the exact tag, then dvc pull
data_git_commit_id = fetch_data_from_dvc()
with mlflow.start_run(run_name=run_name) as run:
    mlflow.log_params({
        "data_version": data_version,
        "data_git_commit_id": data_git_commit_id,  # <-- the lineage bridge
        "dvc_repo_url": dvc_repo_url,
        "model_architecture": "mobilenet_v3_small",
        "epochs": args.epochs,
        "learning_rate": args.learning_rate,
        # ...
    })
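The fetch_data_from_dvc() helper itself is not reproduced in this post. A minimal sketch of what it does, per the workflow described above (clone at the tag, dvc pull, capture the commit hash); the parameter names and the injectable run/capture callables are our own additions so the command sequence can be exercised without git or dvc installed:

```python
import subprocess

def fetch_data_from_dvc(repo_url, tag, workdir="/tmp/dvc-repo",
                        run=subprocess.check_call,
                        capture=subprocess.check_output):
    """Clone the DVC metadata repo at a Git tag, pull the dataset, return HEAD."""
    run(["git", "clone", "--branch", tag, repo_url, workdir])
    run(["dvc", "pull"], cwd=workdir)  # downloads the exact versioned files from S3
    return capture(["git", "rev-parse", "HEAD"], cwd=workdir).decode().strip()
```
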
What you see in MLflow
After both experiments complete, the MLflow UI shows the two runs side by side, as shown in the following screenshot. In the MLflow experiment, you can compare:
- Training and validation accuracy curves across data versions
- The exact hyperparameters and data version for each run
- The data_git_commit_id that links each model to its DVC dataset
Choosing a run shows the full detail: loss curves, parameters, and the DVC commit linking to the exact dataset in S3, as shown in the following screenshot.
Finally, trained artificial intelligence and machine learning (AI/ML) models are automatically registered in the MLflow Model Registry with version history and links to the training run that produced them, as shown in the following screenshot. Additionally, with the SageMaker AI MLflow App integrated with SageMaker AI Model Registry, MLflow automatically logs the registered model into SageMaker AI Model Registry.
Deploying the model
The notebook deploys the recommended model (v2.0, trained on more data) from the MLflow Model Registry to a SageMaker AI real-time endpoint using ModelBuilder. After deployment, you can invoke the endpoint with raw image bytes and get back class predictions. The full deployment and inference code is in the notebook.
What this pattern answers
With dataset-level lineage, you can answer:
- “Which dataset version trained this model?” – Look up the data_git_commit_id in the MLflow run
- “Can I reproduce this model’s training data?” – Run git checkout && dvc pull to restore the exact dataset
- “Why did model performance change?” – Compare runs in MLflow and trace each to its data version
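The first question can also be inverted, finding every run that used a given dataset commit. A sketch using MLflow’s run search; the experiment name and commit value below are hypothetical placeholders:

```python
def dataset_filter(data_git_commit_id: str) -> str:
    """Build the MLflow filter string matching runs logged against one DVC commit."""
    return f"params.data_git_commit_id = '{data_git_commit_id}'"

# With a tracking URI configured:
# import mlflow
# runs = mlflow.search_runs(
#     experiment_names=["my-experiment"],      # hypothetical experiment name
#     filter_string=dataset_filter("abc123"),  # placeholder commit hash
# )
# print(runs[["run_id", "params.data_version"]])
```
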
What it doesn’t answer without additional work: “Was record X in this model’s training data?” You’d need to pull the full dataset and search through it. That’s where Pattern two comes in.
Pattern two: Record-level lineage (healthcare compliance)
Pattern two builds directly on the dataset-level approach, adding record- and patient-level traceability through manifests and consent registries. The example healthcare compliance notebook extends the foundational pattern for regulated environments where you need to trace individual records, not only datasets, through the ML lifecycle.
The key addition: a manifest
The difference is a manifest: a structured CSV listing every individual record in each dataset version:
patient_id,scan_id,file_path,split,label
PAT-00001,PAT-00001-SCAN-0001,train/normal/00042.png,train,normal
PAT-00023,PAT-00023-SCAN-0015,train/tuberculosis/00015.png,train,tuberculosis
…
This manifest is stored inside the DVC-versioned dataset directory and logged as an MLflow artifact on every training run. This makes individual records queryable directly from MLflow without pulling the full dataset from DVC.
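A minimal sketch of producing such a manifest inside the processing job; build_manifest is an illustrative helper (not the notebook’s actual code), and the commented mlflow.log_artifact call is the step the post describes:

```python
import csv
import io

FIELDS = ["patient_id", "scan_id", "file_path", "split", "label"]

def build_manifest(records) -> str:
    """Serialize record dicts into the manifest CSV format shown above."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

# In the processing job (sketch): write the manifest into the DVC-versioned
# dataset directory so it is versioned with the data, then attach it to the run:
# manifest_path = os.path.join(dataset_dir, "manifest.csv")
# with open(manifest_path, "w") as f:
#     f.write(build_manifest(records))
# mlflow.log_artifact(manifest_path)
```
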
The consent registry
The workflow is driven by a consent registry, a CSV file listing each patient and their consent status. In production, this would be a database with transactional guarantees, its own audit trail, and possibly event-driven triggers to initiate retraining. The CSV approach here is simplified for demonstration purposes, but the integration pattern is the same: the processing job reads the registry and only includes records with active consent.
The processing code is idempotent. It doesn’t know or care about opt-outs; it filters for consent_status == "active" and processes whatever remains. An opt-out is an input change that produces a new, clean dataset when the same pipeline runs again.
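The filtering step can be sketched as a pure function over the registry. The column names follow the demo’s CSV; the exact helper is illustrative, not the notebook’s code:

```python
import csv
import io

def consented_patient_ids(registry_csv: str) -> set:
    """Return patient IDs whose consent_status is 'active' in the registry CSV text."""
    reader = csv.DictReader(io.StringIO(registry_csv))
    return {row["patient_id"] for row in reader if row["consent_status"] == "active"}

# The processing job keeps only records whose patient is still consented:
# active = consented_patient_ids(open("consent_registry.csv").read())
# records = [r for r in all_records if r["patient_id"] in active]
```
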
The opt-out workflow
The notebook demonstrates a complete opt-out cycle:
- v1.0 (baseline) – Process and train with all consented patients. The manifest lists the persisted scans. The model is registered in MLflow with the manifest as an artifact.
- Opt-out event – Patient PAT-00023 requests to opt out. Their consent status is updated to revoked in the registry, and the updated registry is uploaded to S3.
- v2.0 (clean dataset) – The same processing job runs with the updated registry. PAT-00023’s images are automatically excluded. DVC versions the new dataset (137 patients). The model is retrained and registered as a new version in MLflow.
- Audit verification – Query MLflow to confirm that PAT-00023 appears only in the v1.0 model and is absent from models trained after the opt-out date.
Audit queries
The companion utils/audit_queries.py module provides three query functions that work by downloading manifest artifacts from MLflow:
- find_models_with_patient("PAT-00023") – Searches the training runs for a patient ID. Returns only the v1.0 run.
- verify_patient_excluded_after_date("PAT-00023", "2025-06-01") – Checks the models trained after a date and confirms that the patient is absent. Returns PASSED or FAILED with details.
- get_patients_in_model(run_id) – Lists the patient IDs in a specific model’s training data.
from utils.audit_queries import find_models_with_patient

# "Which models were trained on this patient's data?"
find_models_with_patient("PAT-00023", experiment_name="demo-cxr-mlflow-dvc")
These queries don’t require a DVC checkout; they operate entirely on MLflow artifacts, making them fast enough for interactive audit responses.
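The audit helpers themselves aren’t listed in the post, but their core operation, scanning a downloaded manifest for a patient ID, can be sketched as a pure function. The commented MLflow download loop is an assumption about how the module is wired, not its actual code:

```python
import csv
import io

def manifest_contains_patient(manifest_csv: str, patient_id: str) -> bool:
    """True if the manifest CSV text lists any record for the given patient."""
    reader = csv.DictReader(io.StringIO(manifest_csv))
    return any(row["patient_id"] == patient_id for row in reader)

# Sketch of the surrounding scan (assumed wiring):
# import mlflow
# for run_id in mlflow.search_runs(experiment_names=[experiment_name])["run_id"]:
#     local = mlflow.artifacts.download_artifacts(run_id=run_id,
#                                                 artifact_path="manifest.csv")
#     if manifest_contains_patient(open(local).read(), "PAT-00023"):
#         print("found in run", run_id)
```
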
Production note: The previous queries download the manifest.csv artifact from every training run and scan it. This works for a handful of runs but doesn’t scale. In production, consider writing (record_id, run_id, data_version) tuples to Amazon DynamoDB at training time, pointing Amazon Athena at the MLflow artifact prefix in S3, or using a post-training AWS Lambda function to populate an index.
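A sketch of the DynamoDB option; the table name and key schema are hypothetical, and only the item construction is shown as runnable:

```python
def manifest_index_items(run_id: str, data_version: str, record_ids):
    """Build one lineage item per training record for a DynamoDB index."""
    return [
        {"record_id": rid, "run_id": run_id, "data_version": data_version}
        for rid in record_ids
    ]

# At training time (sketch, assuming a table keyed on record_id + run_id):
# import boto3
# table = boto3.resource("dynamodb").Table("ml-lineage-index")  # hypothetical table
# with table.batch_writer() as batch:
#     for item in manifest_index_items(run_id, data_version, record_ids):
#         batch.put_item(Item=item)
```
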
What this pattern answers
Beyond everything the foundational pattern provides, record-level lineage answers:
- “Which models were trained using patient X’s scans?” – Instant query across MLflow runs
- “Verify that patient X was excluded from all models after their opt-out date” – Automated pass/fail audit
- “List every record in model Y’s training data” – Download the manifest artifact
While this demo uses healthcare terminology, the pattern applies to other domains requiring record-level traceability: financial services, content moderation (user-submitted content), or other ML systems subject to data deletion requests.
Best practices and governance
The three-layer traceability chain
The integrated workflow creates traceability at three levels:
- Git + DVC layer – Every dataset version is a Git tag pointing to a DVC commit. Running git checkout && dvc pull restores the exact processed data.
- MLflow layer – Every training run records the data_git_commit_id, linking the model to its DVC data version. The record-level manifest (when used) makes individual records queryable.
- Model Registry layer – Every registered model version links to its training run, which links to its data version.
Security considerations for regulated environments
DVC and MLflow provide traceability and experiment tracking but aren’t tamper-evident on their own. For regulated deployments (HIPAA, FDA 21 CFR Part 11, GDPR), layer on infrastructure-level controls:
- S3 Object Lock (compliance mode) on DVC remotes and MLflow artifact stores to prevent modification or deletion of versioned data and model artifacts
- AWS CloudTrail for independent, append-only logging of access to storage and training infrastructure
- IAM policies enforcing least-privilege access to production buckets, MLflow tracking servers, and Git repositories
- Encryption at rest using AWS Key Management Service (AWS KMS) for S3 buckets storing DVC data and MLflow artifacts
Speeding up iteration
When running repeated experiments (like the v1.0 → v2.0 flow), two SageMaker AI features help streamline the process:
- SageMaker Managed Warm Pools – Keep training instances warm between jobs so back-to-back training runs reuse already-provisioned infrastructure. Add keep_alive_period_in_seconds to your Compute config to enable it. Note that warm pools apply to training jobs only, not processing jobs.
- SageMaker AI Pipelines – Orchestrate the processing → training → registration workflow as a single, repeatable pipeline. Pipelines handle step dependencies, pass artifacts between steps automatically, and can be triggered programmatically (for example, when a patient opts out and the manifest is updated).
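The warm pool setting can be sketched as a small config helper. The keyword names follow the parameter named above; the Compute class import path varies by SDK version, so it is left commented as an assumption:

```python
def warm_pool_compute_config(instance_type: str, keep_alive_seconds: int = 1800) -> dict:
    """Compute settings with a warm pool enabled between back-to-back training jobs."""
    return {
        "instance_type": instance_type,
        "instance_count": 1,
        "keep_alive_period_in_seconds": keep_alive_seconds,  # warm pool duration
    }

# Assumed SDK v3 usage; verify the import path against your installed version:
# from sagemaker.modules.configs import Compute
# compute = Compute(**warm_pool_compute_config("ml.m5.xlarge"))
```
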
Cleanup
To avoid ongoing costs, delete the resources created during the walkthrough: the SageMaker AI endpoint, the MLflow App (optional), the AWS CodeCommit repository, and the S3 data. The notebooks include cleanup cells with the exact commands. The primary cost driver is the SageMaker AI real-time endpoint; make sure to delete it promptly after testing.
Conclusion
In this post, we demonstrated how to build an end-to-end MLOps workflow that combines DVC for data versioning, Amazon SageMaker AI for scalable training and orchestration, and the SageMaker AI MLflow App for experiment tracking and model registry. The key outcomes:
- Full reproducibility – Models can be traced back to their exact training data via DVC commit hashes stored in MLflow.
- Record-level lineage – The manifest pattern enables querying which individual records trained a given model. This is critical for opt-out compliance and audit responses.
- Stateless compliance alignment – The consent registry pattern handles record exclusion without changing processing code. An opt-out is an input change that flows through the same pipeline.
- Experiment comparison – MLflow provides side-by-side comparison of models trained on different data versions, with full parameter and metric tracking.
The two notebooks in the companion GitHub repository are deployable as-is. The foundational pattern suits teams that need dataset-level traceability. The healthcare compliance pattern extends it for regulated environments requiring record-level audit trails. Both share the same SageMaker AI training code and architecture.
While the notebooks demonstrate an interactive workflow, the same pattern integrates directly into automated pipelines. SageMaker AI Pipelines can orchestrate the processing and training steps, with DVC tagging and MLflow logging happening identically inside each job. The lineage chain stays the same whether triggered from a notebook or a SageMaker AI Pipeline.
About the authors
Manuwai Korber
Manuwai Korber is an AI/ML Specialist Solutions Architect at AWS with a background in ML engineering. He helps customers architect production-grade AI/ML systems across the full model lifecycle, from experimentation, training, and fine-tuning through to serving and production deployment, in addition to building GenAI-powered applications and agentic AI systems.
Paolo Di Francesco
Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunications Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.
Sandeep Raveesh
Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval-Augmented Generation (RAG), GenAI agents, and scaling GenAI use cases. He also focuses on go-to-market strategies, helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.
Nick McCarthy
Nick McCarthy is a Senior Generative AI Specialist Solutions Architect on the Amazon Bedrock team, focused on model customization. He has worked with AWS clients across a wide range of industries, including healthcare, finance, sports, telecommunications, and energy, helping them accelerate business outcomes through the use of AI and machine learning. Outside of work, Nick loves traveling, exploring new cuisines, and learning about science and technology. He holds a Bachelor’s degree in Physics and a Master’s degree in Machine Learning.

