Building and managing machine learning (ML) features at scale is one of the most significant and complex challenges in modern data science workflows. Organizations often struggle with fragmented feature pipelines, inconsistent data definitions, and redundant engineering efforts across teams. Without a centralized system for storing and reusing features, models risk being trained on outdated or mismatched data, leading to poor generalization, lower model accuracy, and governance issues. Moreover, enabling collaboration across data engineering, data science, and ML operations teams becomes difficult when each group maintains its own isolated datasets and transformations.
Amazon SageMaker addresses these challenges through SageMaker Unified Studio and SageMaker Catalog, which organizations can use to build, manage, and share assets securely across projects and accounts. A key capability within this ecosystem is the implementation of an offline feature store, a structured repository designed for managing the historical feature data used in model training and validation. Offline feature stores are designed for scalability, lineage tracking, and reproducibility, so that data scientists can train models on accurate, time-aligned datasets that prevent data leakage and maintain consistency across experiments.
This blog post provides step-by-step guidance on implementing an offline feature store using SageMaker Catalog within a SageMaker Unified Studio domain. By adopting a publish-subscribe pattern, data producers can use this solution to publish curated, versioned feature tables, while data consumers can securely discover, subscribe to, and reuse them for model development. The approach integrates Amazon S3 Tables with Apache Iceberg for transactional consistency, AWS Lake Formation for fine-grained access control, and Amazon SageMaker Studio for visual and code-based data engineering.
Through this unified solution, teams can achieve consistent feature governance, accelerate ML experimentation, and reduce operational overhead. In this post, we show you how to design a collaborative, governed, and production-ready offline feature store to unlock enterprise-wide reuse of trusted ML features.
This solution demonstrates how to implement an offline feature store using a SageMaker Unified Studio domain integrated with SageMaker Catalog to enable scalable, governed, and collaborative feature management across ML teams. The architecture establishes a unified environment, shown in the following figure, that streamlines how administrators, data engineers, and data scientists create, publish, and consume high-quality, reusable feature tables.
At its core, the solution uses a SageMaker Unified Studio domain as the governance and collaboration layer for managing projects, users, and data assets under centralized control. S3 Tables in the Apache Iceberg format serve as the foundation for storing and versioning feature data. SageMaker Catalog, which enables unified governance for datasets, acts as the central registry for publishing, discovering, and subscribing to feature tables.
The following describes how the personas interact in the end-to-end workflow:
- Admin uses AWS CloudFormation templates and the AWS Management Console to set up and configure the environment, including provisioning the SageMaker Unified Studio domain and onboarding users and groups.
- Admin creates the SageMaker Unified Studio data project, onboards Amazon Simple Storage Service (Amazon S3) datasets, such as airline_delay.csv and airline_features (an S3 table), to the project catalog, then designates the data engineer as the project owner.
- Data engineer opens the data project, uses the Visual extract, transform, and load (ETL) tool or a data processing job to build the feature pipeline, and creates features in the airline_features table within the project catalog.
- Data engineer accesses the data explorer tool to enrich the airline_features table with metadata for improved discoverability and governance.
- After validation and approval, the data engineer publishes the airline_features table into SageMaker Catalog for organization-wide access.
- Data scientist opens the ML project, uses AI-powered search to find the airline_features table in the catalog, and identifies the published feature table as suitable for their model development.
- Data scientist submits a subscription request to consume the published feature table. If auto-approval isn't configured, the publisher must review and approve the access request.
- After approval, the data scientist gains access to the feature table through the project catalog, both in the data explorer and directly from Jupyter notebooks for model training and experimentation.
This structured workflow provides consistent data governance, promotes collaboration, and eliminates redundant feature engineering efforts by enabling enterprise-wide reuse of trusted, versioned ML features.
The offline feature store solution architecture consists of several integrated components. Each component plays a distinct role in enabling secure data governance, scalable feature engineering, and seamless collaboration across ML personas. The key components include:
SageMaker Unified Studio domain: The SageMaker Unified Studio domain serves as the central control plane for managing ML projects, users, and data assets. It provides a unified interface for collaboration between data engineers, data scientists, and administrators. The domain enables enforcement of fine-grained access controls, integrates with AWS IAM Identity Center for single sign-on, and supports approval workflows to help ensure secure sharing of ML assets across teams and accounts.
S3 Tables with Apache Iceberg format: S3 Tables enables scalable, serverless storage for feature data using the Apache Iceberg table format. The Apache Iceberg open table format (OTF) enables ACID transactions, schema evolution, and time-travel capabilities, which teams can take advantage of to query historical versions of feature data with full reproducibility. S3 Tables integrates seamlessly with Spark, AWS Glue, and SageMaker for consistent data access across analytical and ML workloads.
Feature engineering pipeline: The feature engineering pipeline automates the transformation of raw datasets into curated, high-quality features. Built on Apache Spark, it provides distributed data processing at scale, enabling complex transformations such as delay-rate computation, categorical encoding, and feature aggregation. The pipeline writes outputs directly to S3 tables, helping to ensure traceability and consistency between raw/processed data and engineered features.
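The blog's pipeline runs on Spark, but the core delay-rate transformation can be sketched in plain Python. This is an illustration only; the column names (carrier, arr_flights, arr_del15) are assumptions based on the public airline delay dataset, not the blog's exact schema:

```python
from collections import defaultdict

def compute_delay_rates(rows):
    """Aggregate per-carrier delay rates from raw airline delay records.

    Each row is a dict with 'carrier', 'arr_flights' (total arrivals), and
    'arr_del15' (arrivals delayed 15+ minutes). Returns a dict mapping
    carrier -> delay rate in [0, 1].
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # carrier -> [flights, delayed]
    for row in rows:
        totals[row["carrier"]][0] += row["arr_flights"]
        totals[row["carrier"]][1] += row["arr_del15"]
    return {
        carrier: (delayed / flights if flights else 0.0)
        for carrier, (flights, delayed) in totals.items()
    }

rows = [
    {"carrier": "AA", "arr_flights": 100, "arr_del15": 20},
    {"carrier": "AA", "arr_flights": 50, "arr_del15": 10},
    {"carrier": "DL", "arr_flights": 200, "arr_del15": 30},
]
print(compute_delay_rates(rows))  # {'AA': 0.2, 'DL': 0.15}
```

In the actual pipeline, the same aggregation would be expressed as a Spark groupBy over the raw table before writing the result to the Iceberg-backed S3 table.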
SageMaker Catalog: SageMaker Catalog acts as the organization-wide repository for registering, publishing, and discovering ML assets such as datasets, feature tables, and models. It integrates with Lake Formation for fine-grained access control and IAM Identity Center for user management. The catalog supports metadata enrichment, versioning, and approval workflows, so teams can securely share and reuse trusted assets across projects.
Together, these components create a cohesive ecosystem that simplifies the ML feature lifecycle, from creation and publication to discovery and consumption, while maintaining enterprise-grade governance and data lineage tracking.
The administrator workflow defines the initial setup required to establish a secure and collaborative environment for implementing the offline feature store. Administrators provision the SageMaker Unified Studio domain, enable IAM Identity Center for user authentication, and configure S3 tables with Lake Formation for governed data access. They also create dedicated producer and consumer projects, deploy the necessary infrastructure through environment blueprints (based on CloudFormation), and assign users and groups with appropriate permissions. This setup helps ensure a consistent, well-governed foundation that data engineers and data scientists can use to build, publish, and consume ML features seamlessly within SageMaker Unified Studio.
Prerequisites
Complete the following prerequisites to make sure that the required AWS services and permissions are properly set up for seamless integration and governance.
- Enable IAM Identity Center
- Navigate to IAM Identity Center in the console.
- Choose Enable if it isn't already activated.
- Choose Enable with AWS Organizations or standalone, and complete the setup wizard.
- Enable S3 Tables
- Go to Amazon S3 in the console.
- Select the S3 Tables section and choose Enable S3 Tables integration.
- Lake Formation administrator
- Verify that the console role is added to the Lake Formation administrators for proper resource creation and permissions management.
Set up the environment
After completing the prerequisites, the next step is to set up the environment by creating and configuring the SageMaker Unified Studio domain, which serves as the central workspace for managing users, projects, and data assets within the offline feature store architecture.
Set up a SageMaker Unified Studio domain:
- Navigate to the SageMaker Unified Studio console and choose Domains in the navigation pane. Choose Create domain and then choose Quick setup.
- Choose to create a new VPC or select an existing VPC for the domain network configuration.
- Expand Quick setup settings and enter a domain name (for example, Company).
- For Domain Execution role and Domain Service role, select Create and use a new service role.
- Select a VPC (VPCs tagged with SageMaker Unified Studio should be correctly configured) and choose at least three private subnets, each in a different Availability Zone.
- Choose Continue to proceed with domain creation.
- Create an IAM Identity Center user account by providing an email address, first name, and last name (you'll receive an email to activate your IAM Identity Center account after creation).
- Choose Create domain to start the domain creation process. Domain creation typically takes 2–5 minutes to complete.
Create a SageMaker Unified Studio project:
- Go to the SageMaker Unified Studio domain URL, sign in using the IAM Identity Center credentials created during domain setup, and accept the invitation email if received.
- Create projects:
- Go to Select Project and choose Create project from the domain homepage to create a producer project named airlines_core_features. Add descriptions, configure settings, and assign users and permissions.
- For Project profile, select All Capabilities.
- Repeat the two preceding steps to create a consumer project named airlines_ml_models.
Note: Project creation typically takes 5–10 minutes to complete.
- Copy the project role ARN:
- Navigate to the airlines_core_features project details.
- Navigate to the project and copy the project role Amazon Resource Name (ARN) (arn:aws:iam::ACCOUNT:role/datazone_usr_role_*).
- Repeat the preceding two steps for airlines_ml_models.
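If you want to sanity-check the ARNs you copied before pasting them into the stack parameters, a small script can validate the expected format. The regex below is an assumption inferred from the `arn:aws:iam::ACCOUNT:role/datazone_usr_role_*` pattern shown above, not an official validation rule:

```python
import re

# Assumed shape of a Unified Studio project role ARN:
# arn:aws:iam::<12-digit-account>:role/datazone_usr_role_<suffix>
ARN_PATTERN = re.compile(
    r"^arn:aws:iam::\d{12}:role/datazone_usr_role_[A-Za-z0-9_]+$"
)

def is_project_role_arn(arn: str) -> bool:
    """Return True if the string looks like a Unified Studio project role ARN."""
    return bool(ARN_PATTERN.match(arn))

print(is_project_role_arn("arn:aws:iam::123456789012:role/datazone_usr_role_abc_123"))  # True
print(is_project_role_arn("arn:aws:iam::123456789012:role/SomeOtherRole"))              # False
```

A check like this catches the common mistake of pasting the role name instead of the full ARN.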
Infrastructure deployment
Deploy the CloudFormation stack only after completing the previous steps, including domain creation, project creation, IAM Identity Center setup, and S3 Tables enablement. The stack requires the following input parameters:
- IAM Identity Center ID: Navigate to the IAM Identity Center console to retrieve the first part of the AWS access portal URL. It will look like https://d-1234da5678.awsapps.com/start
- SMUSProducerProjectRoleName: Navigate to the Unified Studio project portal, open your producer project, and retrieve the project role name. For example, arn:aws:iam:::role/datazone_usr_role_xxxx_yyyy
- SMUSConsumerProjectRoleName: Repeat the preceding step to retrieve the consumer project role name.
After the stack deploys successfully, the following AWS resources will be created:
- S3 bucket: For raw and curated data storage, such as amzn-s3-demo-blog-smus-featurestore-{account-id}
- S3 table bucket: For Iceberg table storage, such as amzn-s3-demo-airlines-s3tables-bucket
- S3 table namespace: Organized feature table namespace, such as airlines
- S3 table: For feature storage, such as fg_airline_features
- AWS Glue database: For the source data catalog, such as airline_raw_db
- IAM Identity Center groups: Producer and consumer groups, such as FeatureStore-Producers and FeatureStore-Consumers
- Lake Formation permissions: Fine-grained access control, such as database and table permissions
- IAM roles and policies: For service integrations, such as SageMaker execution roles
Add SSO groups to the SageMaker domain
Assign IAM Identity Center groups for single sign-on (SSO) to the SageMaker domain to enable user access and collaboration across projects.
- Navigate to the SageMaker Unified Studio console and choose the domain in the navigation pane.
- Go to User management.
- Choose Add users and groups to manage domain access.
- Select SSO Groups.
- Choose the SSO groups that were created during infrastructure deployment:
- feature-store-admin-group (for admins)
- feature-store-producer-group (for data engineering teams)
- feature-store-consumer-group (for data science teams)
- Choose Add to complete the assignment and enable group access.
- The User management section will show the SSO groups as Assigned.
Create IAM Identity Center users
Create individual users for data producers, consumers, and administrators to access the SageMaker domain.
Create the data producer user:
- Navigate to the IAM Identity Center console and choose Users from the navigation pane.
- Choose Add user and enter the user details:
- Username: Enter dataproducer.
- Email: Enter a valid address, such as dataproducer@example.com.
- First name: Enter Data.
- Last name: Enter Producer.
- Assign the user to feature-store-producer-group.
- Send an invitation email for account activation.
Create the data consumer and admin users:
Use the same steps to create the data consumer and admin users, with the following differences for step 2:
- Create the data consumer user:
- Username: Enter dataconsumer.
- Email: Enter a valid address, such as dataconsumer@example.com.
- First name: Enter Data.
- Last name: Enter Consumer.
- Assign the user to feature-store-consumer-group.
- Create the admin user:
- Username: Enter dataadmin.
- Email: Enter a valid address, such as dataadmin@example.com.
- First name: Enter Data.
- Last name: Enter Admin.
- Assign the user to feature-store-admin-group.
Add user groups to projects
Sign in to the SageMaker Unified Studio corporate domain as the admin user from the console and assign the user groups to their corresponding projects for proper access control.
Assign users to the producer project:
- Navigate to the airlines_core_features project settings.
- Go to the Members section.
- Add feature-store-producer-group with appropriate project permissions.
Assign users to the consumer project:
- Navigate to the airlines_ml_models project settings.
- Go to the Members section.
- Add feature-store-consumer-group with appropriate project permissions.
Add a dataset to an S3 bucket
Create the S3 prefix /raw/AirlineDelayCause/ in the S3 bucket amzn-s3-demo-blog-smus-featurestore-{account-id} created by the CloudFormation template. Use the Amazon S3 console to upload the sample airline delay dataset to the S3 bucket prefix s3://amzn-s3-demo-blog-smus-featurestore-{account-id}/raw/AirlineDelayCause/
Verify data access
After uploading the dataset, navigate to the airlines_core_features project and query the data using the AWS Glue Data Catalog:
SELECT * FROM "awsdatacatalog"."curated_db"."airline_delay_cause" LIMIT 10;
Verify the feature store table
Query the feature store table to verify that it's accessible. It will return zero records but should execute without errors.
SELECT * FROM "s3tablescatalog/airlines"."airlines"."fg_airline_features" LIMIT 10;
This workflow demonstrates how data engineers create and share features using SageMaker Unified Studio and S3 Tables.
Sign in as the data producer user
Navigate to the SageMaker Unified Studio domain and sign in using IAM Identity Center credentials as the dataproducer user (a member of feature-store-producer-group), then access the airlines_core_features project.
Create a feature engineering job
- Download the airlines-delay-cause-feature-engineering-pipeline script to your local system.
- Navigate to Build Tools and select Data processing jobs.
- Select the airlines_core_features project, then choose Continue.
- Go to the Code-based job section and choose Create job from file.
- Choose file and select Upload your local file. Select the file and choose Next.
- Enter airlines-delay-cause-feature-engineering-pipeline as the job name and choose Submit.
- Select the created job and choose Run Job to run the pipeline.
Create the data source connection
- Navigate to Data sources.
- Choose Create data source.
- For Data source name, enter airlines_features_datasource.
- For Data source type, select AWS Glue Data Catalog.
- Go to Data selection and enter the catalog name s3tablescatalog/airlines.
- For the database name, select airlines from the dropdown list.
- Set Table selection criteria to * and choose Next.
- In Publish assets to the catalog, select No, and choose Next.
- Set the run preference to On demand and choose Next.
- Choose Create to create the data source and discover tables.
Publish feature assets
- Go to Assets and select the fg_airline_features asset, which was created by the data source job.
- Choose Publish asset, add metadata and descriptions, and complete publishing to make the asset discoverable across projects.
Verify the feature store
Query the feature store table that was loaded by the data processing job.
- Open the query editor from the Build Tools menu.
- Set the connection to the s3tables catalog, and run the following query:
SELECT * FROM "s3tablescatalog/airlines"."airlines"."fg_airline_features" LIMIT 10;
Now the feature store table will be discoverable by other projects.
This section demonstrates the end-to-end machine learning workflow for data scientists consuming features from the offline feature store built with S3 Tables and SageMaker Unified Studio, starting with the dataconsumer user sign-in.
Sign in as the data consumer user
Navigate to your SageMaker Unified Studio domain URL, sign in using the data consumer credentials, and select the consumer project airlines_ml_models.
Search for features
Use the search bar and enter fg_airline_features as the feature store name to find the published asset, then select the fg_airline_features catalog asset. You can also use the AI-powered catalog search to find features using a partial feature name or description.
Subscribe to assets
Choose Subscribe for the selected airline feature asset, enter a business justification for ML model development, and submit the access request for approval.
Access approved assets
The data producer user can view pending subscription requests in their workflow. When the subscription is approved by a data producer, the fg_airline_features table will be visible in the project catalog database and available to query, as shown in the following figure.
Data lineage tracking
Data lineage in SageMaker Catalog is an OpenLineage-compatible feature that you can use to capture and visualize lineage events (from OpenLineage-enabled systems or through APIs) to trace data origins, track transformations, and view cross-organizational data consumption. To view data lineage, select the Lineage tab to display the complete lineage of your subscribed assets, as shown in the following figure.
Query the feature store with time-travel capabilities
After you have access to the published feature store, you can use the following queries in the SQL query editor to query the data and use Iceberg's time travel capabilities.
Query the feature store:
SELECT * FROM "fg_airline_features" LIMIT 10;
Retrieves the latest 10 records from the feature store.
View available versions:
SELECT snapshot_id, committed_at, operation, summary FROM "fg_airline_features$snapshots" ORDER BY committed_at DESC;
Lists all historical snapshots with timestamps and operations, showing how the feature table evolved.
Time travel queries
-- Query a specific version
SELECT * FROM "fg_airline_features" FOR VERSION AS OF <snapshot_id_here> LIMIT 10;
Retrieves features from a specific snapshot version, ensuring reproducibility for model training.
-- List available timestamps
SELECT committed_at FROM "fg_airline_features$snapshots" ORDER BY committed_at DESC;
Lists all timestamps when the feature table was modified.
-- Query as of a timestamp
SELECT * FROM "fg_airline_features"
FOR TIMESTAMP AS OF TIMESTAMP '<timestamp_here>' LIMIT 10;
Retrieves features as they existed at a specific point in time.
View table history
SELECT * FROM "fg_airline_features$history"
ORDER BY made_current_at DESC;
Displays the complete audit trail with snapshot IDs and timestamps for compliance and debugging.
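To make training runs reproducible, you can record the snapshot ID a model was trained on and rebuild the same time-travel query later. The following helper is an illustrative sketch, not part of the blog's code; it only assembles query strings in the Athena/Iceberg shapes shown above:

```python
from typing import Optional

def time_travel_query(table: str, snapshot_id: Optional[int] = None,
                      timestamp: Optional[str] = None, limit: int = 10) -> str:
    """Build an Athena/Iceberg time-travel query for a feature table.

    Pass either snapshot_id (FOR VERSION AS OF) or a timestamp string
    (FOR TIMESTAMP AS OF). With neither, the latest data is queried.
    """
    base = f'SELECT * FROM "{table}"'
    if snapshot_id is not None:
        base += f" FOR VERSION AS OF {snapshot_id}"
    elif timestamp is not None:
        base += f" FOR TIMESTAMP AS OF TIMESTAMP '{timestamp}'"
    return base + f" LIMIT {limit};"

print(time_travel_query("fg_airline_features", snapshot_id=123456789))
# SELECT * FROM "fg_airline_features" FOR VERSION AS OF 123456789 LIMIT 10;
```

Logging the snapshot ID alongside model metadata (as the MLflow integration later in this post does) lets you regenerate the exact training dataset months later.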
This consumer flow enables efficient feature reuse across teams while maintaining proper governance and access controls.
Use an S3 table as an offline feature store
In this section, you learn how to set up an S3 table to use as an offline feature store for training and batch inference.
- Download the sample training-batch-inference notebook, which demonstrates how to use an S3 table as an offline feature store for model training and batch inference.
- Go to the consumer project airlines_ml_models.
- Go to Build Tools and select Jupyter Notebook, then upload the downloaded notebook. Configure the following parameters in the notebook based on your project:
- bucket: Enter the bucket name. For example, sagemaker_project_bucket
- region: Enter the AWS Region. For example, us-east-1
- s3_path: Enter s3://<amzn-s3-demo-bucket-name>//dev/
- database: Enter the database name. For example, project_glue_database
- table_name: Enter the table name. For example, project_feature_group_table
This notebook implements a training and batch inference pipeline for airline delay prediction using the Amazon SageMaker XGBoost regression algorithm. The pipeline executes the following steps:
- setup_mlflow: This function sets up the MLflow experiment using an existing SageMaker MLflow server for your project.
- load_data_from_athena: This function uses an Athena connection to load features from the S3 table into test and train data frames.
- train_model_with_comprehensive_metrics: This function persists the training dataframe to Amazon S3 and uses that dataset as input to the SageMaker XGBoost estimator to train the airline delay prediction regression model. The training experiment parameters and metrics, along with the S3 table snapshot ID and query, are logged to the MLflow experiment for reproducibility, as shown in the following figure.
- batch_transform: This process uses the test dataframe split as input to compute airline delay predictions using the trained model. Input, output, and performance metrics for this batch transform are logged in the MLflow experiment.
- analyze_predictions_with_evaluation: This function compares predictions to actuals and produces metrics and charts, as shown in the following figures. The charts include:
- Prediction distribution of the delay rate
- Actual compared to predicted delay rate
- Residuals plot of the predicted delay rate
- Actual compared to predicted distribution
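The evaluation step comes down to standard regression metrics on predicted versus actual delay rates. A minimal sketch of those computations in plain Python (the notebook itself likely relies on libraries such as scikit-learn, so this is only an illustration of the math):

```python
import math

def regression_metrics(actual, predicted):
    """Compute RMSE, MAE, and residuals for predicted vs. actual delay rates."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    mae = sum(abs(r) for r in residuals) / n
    return {"rmse": rmse, "mae": mae, "residuals": residuals}

# Toy data: actual vs. predicted per-carrier delay rates
actual = [0.20, 0.15, 0.30, 0.10]
predicted = [0.18, 0.17, 0.25, 0.12]
m = regression_metrics(actual, predicted)
print(round(m["rmse"], 4), round(m["mae"], 4))  # 0.0304 0.0275
```

The residuals list feeds the residuals plot, and the actual/predicted pairs feed the distribution comparisons listed above.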
Implementing an offline feature store with Amazon SageMaker Unified Studio and SageMaker Catalog provides a unified, secure, and scalable approach to managing ML features across teams and projects. The integrated architecture helps ensure consistent governance, lineage tracking, and reproducibility, while enabling seamless collaboration between data engineers and data scientists. By using Amazon S3 Tables with Apache Iceberg, organizations gain ACID compliance and time-travel capabilities, improving the reliability of training data and model performance. The publish-subscribe pattern simplifies asset sharing, reduces duplication, and accelerates model development life cycles.
To explore these capabilities in your AWS environment, set up your SageMaker Unified Studio domain, publish your first feature dataset, and unlock the full potential of reusable, governed ML assets for your organization.
We sincerely appreciate the thoughtful technical blog review by Paul Hargis, whose insights and feedback were invaluable.
About the Authors
Vijay Velpula
Vijay Velpula is a Data Architect at AWS Professional Services. He has extensive experience designing and implementing big data and analytics solutions to help customers optimize their cloud infrastructure. He is passionate about building scalable, reliable data architectures that accelerate digital transformation. Outside of work, he loves hiking, biking, and exploring new places with his family.
Ram Vittal
Ram Vittal is a Principal GenAI/ML Specialist at AWS. He has over 3 decades of experience building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help customers with their cloud adoption and optimization journey. In his spare time, he rides his motorcycle and enjoys nature with his family.

