How to Manage AWS SageMaker and Bedrock Costs: A FinOps Playbook for AI Teams

June 5, 2026

Denys Yermakov

5 min to read

There’s a number every CTO in financial services, insurance, and SaaS has seen at least once: a cloud bill that grew 40% quarter-over-quarter with no corresponding growth in revenue, users, or shipped features. According to recent surveys, 90% of multinational organizations have already identified measurable AI/ML cost optimization opportunities they haven’t acted on, and 60% report persistent underutilized AWS cloud resources they can’t explain. The cloud FinOps market is on track to reach USD 26.9 billion by 2030 precisely because this problem is getting harder, not easier – AI workloads are the fastest-growing line item on the cloud bill, and the least understood by the people approving the budget.

The root cause is structural. AI/ML workloads don’t behave like the compute and storage costs finance teams learned to forecast. They’re priced through tokens, provisioned throughput units, and GPU-hours – mechanisms that don’t map to how annual budgets are built. A single poorly engineered prompt can cost more than a thousand well-designed ones. A GPU instance left running through a long weekend can erase a month of savings plan optimization. A model deployed on the wrong instance type at launch will overspend every day for the life of that model in production. None of this shows up in a standard cloud cost dashboard without governance infrastructure built specifically for AI/ML workloads.

This article is the operational guide for building that infrastructure: phase by phase, mapped to the AWS Well-Architected Machine Learning Lens, with the real-world patterns that show what it looks like when it works.

Before we start: five terms you’ll need

FinOps – the discipline of applying financial accountability to cloud infrastructure. It bridges engineering decisions and financial outcomes by giving both teams a shared language around cost.

Tokens – the unit LLMs read and write, and how API usage gets billed. Roughly ¾ of a word each. Input tokens (what you send) and output tokens (what the model returns) are priced separately. A poorly constructed prompt can cost 10× more than a well-engineered one doing the same job.

Provisioned Throughput Units (PTUs) – reserved LLM inference capacity you commit to and pay for whether you use it or not. Cheaper per-token than on-demand when utilization is high; more expensive when it isn’t.

Showback vs. chargeback – showback tells teams what they spent; chargeback makes them pay for it out of their own budget. Showback drives awareness. Chargeback drives behaviour change.

Unit economics – cloud spend expressed per meaningful business output: cost per inference, cost per training run, cost per 1,000 tokens. This is what transforms a billing dashboard into a decision-making tool.
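To make the token and unit-economics definitions concrete, here is a minimal sketch of the arithmetic. The per-1,000-token prices are made-up placeholders rather than any provider’s actual rates, and `request_cost` and `cost_per_unit` are hypothetical helper names.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one LLM call when input and output tokens are priced separately."""
    return ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)


def cost_per_unit(total_cost: float, units: int) -> float:
    """Unit economics: spend divided by a business output (inferences, runs, ...)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Placeholder prices in USD per 1,000 tokens; substitute your provider's rates.
IN_PRICE, OUT_PRICE = 0.003, 0.015

verbose = request_cost(8000, 500, IN_PRICE, OUT_PRICE)  # bloated prompt
lean = request_cost(800, 500, IN_PRICE, OUT_PRICE)      # trimmed prompt, same answer length
monthly_unit_cost = cost_per_unit(1200.0, 400_000)      # e.g. $1,200 across 400,000 inferences
```

With these placeholder numbers the bloated prompt costs roughly three times the trimmed one for the same output, which is the kind of gap that only becomes visible once spend is expressed per request.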

Why AI/ML Costs Are Different

Before getting into specifics, it’s worth understanding why AI/ML cost management is a materially different discipline from general cloud FinOps.

Token-based billing doesn’t behave like compute billing. A single poorly constructed prompt can consume more tokens than a hundred well-engineered ones. Input and output tokens are often priced differently. Context window size (how much prior conversation the model can “see”) directly multiplies your token count. None of this is visible in a standard cost dashboard without purpose-built attribution.
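The context-window effect can be sketched with a few lines of arithmetic. This assumes a chat application that resends the full conversation history on every turn (a common pattern, though not the only one) and, for simplicity, treats every turn as the same size; `cumulative_input_tokens` is a hypothetical helper.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed across a conversation when each request
    resends all earlier turns plus the new one."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn  # the new message joins the history
        total += history            # the whole history is sent as input
    return total


resent = cumulative_input_tokens(turns=10, tokens_per_turn=200)
no_history = 10 * 200  # what the same ten messages would cost without resending
```

Ten turns of 200 tokens each bill 11,000 input tokens under this assumption, against 2,000 if no history were resent. Growth is quadratic in the number of turns, which is why history truncation or summarization tends to matter for long sessions.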

GPU scarcity means provisioning decisions have long lead times. Unlike CPU instances, where scaling is near-instant, GPU capacity, especially for large training runs, may require weeks of lead time to secure at scale. That means cost forecasts need to be built ahead of provisioning decisions, not after.

Experimentation is structurally expensive. ML development is iterative. Teams run hundreds of small training jobs, compare model variants, retrain on new data, and tune hyperparameters continuously. Each of those operations costs money, and without guardrails, experimentation costs in development environments can rival production inference costs.

The ML lifecycle has six distinct cost phases. The AWS Well-Architected ML Lens defines these as: business goal identification, ML problem framing, data processing, model development, deployment, and monitoring. Each phase has its own dominant cost drivers, anti-patterns, and optimization levers. Managing total AI/ML spend requires operating across all six simultaneously.

Phase 1: Business Goal Identification – Build the ROI Case Before You Build the Model

The most expensive mistake in ML is building a model for a problem that didn’t need one. The second most expensive is building the right model without defining how you’ll measure its value.

AWS Well-Architected best practice makes this explicit: define overall return on investment and opportunity cost before committing resources. This means classifying every ML initiative as either research-oriented (long-horizon, exploratory, uncertain returns) or development-oriented (applying established methods to near-term business value). The classification matters because the financial model is completely different – research projects need budget flexibility and patience; development projects need clear ROI timelines and production cost projections.

For CTOs and CFOs in regulated industries, this distinction also has governance implications. A research project that unexpectedly transitions into a production workload without updated cost modeling is a budget control failure waiting to happen.

What good looks like: A cost-benefit model built before the first GPU spins up. That model should account for data preparation, infrastructure, data scientist time, ongoing maintenance and retraining, and the business cost of model errors – not just raw compute. Use AWS Cost Explorer, AWS Budgets, and AWS Cost Anomaly Detection to set the financial baseline and alerting thresholds from day one.

The second foundational practice here is using managed services to reduce total cost of ownership. The TCO of Amazon SageMaker AI over a three-year period is substantially lower than equivalent self-managed infrastructure on EC2 or EKS, primarily because you’re not paying for the operational overhead of managing the infrastructure layer. For organizations with constrained ML engineering talent, which is most organizations, this isn’t just a cost argument; it’s a capacity argument.

Phase 2: ML Problem Framing – Validate the Approach Before Committing to the Architecture

Not every business problem that sounds like an ML problem is one. Fraud detection with constantly evolving patterns benefits from ML. Inventory reorder logic with stable rules probably doesn’t. Before investing in model development, AWS Specialists recommend a structured comparison against simpler alternatives: rules-based systems, lookup tables, statistical heuristics.

The cost implication is significant: a rules-based system that achieves 85% of the accuracy of a custom ML model at 10% of the operational cost is often the right answer for a VP of Engineering trying to deliver value efficiently.

When ML is the right answer, the next decision is custom versus pre-trained. This is where many organizations leave money on the table. Amazon SageMaker AI JumpStart provides access to over 150 pre-trained open-source models deployable in minutes. Amazon Bedrock offers foundation models from leading providers through a single managed API. For many common use cases (document classification, sentiment analysis, entity extraction, code generation), starting with a pre-trained model and adapting it via fine-tuning or retrieval-augmented generation (RAG) is faster, cheaper, and less risky than custom development.

The hidden cost in always building custom: data scientist time. Senior ML engineers are one of the scarcest and most expensive resources in technology. Every hour they spend rebuilding functionality available in a managed service is an opportunity cost, not just a direct cost.

Phase 3: Data Processing – Where 60–80% of ML Time Goes

Data preparation consistently consumes the majority of ML project time and cost. The FinOps opportunity here is structural: managed tooling, feature reusability, and automated labeling. Our practice addresses data labeling, which can be surprisingly expensive at scale. Amazon SageMaker Ground Truth Plus delivers up to 40% cost reduction versus building custom labeling infrastructure, by combining an expert workforce with active learning that automates labeling for similar items over time.

What is active learning in data labeling? A technique where the labeling system learns from human annotations and begins automatically labeling items it’s confident about, flagging only uncertain cases for human review. Over time, the proportion requiring human labeling decreases, reducing cost per labeled record. SageMaker Ground Truth has active learning built in.
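The routing idea behind active learning can be sketched as a confidence threshold. This is a simplified illustration, not Ground Truth’s internal algorithm; the function name, the tuple layout, and the 0.9 threshold are all assumptions.

```python
from typing import Iterable


def route_for_labeling(predictions: Iterable[tuple[str, str, float]],
                       threshold: float = 0.9):
    """Split (item_id, predicted_label, confidence) tuples into
    auto-labeled items and items that still need a human annotator."""
    auto_labeled, needs_human = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_labeled.append((item_id, label))  # accept the model's label
        else:
            needs_human.append(item_id)            # uncertain: send to review
    return auto_labeled, needs_human


batch = [("doc-1", "invoice", 0.97), ("doc-2", "claim", 0.62), ("doc-3", "invoice", 0.91)]
auto_labeled, needs_human = route_for_labeling(batch)
```

As the model improves with each round of human annotations, more items clear the threshold and the share that needs paid human review shrinks.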

For organizations in financial services or insurance, where labeled datasets are often proprietary and expensive to produce, this is material. Our practice also targets the analyst productivity problem: data preparation shouldn’t require custom Python scripts for every task. Amazon SageMaker AI Data Wrangler and SageMaker AI Canvas provide visual interfaces that dramatically reduce the time from raw data to model-ready features. The integration of Amazon Q for natural language data preparation assistance further compresses this timeline.

The highest-leverage cost practice in data processing is enabling feature reusability through Amazon SageMaker AI Feature Store. What is a Feature Store? A centralized repository for storing, sharing, and reusing the engineered data inputs (called “features”) that ML models are trained on and scored against. Without a Feature Store, different teams independently recalculate the same features, paying for the same compute multiple times and introducing inconsistency between training and production environments.

SageMaker Feature Store has two storage layers: an online store (millisecond-latency retrieval for real-time inference) and an offline store (historical data in S3 for model training and batch scoring). Both stay in sync automatically.

Phase 4: Model Development – Where the Largest Cost Variability Lives

This is the phase where cloud bills can move by orders of magnitude based on a handful of engineering decisions. Fourteen distinct cost optimization best practices apply here in the Well-Architected ML Lens. The most impactful for organizations at scale:

Instance selection is not a set-and-forget decision. The same training job can cost dramatically different amounts depending on whether you’re running on a P3, P4, G4, or Trainium instance, and the right choice depends on your model architecture, dataset size, and training approach. Deep learning on image, video, or language data benefits from GPU-accelerated instances. Traditional ML algorithms are often more cost-effectively trained on CPU instances.

The anti-pattern here is significant and widespread: teams provision the same instance type for training and inference, or default to the largest available GPU instance “to be safe.” Both behaviours create unnecessary cost.

Manage the experimentation tax (MLCOST04-BP03, BP08, BP09). Local training for small-scale experiments avoids unnecessary cloud spend during prototyping. Starting training runs with small datasets to validate the approach before scaling to full data reduces the cost of failed experiments. Stopping resources when not in use, through automated policies rather than manual discipline, eliminates idle GPU costs. SageMaker AI supports automatic shutdown of training jobs and notebook instances; these features should be enabled by default in any cost-conscious engineering organization.
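A rough sketch of what an idle-shutdown policy protects against. The hourly rate and the 60-minute idle limit are illustrative assumptions, and both helpers are hypothetical; in practice the stop action would be wired to whatever automation your platform team already uses.

```python
def idle_cost(hourly_rate: float, idle_hours: float, instance_count: int = 1) -> float:
    """Money spent on instances that were running but doing no work."""
    return hourly_rate * idle_hours * instance_count


def should_stop(idle_minutes: float, max_idle_minutes: float = 60) -> bool:
    """Policy check: stop the resource once it has been idle long enough."""
    return idle_minutes >= max_idle_minutes


# One GPU instance at an assumed $30/hour left running for a 72-hour long weekend.
weekend_waste = idle_cost(hourly_rate=30.0, idle_hours=72)
```

At the assumed rate, a single forgotten instance costs $2,160 over three days, which is why the policy should not depend on someone remembering to click stop.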

Managed Spot Training (MLCOST04-BP06) is the single highest-ROI cost lever for training workloads. Spot instances can reduce training costs by up to 90% compared to on-demand pricing. SageMaker AI’s managed Spot training handles checkpointing and automatic recovery from interruptions, making it practical for production training pipelines, not just experimental ones.
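The sketch below shows how the savings are usually expressed and what a Spot-enabled training configuration might carry. The keyword names mirror the SageMaker Python SDK’s Estimator parameters as we understand them (`use_spot_instances`, `max_run`, `max_wait`, `checkpoint_s3_uri`), so check them against the SDK version you use. The bucket path is a placeholder and `spot_estimator_kwargs` is a hypothetical helper.

```python
def spot_savings_percent(training_seconds: int, billable_seconds: int) -> float:
    """Savings relative to on-demand: (1 - billable / training) * 100."""
    return (1 - billable_seconds / training_seconds) * 100


def spot_estimator_kwargs(max_run: int, max_wait: int, checkpoint_s3_uri: str) -> dict:
    """Build Spot-related keyword arguments for a training job definition.

    max_wait covers time spent waiting for Spot capacity plus the run itself,
    so it must not be smaller than max_run.
    """
    if max_wait < max_run:
        raise ValueError("max_wait must be >= max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run,                      # seconds of actual training allowed
        "max_wait": max_wait,                    # seconds including waiting for capacity
        "checkpoint_s3_uri": checkpoint_s3_uri,  # where interrupted jobs resume from
    }


savings = spot_savings_percent(training_seconds=3600, billable_seconds=1080)
kwargs = spot_estimator_kwargs(3600, 7200, "s3://example-bucket/checkpoints/")
```

Checkpointing is what makes the discount usable: without a checkpoint location, an interruption restarts the job from scratch and can erase the savings.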

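Quantization changes the economics of inference. It stores model weights at lower numeric precision (for example, 4-bit integers instead of 16-bit floats), which shrinks the memory the model needs and therefore the instance it can run on. Accuracy impact is typically minimal for production use cases. A Llama 2 13B model quantized to 4-bit can run on a $500/month instance instead of a $2,500/month instance with no measurable degradation in output quality for most text tasks. Quantization is an engineering decision made once at deployment that shapes your cost structure for as long as the model stays in production.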
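The memory arithmetic behind that instance downgrade is simple enough to sketch. This counts weights only; the KV cache, activations, and runtime overhead add to the real footprint, so treat the result as a lower bound. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold model weights, in gigabytes (10^9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


params_13b = 13e9
fp16_gb = weight_memory_gb(params_13b, 16)  # 16-bit weights
int4_gb = weight_memory_gb(params_13b, 4)   # 4-bit quantized weights
```

A 13B-parameter model needs about 26 GB for 16-bit weights and about 6.5 GB at 4-bit, which is the difference between needing a large-memory accelerator and fitting on a much smaller one.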

Budget and tagging discipline from the start. Every training job, every endpoint, every processing job should be tagged with project, team, environment (dev/staging/prod), and model version. AWS Cost Categories and AWS Cost Explorer turn these tags into spend attribution you can slice by team, model, and environment. Without tagging, you have a total AI/ML bill. With tagging, you have accountability and the ability to identify which model versions, teams, or experiments are driving cost growth before it becomes a problem.

Tagging is the prerequisite for everything else. AWS resource tags are key-value pairs attached to every cloud resource. A well-designed tag taxonomy looks like this: team: fraud-platform, environment: production, model: credit-risk-v3, cost-centre: retail-banking. Once activated as cost allocation tags in the Billing console, these tags flow through to Cost Explorer, AWS Budgets, and Cost Anomaly Detection, enabling cost attribution by team, model, and environment without any custom tooling. Tagging must be established before production deployment. Retrofitting it after the fact against six months of untagged costs is one of the most painful and avoidable FinOps remediation exercises.
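A taxonomy only helps if it is enforced, so a pre-deployment check is a natural first step. The sketch below validates a resource’s tags against the example taxonomy above; the required keys come from that example, while the allowed environment values and the `tag_problems` helper are assumptions to adapt to your own standard.

```python
REQUIRED_TAGS = ("team", "environment", "model", "cost-centre")
ALLOWED_ENVIRONMENTS = {"dev", "staging", "production"}  # assumed value set


def tag_problems(tags: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    env = tags.get("environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unexpected environment: {env}")
    return problems


compliant = {
    "team": "fraud-platform",
    "environment": "production",
    "model": "credit-risk-v3",
    "cost-centre": "retail-banking",
}
ok = tag_problems(compliant)
bad = tag_problems({"team": "fraud-platform", "environment": "sandbox"})
```

Running a check like this in the deployment pipeline blocks untagged resources before they generate cost, which is far cheaper than reconstructing ownership afterwards.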

Phase 5: Deployment – Matching Infrastructure to Inference Patterns

Deployment is where AI/ML costs become recurring. The decisions made here determine the ongoing monthly spend for the life of the model in production. Our practitioners introduce the core deployment decision framework: the right hosting option depends on your inference pattern, not your preference.

Four SageMaker inference modes: which one fits your workload?
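SageMaker documents four hosting options: real-time inference, serverless inference, asynchronous inference, and batch transform. As a rule-of-thumb sketch only, the mapping from inference pattern to option might look like this; the three yes/no questions and the function name are our simplification, not an AWS decision procedure.

```python
def suggest_inference_mode(offline_dataset: bool,
                           large_payload_or_long_running: bool,
                           steady_traffic: bool) -> str:
    """Map a workload's inference pattern to a SageMaker hosting option."""
    if offline_dataset:
        # Nobody is waiting on individual responses: score the whole dataset.
        return "batch transform"
    if large_payload_or_long_running:
        # Requests are queued and results delivered when ready.
        return "asynchronous"
    if steady_traffic:
        # Always-on endpoint for consistent low latency.
        return "real-time"
    # Spiky or intermittent traffic that can tolerate cold starts.
    return "serverless"


nightly_scoring = suggest_inference_mode(True, False, False)
fraud_api = suggest_inference_mode(False, False, True)
internal_tool = suggest_inference_mode(False, False, False)
```

The point of the exercise is the order of the questions: start from how the predictions are consumed, then pick the hosting option, rather than defaulting to an always-on endpoint.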

Right-sizing the inference fleet. SageMaker AI Inference Recommender automates this process: it benchmarks your model against different instance types and provides recommendations based on actual latency and throughput performance, not guesswork. Multi-model endpoints, which host multiple models on a single instance, can dramatically improve utilization for organizations with many smaller models.

The provisioned throughput decision requires a different cost model. For API services like Amazon Bedrock and Azure OpenAI, the choice between on-demand per-token billing and provisioned throughput (PTU commitment) isn’t purely about cost – it’s about utilization.

On-demand vs. provisioned throughput – the utilization test

On-demand billing charges per token consumed. No commitment, no minimum, full flexibility. The right default for variable or growing workloads.

Provisioned throughput charges a fixed hourly rate for a reserved capacity block, regardless of actual usage. At 70%+ utilization, it’s cheaper per token than on-demand. At 30% utilization, typical of a business application used only during office hours, the effective cost-per-token can exceed on-demand pricing.

The test before committing: benchmark your peak tokens-per-minute, calculate your average daily utilization rate, and model both scenarios at your actual traffic pattern. Buying PTUs without this calculation is one of the most common and expensive AI/ML FinOps mistakes.
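The utilization test can be sketched as a break-even calculation. All figures below are placeholders chosen for illustration, not real Bedrock or Azure OpenAI prices, and the model uses a single blended per-token price for simplicity; both function names are hypothetical.

```python
def provisioned_cost_per_1k(hourly_rate: float,
                            capacity_tokens_per_hour: float,
                            utilization: float) -> float:
    """Effective price per 1,000 tokens actually used on provisioned capacity."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    used_tokens = capacity_tokens_per_hour * utilization
    return hourly_rate / (used_tokens / 1000)


def break_even_utilization(hourly_rate: float,
                           capacity_tokens_per_hour: float,
                           on_demand_price_per_1k: float) -> float:
    """Utilization above which provisioned throughput beats on-demand."""
    on_demand_cost_at_full_load = capacity_tokens_per_hour / 1000 * on_demand_price_per_1k
    return hourly_rate / on_demand_cost_at_full_load


# Placeholder figures: $20/hour for 10M tokens/hour, on-demand at $0.004 per 1K tokens.
RATE, CAPACITY, ON_DEMAND = 20.0, 10_000_000, 0.004

threshold = break_even_utilization(RATE, CAPACITY, ON_DEMAND)
at_70 = provisioned_cost_per_1k(RATE, CAPACITY, 0.70)
at_30 = provisioned_cost_per_1k(RATE, CAPACITY, 0.30)
```

With these placeholder numbers the break-even sits at 50% utilization: at 70% the provisioned block is cheaper per token than on-demand, and at 30% it is more expensive. Your own break-even depends entirely on the actual rates and your measured traffic.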

For regulated industries: AWS Inferentia2 and Trainium2 instances provide purpose-built accelerators for inference and training workloads. Where data residency and compliance requirements in financial services, insurance, and healthcare call for dedicated, single-tenant compute, confirm the tenancy options and Regional availability of the instance family before committing.

Phase 6: Monitoring – Protecting ROI After Launch

Model monitoring is where AI/ML FinOps closes the loop between technical performance and financial performance. A model that degrades silently costs money two ways: it consumes inference compute while delivering declining business value, and eventually it requires expensive emergency retraining.

Our practitioners establish the foundation: monitor usage and cost by ML activity. Amazon SageMaker AI Model Monitor continuously checks for data drift, model drift, and prediction quality degradation. Amazon CloudWatch surfaces these metrics alongside infrastructure utilization, enabling unified dashboards that correlate model performance with cost per inference.

Data drift and model drift: the silent cost multipliers

Data drift occurs when the statistical distribution of incoming data shifts away from what the model was trained on. Example: a fraud detection model trained before a new payment method was introduced starts seeing transaction patterns it was never trained on, and its accuracy degrades without any error being thrown.

Model drift occurs when the relationship between inputs and the correct output changes due to real-world shifts: customer behaviour after a market event, or seasonal patterns the model didn’t learn. Both types degrade performance silently, consuming inference cost while delivering less business value every day they go undetected.

SageMaker Model Monitor detects both automatically, comparing production data statistics against a baseline established at deployment and alerting when deviation exceeds your configured thresholds.
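To illustrate the baseline-versus-production comparison, here is a small sketch using the Population Stability Index (PSI). PSI is our choice for the example and is not necessarily the statistic Model Monitor computes; the bin count and the epsilon used for empty bins are assumptions.

```python
import math


def psi(baseline: list[float], production: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a production sample.

    Bins are equal-width over the baseline's range; production values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps the logarithm defined for empty bins
        return [max(c / len(values), eps) for c in counts]

    p, q = fractions(baseline), fractions(production)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline_amounts = [float(v) for v in range(100)]
shifted_amounts = [v + 50 for v in baseline_amounts]

no_drift = psi(baseline_amounts, baseline_amounts)
drift = psi(baseline_amounts, shifted_amounts)
```

A commonly quoted rule of thumb treats a PSI above roughly 0.2 to 0.25 as a significant shift, though the right alert threshold depends on the feature and the cost of a false alarm.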

We at Dedicatted take this further: monitor return on investment for ML models directly. This means establishing business KPIs tied to model outputs (conversion rates, fraud catch rates, churn prediction accuracy) and tracking them alongside cost metrics. A model that costs $15,000/month in inference but prevents $500,000/month in fraud losses has a clear ROI story. A model that costs $8,000/month and hasn’t moved its target business metric in six months is a candidate for replacement or simplification.
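The two examples above reduce to a small calculation plus a review rule. This is a minimal sketch; the six-month window comes from the example in the text, and both function names are hypothetical.

```python
def monthly_roi(value_delivered: float, inference_cost: float) -> float:
    """Net return per dollar of inference spend for one month."""
    return (value_delivered - inference_cost) / inference_cost


def needs_review(months_since_kpi_moved: int, max_months: int = 6) -> bool:
    """Flag a model whose target business metric has been flat for too long."""
    return months_since_kpi_moved >= max_months


fraud_model_roi = monthly_roi(value_delivered=500_000, inference_cost=15_000)
stale_model_flag = needs_review(months_since_kpi_moved=6)
```

The fraud model returns a little over $32 of net value for every dollar of inference spend, while the flat model is flagged for review regardless of how modest its bill looks.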

The FinOps Maturity Ladder for AI/ML

Organizations in the early stages of AI/ML adoption typically start with reactive cost management: they see the bill, investigate the spike, add a guardrail. Mature organizations operate with proactive financial governance built into the ML development lifecycle from the first commit. The markers of maturity fall along the familiar crawl, walk, run progression.

The distance between crawl and run isn’t primarily a tooling problem. AWS provides most of the instrumentation needed. It’s a process and accountability problem, and that’s exactly where a FinOps discipline, applied specifically to AI/ML workloads, delivers its value.

Where Dedicatted Can Help

Regulated industries face a compounding challenge: the AI/ML cost management practices described above need to coexist with compliance requirements around data residency, model explainability, audit trails, and access controls. A chargeback model built on tagging is straightforward in a greenfield environment; it’s considerably more complex in a financial services organization managing dozens of business units across multiple regulatory jurisdictions.

Dedicatted’s work as an AWS Advanced Tier Services Partner and responsible AI advisory firm is precisely at this intersection. We help organizations in financial services, insurance, retail, and SaaS build the governance frameworks, tagging architectures, and FinOps practices that make AI/ML cost management both rigorous and compliant without slowing down the teams building the models that matter. The cloud bill is decodable. It just requires the right lens.

Real-World Use Cases: What FinOps for AI/ML Looks Like in Practice

The frameworks in this article aren’t advisory positions – they’re the architecture of work we’ve already done. What follows are three engagements drawn from Dedicatted’s practice in regulated industries. Client names are withheld; the problems, the interventions, and the outcomes are real.

Financial Services: Building a Compliance-Ready Chargeback Model from Zero

The situation. A Canadian financial institution with multiple business lines (personal banking, commercial lending, and wealth management) had been running AI/ML workloads on AWS for 18 months. Each line of business had its own data science team, its own SageMaker environment, and its own interpretation of what “cloud cost” meant. Finance could see a total monthly AI/ML bill. They could not tell which business line was responsible for any part of it. When the internal audit team flagged the opacity of AI infrastructure spend as a governance gap, the engagement started.

What we built. Dedicatted designed and implemented a three-layer cost attribution architecture. The first layer was a tagging taxonomy aligned with the institution’s regulatory reporting structure: every SageMaker training job, processing job, endpoint, and storage bucket tagged with business line, model identifier, environment (dev/staging/prod), and cost centre. The second layer was AWS Billing Conductor, configured to distribute shared infrastructure costs (Reserved Instance savings, shared data pipelines, centralized Feature Store) across business units according to actual consumption ratios rather than arbitrary splits. The third layer was a set of AWS Cost Explorer dashboards, scoped by business line and surfaced to both the engineering leads and their corresponding finance partners – the first time those two groups had looked at the same number at the same time.

What changed. Within 90 days of go-live, each business line’s data science team was receiving weekly cost attribution reports tied to their own budget. A credit risk model that had been running an oversized real-time endpoint 24/7, despite being used only during business hours, was identified and moved to a scheduled endpoint pattern, cutting its monthly inference cost by 61%. The fraud detection team, seeing their own cost-per-inference for the first time, voluntarily initiated a model quantization review that reduced their GPU instance requirement by one tier.

The principle behind this engagement: Cost accountability only changes behaviour when the people making the engineering decisions can see the financial consequences of those decisions in their own reporting line. Chargeback is the mechanism. Tagging architecture is the prerequisite.

Regulated Enterprise: Embedding FinOps Governance Across a Multi-Team AWS Estate

The situation. A federally regulated Canadian enterprise with over 40 AWS accounts and seven internal product teams had reached a FinOps maturity ceiling. They had tagging policies, inconsistently applied. They had AWS Budgets alerts, mostly ignored because they were set at account level, not workload level. They had a central cloud team trying to govern spend they couldn’t attribute and couldn’t enforce. When AI/ML workloads began scaling across three of the seven teams simultaneously, the governance gap became urgent.

What we built. Dedicatted ran a four-week FinOps maturity assessment across all 40 accounts, mapping actual tagging compliance rates, identifying the 23 resource types generating 89% of AI/ML spend, and benchmarking utilization against the AWS Compute Optimizer recommendations that had been sitting unreviewed in the console for months. The output was a prioritized remediation roadmap with estimated cost impact per initiative.

The first intervention was tagging enforcement: we implemented AWS Config rules that flagged untagged SageMaker resources within 15 minutes of creation and triggered a notification to the owning team’s Slack channel. Non-compliance dropped from 67% to under 8% within six weeks. The second intervention was Budget Actions: budgets were rebuilt at workload level (one per model family, not one per account), with automated actions that throttled new training job launches when a workload exceeded 85% of its monthly allocation before the 25th of the month. The third intervention was a quarterly commitment optimization cadence: Dedicatted reviews the client’s SageMaker Savings Plans coverage ratio and Reserved Instance utilization every quarter and adjusts commitments based on the prior 90 days of actual consumption patterns.
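The throttling rule from the second intervention can be expressed as a single predicate. This is a sketch of the decision logic only, with a hypothetical function name; the actual enforcement ran through AWS Budgets actions, whose configuration is not shown here.

```python
def should_throttle(spend_to_date: float, monthly_allocation: float,
                    day_of_month: int, threshold: float = 0.85,
                    cutoff_day: int = 25) -> bool:
    """Throttle new training launches when a workload has burned through
    most of its monthly allocation early in the month."""
    over_threshold = spend_to_date > threshold * monthly_allocation
    return over_threshold and day_of_month < cutoff_day


early_overspend = should_throttle(8_600, 10_000, day_of_month=20)
late_overspend = should_throttle(8_600, 10_000, day_of_month=26)
on_track = should_throttle(8_000, 10_000, day_of_month=20)
```

The cutoff day matters: crossing 85% on the 26th is normal end-of-month consumption, while crossing it on the 20th signals a workload that will overrun unless something changes.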

What changed. Total AI/ML spend in the 12 months following the engagement was 34% lower than the preceding 12 months, despite a 40% increase in the number of models in production. The cost reduction came entirely from governance: eliminating idle resources, right-sizing endpoints, enforcing Spot Training for non-critical runs, and capturing Savings Plans discounts on workloads that were running continuously but unprotected by commitments.

The principle behind this engagement: FinOps maturity isn’t a one-time project. The quarterly commitment cadence is where the ongoing savings live, and it requires someone who tracks AWS pricing model changes and knows when your usage patterns have shifted enough to justify restructuring your commitments.

Retail & CPG: Turning Retraining Cadence into a Governed Budget Line

The situation. A Canadian grocery retailer was running SageMaker-based demand forecasting across approximately 180,000 SKUs in 340 store locations. The forecasting platform had been built over two years and was genuinely good – forecast accuracy was a competitive advantage in their category management operation. The cost problem was invisible to everyone except the cloud team: retraining was running on a weekly cadence for all SKU categories, including shelf-stable products where a monthly cycle would produce statistically identical forecast accuracy. Each weekly retraining run was on-demand compute, unprotected by Spot pricing, because the original architecture assumed time-sensitivity that the majority of SKUs didn’t have. Finance had no view into retraining cost at all: it appeared as an undifferentiated line in the SageMaker billing.

What we did. Dedicatted started with a retraining cost audit: we instrumented every training job with tags capturing SKU category, retraining trigger (scheduled vs. drift-detected vs. manual), and model version. Two weeks of tagged data gave us the first granular picture of where retraining compute was going. The analysis showed that 71% of weekly retraining compute was spent on SKU categories where a monthly cadence, tested against historical accuracy data, produced forecast error within 0.4% of the weekly model, effectively identical for planning purposes.

The architecture change had three components. First, we segmented the retraining schedule: perishables and promotional SKUs remained on weekly cadence; shelf-stable and slow-moving categories moved to monthly. Second, we migrated all monthly retraining jobs to SageMaker Managed Spot Training with checkpointing: these ran on weekend nights, when interruption risk was lowest and cost was the highest priority. Third, we implemented warm-start hyperparameter optimization (HPO): rather than running a full hyperparameter search on each monthly cycle, the platform now initializes from the prior cycle’s best configuration and runs a bounded search of 12 trials rather than the previous 80+.
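The warm-start idea can be illustrated with a toy local search. This is not the SageMaker warm-start tuning API and not the client’s implementation; it is a plain random search around the prior best configuration, with the spread, seed, and function name all assumed for the example.

```python
import random
from typing import Callable


def warm_start_search(prior_best: dict[str, float],
                      objective: Callable[[dict[str, float]], float],
                      trials: int = 12,
                      spread: float = 0.2,
                      seed: int = 0) -> tuple[dict[str, float], float]:
    """Bounded search: score the prior best, then `trials - 1` nearby candidates.

    Lower objective values are better. Each candidate perturbs every
    hyperparameter by up to +/- `spread` (as a fraction of its prior value).
    """
    rng = random.Random(seed)
    best, best_score = dict(prior_best), objective(prior_best)
    for _ in range(trials - 1):
        candidate = {name: value * (1 + rng.uniform(-spread, spread))
                     for name, value in prior_best.items()}
        score = objective(candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score


def toy_error(config: dict[str, float]) -> float:
    """Stand-in for a real validation metric; minimum at learning_rate = 0.1."""
    return (config["learning_rate"] - 0.1) ** 2


last_cycle_best = {"learning_rate": 0.12}
tuned, tuned_error = warm_start_search(last_cycle_best, toy_error)
```

Because the prior best is always evaluated first, the search can never return something worse than last cycle’s configuration, and the training budget is capped at the trial count rather than growing with the size of the search space.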

We built a unit economics dashboard (cost per training run by SKU category, cost per forecast accuracy point) that the head of supply chain and the finance controller now review together monthly. It was the first time the two had a shared metric connecting a model quality decision to a budget consequence.

What changed. Monthly retraining cost fell 58% within two billing cycles. Forecast accuracy across the platform was statistically unchanged. The retraining budget is now a line item in the annual technology plan, forecast by SKU category, reviewed quarterly, and adjusted when category composition or promotional volume shifts. The finance controller described it as the first AI/ML cost they could actually budget for.

The principle behind this engagement: Retraining frequency is almost always set by data scientists optimizing for accuracy with no visibility into cost. The right cadence is the one that achieves acceptable accuracy at the lowest training cost – a tradeoff that requires finance and data science to be looking at the same unit economics.
