Tokenomics: Governing AI Costs Before They Escalate

You’ve built the agents. You’ve deployed the copilots. Maybe you’re running local models through OpenClaw and Ollama alongside Azure OpenAI endpoints. For two decades, Microsoft sold work software the way it sold seats — one user, one license, one predictable line on the budget. That model just changed. As of June 2026, Copilot Cowork bills like Azure, not like Office: every task meters the model it picked, the context it pulled, the tools it called, and how long it ran. GitHub Copilot made the same move. The agentic layer of your stack is no longer a seat you buy, it’s a meter that runs every time an agent acts on your behalf.

Most organizations will discover this the way you discover fuel burn on a long day on the water: at the dock, looking at the receipt, wondering where it all went. This session is about governing those costs before they escalate — where AgentOps meets FinOps for the hybrid AI era. We’ll share patterns from real environments balancing Azure-hosted models with local alternatives, show how model routing becomes a deliberate cost lever instead of a default, and walk through a cost-assessment framework that helps leaders make build-vs-buy-vs-run-locally decisions they can defend to Finance. Because AI success was never measured by how many agents you deployed, it’s measured by whether you can govern them like the material business expense they’ve quietly become.

As organizations accelerate AI adoption, a new challenge is emerging: AI costs are becoming more difficult to predict, govern, and optimize. What began as a straightforward licensing conversation has evolved into a complex ecosystem of agents, tokens, model consumption, cloud infrastructure, usage-based billing, and AI-powered workflows.

The reality is that AI is no longer a single line item on a budget. Every interaction can trigger multiple layers of cost—from Copilot licenses and agent messages to token consumption, grounding calls, orchestration workflows, and model execution. As organizations move from experimentation to production, managing these costs becomes essential to scaling AI successfully.

In this session, Brian Haydin, Solution Architect at Concurrency, introduces the emerging discipline of Tokenomics—the practice of understanding, governing, and optimizing AI consumption across the enterprise. Through real-world customer examples and practical governance frameworks, Brian explores how organizations can bridge the gap between AgentOps and FinOps, creating the visibility and controls needed to prevent surprises before they appear on the monthly bill.

As AI workloads become increasingly autonomous, success is no longer determined by selecting the right model alone. Organizations must understand where AI workloads run, how they consume resources, and how to align cost with business value.

This session highlights a key takeaway:

AI cost governance is no longer a licensing problem—it is a workload placement and operational governance challenge. Organizations that develop visibility, accountability, and control across their AI ecosystem will be best positioned to scale AI responsibly and sustainably.

WHAT YOU’LL LEARN

Understanding the New Economics of AI

Why AI spending has become more variable, distributed, and difficult to track.
How a single AI task can consume multiple pricing meters simultaneously, from licenses and credits to tokens and cloud resources.
Why traditional cloud cost management tools often fail to explain the true drivers of AI spend.

The Five AI Cost Governance Planes

How to classify AI investments across licensing, agents, pro-code development, deployment models, and local AI environments.
Why each governance plane has different risks, ownership requirements, and optimization strategies.
How organizations can build a consistent operating model across their entire AI portfolio.

Bridging AgentOps and FinOps

Why AI governance now requires visibility into both operational behavior and financial consumption.
How tracing, telemetry, and observability help explain why AI costs occur—not just what was spent.
Techniques for measuring cost against business outcomes rather than prompt volume alone.

Designing Agents with Cost in Mind

How agent design decisions directly influence consumption and operational expenses.
Why ownership, spending caps, approval processes, and governance controls must be established before agents reach production.
Lessons learned from organizations managing large-scale Copilot Studio and custom AI deployments.

Optimizing Workload Placement and Model Selection

How model routing can automatically balance cost, quality, and performance requirements.
When to use frontier models versus smaller, more efficient alternatives.
How deployment choices such as pay-per-token, batch processing, reserved capacity, and local AI affect overall economics.

Building a Sustainable AI Governance Framework

A practical six-step framework for evaluating AI workloads before deployment.
How to establish accountability through business owners, technical owners, and cost centers.
A 90-day roadmap for improving visibility, governance, and cost optimization across AI initiatives.

FREQUENTLY ASKED QUESTIONS

Is this session technical or business-focused?

Both. The session connects technical AI architecture decisions with financial and operational governance considerations, making it relevant for both business and technology leaders.

Do I need advanced AI experience to attend?

No. The concepts are presented through practical business examples and governance frameworks that apply whether your organization is just beginning its AI journey or already operating AI solutions at scale.

Who is responsible for AI cost governance?

Successful organizations typically share responsibility across IT, finance, business stakeholders, and AI solution owners. A central theme of the session is establishing clear ownership and accountability models.

Is AI becoming too expensive for enterprises?

Not necessarily. The challenge is often a lack of visibility and governance rather than AI itself. Organizations that understand workload placement, consumption patterns, and operational controls can scale AI more effectively and predictably.

What is the biggest takeaway from this session?

Organizations should stop treating AI costs as a licensing problem and begin governing AI as an operational ecosystem where cost, value, and workload placement are managed together.

ABOUT THE SPEAKER

Brian Haydin, Solution Architect, helps organizations design, govern, and operationalize AI solutions across the Microsoft ecosystem. His work focuses on Copilot adoption, custom agent development, AI governance, AgentOps, and helping organizations connect technical innovation to measurable business value. Through customer engagements, thought leadership, and speaking events, Brian enables organizations to scale AI responsibly while maintaining visibility, accountability, and control.

TRANSCRIPT


Brian Haydin 0:06 All right, and we are live. Sorry, actually, we ran on time, but my partner in crime isn’t here with me today to get me started. So I want to just say welcome to everybody. Thanks for joining and spending a little bit of time today. Today we’re going to be talking about tokenomics, a term that’s starting to get a little bit of traction. We have really great attendance today in terms of the number of people who signed up. So I’m excited to talk about how to govern AI and manage all these different costs. My name is Brian. I’m a solution architect here at Concurrency. I want to encourage everybody to follow me on LinkedIn and follow Concurrency. I have a regular newsletter that I publish talking about some of these ideas. But what I want to talk about today is how to bridge two disciplines that most organizations are treating separately right now. There’s AgentOps and then there’s Finance Ops. And right now, in this hybrid AI world, I don’t think that we should be having conversations about them separately.
So before I start, I just want to mention Concurrency. For those of you that don’t know what we do, we are a Microsoft solution integration partner. We do ServiceNow work as well. And we do anything from modern workplace work, helping you manage your Microsoft ecosystem, to Copilot adoption and building custom agents with pro-code and low-code tools. Follow us on LinkedIn. The QR code is up here too, or just search for Concurrency Inc., and you’ll get notifications for all these webinars that we do on a regular basis. Love to connect with you on LinkedIn or outside.
I want to start with a couple of real examples. When I wrote this abstract up, I promised there were going to be a lot of stories as well as some frameworks.
But before I get into any of the client stories, I want to talk about a couple that really affected me personally. I use a tool called GenSpark. It’s a really fantastic tool to build these presentations, and actually most of the slides in this deck were created by GenSpark. I don’t get a commission, and maybe after you hear this story, you’re not going to want to use it. It was 20 bucks a month when I signed up for it. I tried it out. It was great. It made some really good presentations for me, and I was using it quite a bit. Then in month two, I started using it a lot more and I blew through that $20 token allotment. And I’m like, all right, well, that’s fine. I was able to buy extra tokens, so I just loaded up some overage and off I went. Well, the next month I ran out of tokens again. And so I’m like, all right, no big deal, I’ll just buy another re-up, and it wouldn’t let me do it. So I wasn’t able to buy an overage. I don’t know if the policy changed, or maybe you can only re-up once. But the only way out for me at that point was to commit to another year of the higher tier, and then I could double my tokens. So I’m thinking about that and I’m like, I just made a $20 decision originally, and now this is turning into a multi-year, higher commitment than I had planned for. I don’t want to make this about GenSpark in particular, but this is how usage-based pricing is going to work.
And for a lot of you, the same thing is happening with Cowork. So here’s another story that we had quite a bit of discussion about. I posted some blogs about this as well. Microsoft changed the licensing model for Cowork. Not Copilot, Cowork.
And so now the same thing that you were doing that was free suddenly starts costing a little bit more. And I got zinged on this one pretty big. Meghan, if you’re here, sorry I spent so much money. But this is what I wound up doing. I had this prompt that I wrote. You can see what I wrote here: produce a weekly strategic client intelligence briefing for the top 10 customers that I work with on a regular basis. And all of a sudden, after July 1st kicked in, I got a message from the person that’s monitoring this at the organization: hey, you’re blowing through money pretty quickly. What are you using it for? And honestly, I wasn’t even using it. I had this prompt that I was running previously when it was free, but it wasn’t delivering the results to me, so I just sort of forgot that it was running. And when I went and took a look at it, this was the actual cost. Every time it did this client briefing, it was $20. $20 for that prompt to run, when I could run it in regular Copilot, get the exact same results, and not pay any extra money. So maybe you have some similar stories, maybe not, but those are my two to kick things off.
But here’s the part that really rubs me the wrong way. I’m here to talk about tokenomics, and the thing is, I still can’t tell you what it costs. I don’t know what a GenSpark credit is actually worth. I don’t know the computation. Not really. I know somehow it’s balanced against some sort of tokens, but there’s no way for me to actually audit it. And Copilot Studio, it’s the same thing. I don’t know what a credit is worth. I don’t know how it burns the tokens. I don’t know what its calculations are. And honestly, I don’t even know if it’s the same every month, or if they’re changing that pricing behind the scenes.
So we’re all basically budgeting with a currency that none of us have the ability to define. Somebody in some back office is opaquely telling us what this is going to cost, and it’s making it impossible for us to budget. So that’s what this talk is about. How do I get ahead of this? How do I plan for this as an organization so that I don’t get these big surprise bills at the end of the month?
And here’s how I’m going to start it out. This is all changing really fast, so don’t be surprised if what I’m talking about today is completely antiquated in two months. But here’s what started in the last quarter. Two years ago, your AI bill was basically one single line item. It was the number of seats times $30, and it was really predictable, something your CFO was probably able to plan around. Now take a look at what we’re dealing with. The exact same business task, let’s say it’s just answering a customer’s question, is going to be billing across six completely different meters, three different ways of calculating the cost, all at the same time. There’s a per-user license. There’s a per-message meter. There’s a Copilot credit that’s somehow being consumed. Then there’s the Azure token meter that’s ticking somewhere in the background. There might be a block of provisioned throughput that you reserved, if you’re lucky enough to be able to compute that and do reserved capacity. And maybe you’re getting ahead of the curve and running GPUs on a local model someplace. Those are six meters, though, for one single task.
And for me, and in most organizations, nobody has the ability to see all six of those meters at once. So it’s not just that the bill got bigger, which it probably did. It’s that it got distributed, too. And distributed cost is the hardest one for us to govern. So if it feels like you’re behind on this, I would just say, don’t worry about it. You’re not alone. The data here is pretty stark.
And depending on which of the surveys you read, only about one in three organizations say that they have full visibility into what their AI actually costs to operate. About half of them have already delayed, paused, or scaled back AI projects specifically because of cost concerns. And just this morning, Jim Savage, if you’re listening, sent me a link about Microsoft turning off AI for a lot of their development teams. Think about that: Microsoft, at the forefront of this, limiting AI consumption for the people that matter the most. I made the comment to him that we just had layoff announcements at Microsoft, and if they’re cutting AI, treat that as another RIF, right? That’s more people, essentially these agents, being cut. Another one of the statistics I want to talk about is that CFOs are reporting cases where token usage jumped dramatically, up to sixfold, not because anything was broken, but because usage grew and nobody was watching what the meter was doing or how to calculate it.
So let’s be a little bit more precise about the problem. The problem isn’t that AI is too expensive. The problem is that AI spend has become variable, it’s become distributed, and it’s really hard to tie it back to business value. The risk isn’t the expensive model. The risk is unmanaged workload placement, and really, it’s a lack of telemetry. So the good news is that’s a governance problem. Well, maybe that’s the bad news. The bad news is it’s a governance problem. The good news is that governance problems are things that we can actually solve. So if I was going to boil this down to a single sentence today, I’d say this: AI cost governance is moving from license management to workload placement management. In the early Copilot days, the cost was a licensing question. How many seats is the organization going to buy?
I think that still matters, but once you’ve got copilots and custom agents, and you start bringing in cloud models and local models, all those things are in play, and the questions are going to change. It’s no longer about which AI model is the best. It’s really about what’s the lowest-cost, lowest-risk runtime that still meets the quality, the latency, the security, and the business value bar for this specific workload. And the reason this sneaks up on people is this. Licenses are the cost that you approve and sign off on. Tokens are the cost that you wind up discovering, things that you find out about after the fact. And agents are the cost that you’ve accidentally automated. You built something useful, suddenly it gets popular, and next thing you know, it’s spending money on your behalf while you’re sleeping.
So let’s talk about why the old tools that we use to manage this can’t really see that. Most of you probably have some cloud cost management already. Landing zones have tagging. You can set budgets on your subscriptions. If you’ve got an EA agreement and reserved instances, those are all genuinely good things. I would say keep doing those things. We don’t want you to change any of that. But it’s got some limits, right? Traditional cloud cost management tells you what you spent. It’ll tell you what you spent by subscription, by service, maybe by region or by tag. But what it can’t tell you is why an agent spent it. It can’t tell you that the prompt was three times bigger than it needed to be, or that a reasoning model got used where a cheap one would have done just fine. Or maybe you have an agent that got stuck in a retry loop and called a tool 40 times before it could figure out what it was doing. That behavior, you can’t see that on a bill. And in agentic AI, that behavior is where these costs are actually coming from.
So the tools that you built already, like your well-architected framework and your cost controls, they serve really well for the infrastructure, but they’re blind to all these different operations that are happening. And I want you to understand a little bit about what they’re actually blind to. Think about the mechanism here. A user is going to ask a question of an agent, just one single question, but walk through this with me for a second and see what it’s actually doing. It retrieves some data context, a bunch of data that it has to build up from. Then it calls a tool. Honestly, maybe it’s calling two tools. And then once it has all that information and access to those tools, it’s going to start to reason through a plan. That’s more tokens being generated. Then it decides it needs some help, so it delegates to another agent. What happens there? More tokens. Something comes back ambiguous, so it retries. Then it logs the whole trace and maybe stores some output. One question in, and look at how many of those different meters that I talked about earlier it ran on the way out.
And that’s where the surprise bill is actually coming from. It’s almost never one enormous model call. It’s about amplifying each one of those calls over and over again: one reasonable-looking question fanning out into dozens of billable steps. And the tension that should stick with you is that the unit of business value is a completed task, what happened at the end. The customer got their answer. But the unit of cost is every single step in between, and that’s the hard one. A cloud bill is only going to show you what the total cost was at the end of the month. And if you go back to one of my other talks about AgentOps, where we can trace this stuff, that’ll show you why it costs as much as it does.
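The fan-out described above can be made concrete with a small sketch. Everything in it is illustrative: the step names mirror the walkthrough, but the token counts and the per-1,000-token prices are hypothetical placeholders, not any vendor’s rates. It only shows the shape of the idea, which is that the cost of one completed task is the sum over every step in its trace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One billable step inside a single agent task."""
    name: str
    input_tokens: int
    output_tokens: int


# Hypothetical prices in dollars per 1,000 tokens (placeholders, not a real rate card).
PRICE_IN_PER_1K = 0.003
PRICE_OUT_PER_1K = 0.012


def step_cost(step: Step) -> float:
    """Dollar cost of one step at the assumed prices."""
    input_cost = (step.input_tokens / 1000) * PRICE_IN_PER_1K
    output_cost = (step.output_tokens / 1000) * PRICE_OUT_PER_1K
    return input_cost + output_cost


def task_cost(steps: list[Step]) -> float:
    """Cost of the completed task: the sum over every step in its trace."""
    return sum(step_cost(s) for s in steps)


# One user question, fanned out into the steps from the walkthrough.
# All token counts are made up for illustration.
trace = [
    Step("retrieve context", 6000, 0),
    Step("tool call 1", 1500, 300),
    Step("tool call 2", 1500, 300),
    Step("reason through plan", 8000, 1200),
    Step("delegate to sub-agent", 7000, 900),
    Step("retry after ambiguous result", 8000, 1200),
    Step("log trace and store output", 2000, 200),
]

if __name__ == "__main__":
    total = task_cost(trace)
    for s in trace:
        share = step_cost(s) / total
        print(f"{s.name:32s} ${step_cost(s):.4f}  ({share:.0%} of task)")
    print(f"{'completed task':32s} ${total:.4f}")
```

A monthly bill would show only the total. A per-step breakdown like this, fed from real trace data rather than made-up numbers, is what shows which step is doing the spending.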
So you’re blind in the old world, but there’s a second reason that old map fails, and it’s a little bit more fundamental. Remember that GenSpark credit I was talking about that I couldn’t define? Well, this is the systemic version of that problem, and it’s the second reason why your old cost map and strategies don’t really work. Across the entire organization, the billing unit, the credit, is not something that you can actually audit down to first principles. What is a Cowork credit worth in tokens? What is a Copilot Studio credit worth? Now, I want to be a little bit fair here, because it would be easy to bash Microsoft, and that’s not what I’m trying to do. Microsoft actually, in some of these cases, publishes a credit rate card. You can look it up and see that a classic answer costs you one credit and a generative answer costs you two. If you hit the Graph API for grounding, that’s going to cost you 10 credits. That’s actually a lot more transparency than most vendors like GenSpark give you. So again, not to bash Microsoft. But even with the rate card, mapping from a credit down to the actual token consumption and the actual cost is still really hard to compute. And it’s not something where your finance team can come back to you and say, help me audit this. So you can’t manage what you can’t define. That’s the failure of the principles underneath it, and that’s why the old maps fail.
So what I want to try to do is give you a better idea of how to do this. This is what I’ve come up with and prepared for this talk. I’ve got five different control planes. And I think the single biggest mistake that organizations are making is treating all AI costs as one thing, one single line, one owner, maybe just one strategy, and it’s not.
It’s actually split up across these five different games, and each of these games has its own set of rules. Up on top, the SaaS licensing plane. That’s your fixed per-user Copilot subscriptions. Below that, the metered agent plane, the usage-based credits and messages. Then there’s the pro-code plane, where I’ve built custom agents that start burning tokens, start using tools, and have orchestration. Then the deployment economics plane. This is the commercial shape of how you actually buy things, whether you’re doing pay-per-token versus batch or reserved instances. Then at the bottom, finally, something that people are starting to talk about, the local and hybrid plane, where the cost can actually live inside hardware and operations instead of tokens. So each of these has a different cost shape. They have different overrun risks, and they need different owners and different controls put in place to manage them. And to make it real, because I don’t want to just give you a framework slide and say, go use this, I’m going to walk through each one of these and give you an actual story associated with it, from real customers, real environments, and real numbers. I’m not going to share the names, and some of you are probably going to be embarrassed and reach out to me later.
But let’s start at the top. Plane number one, the license plane. This is the cost that finance actually sees coming, and it’s that fixed per-user subscription. It shows up pretty predictably every single month. So the risk here isn’t really a surprise bill. The risk is shelfware. Let me give you an example of that. I’m working with a Midwest manufacturing company, and they had genuinely meaningful Copilot activity, north of, I think it was, 20,000 Copilot actions in a month. And it was climbing. Sounds like a pretty good success story, right?
But when we started to break it down with them using the tools, we could see things happening. Several of the different groups that they were looking at were under three hours a week. Some of the groups barely touched it. And so here’s the thing: activity doesn’t mean value. Aggregate usage looks good, 20,000 interactions, but it hides the fact that whole departments aren’t getting any value out of their seats. The control here isn’t really a technical one, it’s a program one. You roll it out by personas and use cases, not all at once. Watch the adoption, measure that telemetry, reclaim the seats that nobody is using, and find users that are actually going to use them. Then you can measure business outcomes, not prompt counts. Adoption is something that you really need to run, not something that you’re going to be able to buy.
So plane two, the metered agent plane. This is where cost feels a little bit safe because the unit is so small. It’s just a penny per message. It’s just a couple of credits per query. And then we start to multiply that by an entire organization’s worth of questions and the sprawl of agents that we’ve created that nobody’s tracking. So here’s a real example. A Midwest packaging manufacturer that I worked with: when we came in to help them govern their agent platform, we discovered that the platform had automatically assigned 10,000 trial licenses across the entire tenant. 10,000. People had access and nobody had approved it. So step one of the governance was literally just to clean up. Then the real work started: the per-agent credit caps, how am I going to put the caps on that, and a consumption monitoring pipeline so that somebody can actually own the meter and report back on it. And look at this: a single prompt, one that grounds against your tenant graph, can stack credits really, really quickly.
Ten credits for the grounding, plus two for the generative answer, and now I’m hitting 12 credits just for a single interaction. And that was a cost that was decided by the developer when it was designed, not when it was actually used. So what’s the lesson here? Agent design, how we actually build these things, is financial design as well. If you let anybody publish Copilot Studio agents with no real ownership and no real cap, you’ve basically just pre-approved a cost that you aren’t going to be able to control or see. So one thing that we recommend is requiring an owner before anything goes into production. Learn how to govern those different environments, allow people to create things, enable the makers, but set agent-level limits. Design the cost on purpose. And Meghan, you did a great job doing that, so hats off for being collaborative and trying to figure that out.
Plane number three, the pro-code plane. This is where we build custom agents: you’ve got your Azure OpenAI, your Foundry workloads. This is the plane where most of the genuine surprise bills actually live. And this is where that amplification that I talked about a little bit earlier runs wild: those tokens, those tool calls, those retries, the reasoning models getting used instead of lighter-weight ones, and the multi-agent loops. But let’s use a story about a team that actually got ahead of it, because I think that’s a really useful way to think about this. I was working with a market analytics company, and they were trying to size out an automation pilot project that we were working on. Instead of just guessing, we calculated their expected token usage, and we deliberately multiplied that by three to pad it, just to make sure that they weren’t fooling themselves with an optimistic estimate. That exercise landed them at roughly a $10,000 annual cost for the pilot.
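The sizing exercise in that story can be sketched in a few lines. This is not the customer’s actual model: the run volume, tokens per run, and price below are hypothetical inputs chosen only to show the arithmetic, and the one number taken from the talk is the padding factor of three.

```python
def expected_annual_cost(
    runs_per_day: float,
    tokens_per_run: float,
    price_per_1k_tokens: float,
    days_per_year: int = 365,
) -> float:
    """Unpadded annual token spend for one workload."""
    tokens_per_year = runs_per_day * days_per_year * tokens_per_run
    return tokens_per_year / 1000 * price_per_1k_tokens


def padded_budget(expected_cost: float, padding_factor: float = 3.0) -> float:
    """Deliberately conservative budget: expected cost times a padding factor."""
    return expected_cost * padding_factor


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    expected = expected_annual_cost(
        runs_per_day=200,
        tokens_per_run=15_000,
        price_per_1k_tokens=0.003,
    )
    budget = padded_budget(expected)  # padding factor of 3, as in the talk
    print(f"expected annual cost: ${expected:,.0f}")
    print(f"padded budget:        ${budget:,.0f}")
```

With these made-up inputs the padded figure lands just under $10,000 a year. The useful part is not the number but that every input is written down, so the estimate can be challenged and revised before the workload scales.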
Now, was that number right? Probably not. But that’s not the point. The point is that they knew the unit economics and how to calculate them before they started to scale this out. They didn’t discover the cost after the fact on a bill. They engineered some conservative estimates up front, and that estimate became a budget that they could actually defend. And that’s the discipline in this plane. Know what the unit cost is, pad it honestly, and then decide whether you want to scale that.
So how do you actually catch this behavior in real time? The controls. This is the plane that needs the most instrumentation. And this is where my talks on AgentOps and this talk on Finance Ops stop being two separate disciplines and start becoming one. On the instrumentation side, you track token telemetry broken down by the app, by the agent, by the user, maybe by the workflow. You enable tracing on the tool calls so that you can see which tools are generating cost and latency. And you trace retries and delegations, because that’s where the runaway loops are going to hide.
Let me just pause for a second. I see some questions here in the chat. What do you use for Agent Cat? There’s a lot of people here, so I’m going to look at this every once in a while and see if there’s a good question. All right, not that anybody had a bad question, but I’ll get to those maybe a little bit towards the end.
Where was I? So on the governance side, you measure cost per completed outcome, not cost per prompt, right? Because a completed task is the thing that actually generates the business value. You want to route models so that simple work goes to the cheaper models. And then you put in the evaluation gates. I talk about evals in my AgentOps talks quite a bit.
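A minimal sketch of those governance-side controls, under stated assumptions: the tier names, prices, and eval pass rates below are invented for illustration, and the "gate" is simply a pass-rate threshold from offline evaluations. A real router would draw these values from actual eval runs and telemetry.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelTier:
    """A model option with its price and offline eval results per task kind."""
    name: str
    price_per_1k_tokens: float  # hypothetical price, dollars
    eval_pass_rate: dict = field(default_factory=dict)  # task kind -> pass rate


def route(task_kind: str, tiers: list, quality_bar: float = 0.90) -> ModelTier:
    """Pick the cheapest tier whose eval pass rate clears the quality bar.

    Falls back to the most expensive tier when nothing clears the gate,
    including task kinds that have never been evaluated.
    """
    by_price = sorted(tiers, key=lambda t: t.price_per_1k_tokens)
    for tier in by_price:
        if tier.eval_pass_rate.get(task_kind, 0.0) >= quality_bar:
            return tier
    return by_price[-1]


def cost_per_completed_outcome(total_cost: float, completed_tasks: int) -> float:
    """Spend divided by tasks that actually finished, not by prompts sent."""
    if completed_tasks <= 0:
        raise ValueError("no completed tasks to attribute cost to")
    return total_cost / completed_tasks


# Illustrative tiers; every number here is an assumption.
TIERS = [
    ModelTier("small", 0.0005, {"classify": 0.97, "summarize": 0.93, "contract_review": 0.71}),
    ModelTier("frontier", 0.0100, {"classify": 0.99, "summarize": 0.98, "contract_review": 0.96}),
]

if __name__ == "__main__":
    for kind in ("classify", "summarize", "contract_review"):
        print(kind, "->", route(kind, TIERS).name)
    print("cost per completed outcome:", cost_per_completed_outcome(12.0, 40))
```

The design choice worth noting is the fallback: when a task kind has no eval data, the sketch sends it to the most capable tier rather than the cheapest, so savings only apply where quality has been measured.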
So you put these gates in place that can confirm that when I downgrade to a cheaper model, it’s still meeting the quality bar that I want for that particular interaction. Another thing you can do is set model subset policies, so that I’m controlling which models people have access to, and nobody can quietly start using the most expensive frontier model as a blanket for everything when others are going to do. And I guess if you take anything from this plane: a cloud bill basically tells you what the agent was spending. That’s just the bill part of it. But if you implement AgentOps and the tracing, that’ll tell you why it was spending the money. If you only have the bill, you’re always just reacting to it. But if you have the trace, now you can govern it, right? You can go back and say, that’s a useless action, it’s not providing business value, and we’re going to start to govern those controls. And that’s the difference between plane three being your biggest risk and plane three being the one that’s actually under control.
Now, I brought up deployment economics, and that’s plane four. This is the one that’s easiest for technical teams to get wrong, because it’s not about the model. It’s about the commercial shape that you buy the exact same model in. So here’s a real story again. I have a regulated advanced materials manufacturing company, and they were designing some AI search capability. We mapped out the economics for this. The search component ran around $700 a month, let’s just say. So pretty manageable. But the bigger insight was the platform path. By choosing the right commercial approach for their workload instead of the obvious defaults, we calculated they were able to save on the order of about $50,000 a year.
Real savings: same capacity, same number of tokens, different wrapper, 50 grand. So here’s what I want you to think about. The same model can be bought pay-per-token. That’s the standard way to buy it: you go into Microsoft Foundry, set things up, and it starts billing you by the token. It’s great for variable traffic. If you can run the work in batches, you can save almost half the price, but then you have to change your assumptions: can you tolerate running the work asynchronously? It’s perfect for large document jobs. Or you can buy provisioned, reserved capacity, which fits predictable high volume. Choosing the model is important for evals and for getting the answers right, but it’s only half the cost decision. You also need to figure out how much you’re going to use and find the deployment shape that optimizes the money you’re investing.
Next plane, and one that’s relatively new. We haven’t experimented a lot with this, and most organizations don’t have much here, but I want to bring it up because I think it will take important shape by the end of this year and be a routine discussion probably by mid next year: local and hybrid inference. There’s a seductive myth attached to it: if you run it locally, it’s free. No more token bills. That isn’t the case at all, and I have a story for that. I’ve got a manufacturer running in air-gapped, regulated cloud environments. They wanted to light up some Copilot capabilities, and it turned out that capability was double-gated.
The capability needed GPU availability, and it needed reserved data platform capacity that simply wasn’t available in that environment. So the cost didn’t disappear when they looked at going local or sovereign. The cost and the constraint moved into a different bucket: hardware availability they needed to purchase, a compliance boundary, and slower feature access in that cloud environment. That’s the hybrid question: am I going to run some of this locally, where I may or may not be able to get the hardware I need, or in a hybrid cloud, where I’m constrained by what it has to offer?
The lesson is that local doesn’t mean free. It means you have to plan those costs differently. Maybe the cost moved out of your per-token cloud bill, but now you’re investing in GPU hardware, and you have endpoint life cycles, a support burden, and maybe observability gaps, because some of this is hard to trace in a local environment. And now your agent is running somewhere Azure cost management might not be able to see.
A quick bit of vocabulary so nobody gets lost. Runtimes like Foundry Local and Ollama let you run models on local hardware. OpenClaw is a little different: it’s an agent framework people have started to use that consumes both local and cloud models, which means it can still rack up loop and tool-call costs even when the model itself is free. Local inference is a great strategy for bounded, repetitive, private, or offline workloads, but it’s not a blanket answer that solves everything. A lot of the foundation models are really powerful. The most important thing I want to leave you with is that even if you move some of this to a local workload, it’s not free.
You still have to buy the hardware and staff to manage the hardware.
So let me bring all of this together. The five planes aren’t a random list. They’re a progression: your cost problem matures with your architecture. When you’re early and just rolling out Copilot, the whole game is license governance: are people using the seats we bought? As business users start to build agents, you graduate to usage governance: credits, messages, sprawl, ownership. And when you’re doing serious pro-code and hybrid work at scale, you’re into architecture governance: token explosion, workload placement, the things that are really hard to figure out.
Here’s why it matters for everybody. Take a second to locate yourself and your organization on the curve. Most organizations are somewhere in stage one or stage two, and maybe some of you are about to hit stage three, whether you’re prepared for it or not. The good news is that if you start at the beginning, this discipline compounds. The ownership and telemetry habits you built when you were governing licenses are the same muscles you’ll use when you need to govern the architecture. So don’t start over at each stage. Build on what you have.
Now let’s talk more about controlling it. Let me tell you about a question I got from a consumer products company, because I think it resonates with a lot of you, and you may have asked it yourself. They were about to launch a customer-facing chatbot. Pretty exciting, and a good use case. Before go-live, the leader stopped and asked: wait, can we put a cap on this? They were getting good results. So what happens if this actually works? What happens if this thing gets popular?
Could it hand us a mid-five-figure bill at the end of the first month? The answer is yes, it could. I love that question because it’s the right instinct. You should be thinking about that. They weren’t afraid of failure. They were afraid of success they couldn’t control. That’s a maturity moment, and it’s what got me thinking about these FinOps topics.
We’ve been mapping the cost. We’ve got the six meters we talked about and the five planes. That’s the first step. But mapping isn’t managing. For the rest of the talk, I want to look at how we get ahead of that first-month bill. The single most important control is knowing how to run your models like a spread.
So I’m going to bring one of my fishing analogies into the conversation. Anybody who has fished on Lake Michigan or the Great Lakes knows you don’t throw one line in the water at one depth. You build a spread: multiple lines, multiple lures, everything at different depths, because the fish aren’t always where you expect them to be. And even if they are at one depth, you don’t know which one. You have to get out there and figure out what’s biting.
Model routing is the same discipline applied to AI cost. Instead of every request hitting one expensive frontier model, one line at one depth, a model router looks at each task in real time and sends it to the right model. Simple classification goes to a small, cheaper model. Complex, high-stakes reasoning goes to the big frontier models, the heavy hitters.
You’re trying to match the model to the task the way you match the lure to the depth. And this is why it’s a big deal for the people in this room: model routing turns your cost policy into runtime behavior. You’re not writing a governance memo and hoping the developers follow it. The routing enforces it automatically on every request. That’s what finance policy looks like in a runtime environment.
The router gives you three basic dials. I want you to think of these as governance dials, not developer settings, because most of the people who come to these webinars are leaders, and these are the ones you actually set. The first is cost mode: prioritize the cheapest model that still gets the job done. It’s perfect for high-volume, budget-sensitive workloads where you’re doing the same simple thing a million times. On the right side, we’ve got quality mode: prioritize maximum accuracy. This is for your critical outputs, such as compliance-sensitive work, legal reviews, and executive reporting. Anything with a blast radius gets expensive when the answer is wrong. Most things land in the middle, in balanced mode, which weighs cost against quality dynamically and gives you a sensible default instead of defaulting to the higher-end models.
On top of the modes, there are two more controls that matter for governance people. You can define model subsets, which restrict which models are allowed into the routing decision at all. And you can enforce that with an Azure policy at the platform level.
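
The three modes and the model-subset restriction can be sketched as a small routing function. This is a toy illustration under stated assumptions, not the API of any real model router: the model names, prices, quality scores, the 0.5 cost weight, and the `route` helper are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float  # hypothetical price in USD
    quality: float             # hypothetical eval score between 0 and 1


# Hypothetical catalog; names and numbers are placeholders, not real pricing.
CATALOG = [
    Model("small", 0.0002, 0.70),
    Model("mid", 0.002, 0.85),
    Model("frontier", 0.02, 0.97),
]


def route(mode: str, quality_bar: float, allowed: set[str]) -> Model:
    """Pick a model from the approved subset according to the routing mode."""
    # Model subset: anything not approved never enters the decision.
    candidates = [m for m in CATALOG if m.name in allowed]
    if not candidates:
        raise ValueError("no approved models available")
    # Prefer models that clear the quality bar; fall back to all candidates.
    pool = [m for m in candidates if m.quality >= quality_bar] or candidates

    if mode == "cost":
        # Cheapest model that still clears the bar.
        return min(pool, key=lambda m: m.cost_per_1k_tokens)
    if mode == "quality":
        # Highest-scoring approved model, regardless of price.
        return max(candidates, key=lambda m: m.quality)
    if mode == "balanced":
        # Trade quality against cost normalized to the priciest candidate.
        max_cost = max(m.cost_per_1k_tokens for m in pool)
        return max(pool, key=lambda m: m.quality - 0.5 * m.cost_per_1k_tokens / max_cost)
    raise ValueError(f"unknown mode: {mode}")


everything = {"small", "mid", "frontier"}
print(route("cost", 0.8, everything).name)
print(route("quality", 0.8, {"small", "mid"}).name)
```

Note how the second call never returns the frontier model: because it is outside the approved subset, quality mode tops out at the best approved option.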
Enforcing the subset at the platform level is how you actually manage this for developers: they can’t use an unapproved model, or one that’s way too expensive for the organization or the job. It never even enters the conversation. That’s how you build cheap-when-possible, expensive-when-necessary behavior into the system.
Here’s a different lens, an operational version of plane 4, the deployment decision we talked about. I put it in a table so you can screenshot it if you want, and we’ll publish the deck for you as well. Let me walk you through it. If you’re prototyping, or your traffic is variable and bursty, stay in the pay-per-token lane. Don’t commit to capacity you haven’t vetted or don’t understand yet. Once you have predictable, high-volume production workloads, that’s when provisioned throughput starts to pay off: you’re reserving capacity because you know you’re going to consume it. If you’ve got a big pile of documents to process and it doesn’t have to be done instantly, send it to batch and run it asynchronously at roughly half the cost. And if you’ve got offline, private, or tightly bounded workloads, those may need local inference, which is a different decision to work through.
The mistake I see people making is staying on the interactive pay-per-token rate for huge batch jobs that could run overnight at half the price. Or the opposite: reserving a bunch of capacity before they have any idea how much they’re going to use or what the real demand is. Learn to match the shape to the workload, and don’t just default to one or the other.
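
The deployment-shape guidance above can be written down as a simple decision function. This is a minimal sketch, assuming four yes/no facts about a workload are enough to pick a lane; the `Workload` fields and the order of the checks are illustrative assumptions, not a rule from any pricing documentation.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    high_volume: bool           # large, sustained token volume
    volume_is_predictable: bool
    needs_realtime: bool        # a user is waiting on the answer
    must_stay_local: bool       # offline, private, or tightly bounded


def deployment_shape(w: Workload) -> str:
    """Match the commercial shape to the workload instead of defaulting."""
    if w.must_stay_local:
        return "local inference"
    if w.high_volume and not w.needs_realtime:
        # Big jobs that can run asynchronously, for example overnight.
        return "batch"
    if w.high_volume and w.volume_is_predictable:
        # Reserve capacity only when you know you will consume it.
        return "provisioned throughput"
    # Prototypes and variable, bursty traffic.
    return "pay-per-token"


prototype = Workload(high_volume=False, volume_is_predictable=False,
                     needs_realtime=True, must_stay_local=False)
nightly_docs = Workload(high_volume=True, volume_is_predictable=True,
                        needs_realtime=False, must_stay_local=False)
print(deployment_shape(prototype))
print(deployment_shape(nightly_docs))
```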
With routing and the deployment shape in hand, we can start to answer some of the other questions, like that $50,000 question from before. But let’s zoom out for a second, because the model landscape itself is what makes all this possible, and also what makes it necessary. We’re not in a single-model world anymore. Most of us use different tools for different workloads for different reasons, and the enterprise stack is starting to look like a portfolio. I’ve got frontier models, like the big OpenAI models, that can do a lot of the heavy lifting. I also have small, efficient models in the Microsoft ecosystem, like the Phi family. Microsoft also has its own MAI models, which were announced at Build. I haven’t done a lot with these yet. The rollout is slow: there are some you can experiment with, but they’re being released in a limited way. Then you’ve got partner models, like Anthropic, maybe Mistral, and others, which are available in Foundry, so you can govern those as well. You’ve got open-weight, Llama-family models. And you’ve got local, on-device options coming out as well.
On the specifics of Microsoft’s models: what’s coming out is mostly from the trade press and isn’t baked into the official pricing documentation, but I do know it’s going to affect some of the pricing numbers. Don’t take any of this as me having access to things you don’t have. Don’t quote me on this. I will deny everything. There are some anecdotal stories about these MAI models reducing cost by around 50% in some use cases. But we haven’t seen it yet.
So I’m not making any promises. All right, portfolio. The portfolio is not about picking winners. It’s what makes workload placement possible. The governance opportunity is to route each task to the smallest, cheapest, and most compliant model that still clears the quality bar. More choices don’t make governance less important. They make it more important. Understanding the portfolio lets you figure out where to place workloads, and the governance controls you put in place are what make that safe.
So, back to the consumer products company and their $50,000 bill. Here’s the answer we walked them through. Notice that it’s not one magic setting. It’s a sequence of events. Step one: start on pay-as-you-go. Turn on caps, turn on limits, and start to observe. Don’t commit to any spend or approach at that point. Step two: measure. Maybe it’s 30 days, maybe it’s 60. I don’t know the right number; it depends on what the workload is doing. Watch the actual usage and the real cost per completed outcome. You’re going to start with a guess, but let the actual behavior show up. Step three: commit. Now you know whether you want prepaid credits or reserved capacity, because you’ve seen what the demand curve looks like. I mentioned this before, but reserved capacity only saves you money if you use it. If you bought $1,000 of credit at a 50% discount and only used $500 of it, you didn’t save yourself any money. That’s the whole discipline in a slide: measure first, commit second.
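
The $1,000 example is worth working through as arithmetic, because it shows where a commitment breaks even. The sketch below is a simplified model, assuming that any usage beyond the committed amount is billed at list price; actual commercial terms vary, and the function names are hypothetical.

```python
def commitment_savings(credit_value: float, discount: float, used_value: float) -> float:
    """Savings versus pay-as-you-go for a prepaid commitment.

    credit_value: list-price value of the capacity committed to
    discount:     fractional discount on the commitment, e.g. 0.5 for 50%
    used_value:   list-price value of what was actually consumed
    """
    price_paid = credit_value * (1 - discount)
    # Assumption: usage beyond the commitment is billed at list price.
    overage = max(used_value - credit_value, 0.0)
    pay_as_you_go_cost = used_value
    return pay_as_you_go_cost - (price_paid + overage)


def break_even_utilization(discount: float) -> float:
    """Fraction of the commitment you must use before it saves anything."""
    return 1 - discount


print(commitment_savings(1000, 0.5, 500))   # the example from the talk
print(commitment_savings(1000, 0.5, 1000))  # fully used commitment
print(break_even_utilization(0.5))
```

With a 50% discount, using exactly half of the commitment nets out to zero savings, and using less than half costs more than pay-as-you-go would have. That is why the measuring step comes before the commitment.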
You want to scale once you know what the price actually is, not before you’ve priced a single unit.
So what’s the framework? Everything we’ve covered so far, the five planes, the routing, the deployment shape, measure and then commit, I tried to reduce to a single slide, something you can take back and apply to a workload before you start to build it. I came up with six steps. First, classify the workload. What is this thing actually supposed to be doing? If you don’t know that, it’s really hard to understand what the problem space is going to be. Second, determine its value and its risk. How often does it run? What is it worth? What is it going to do for you? What’s the cost of being wrong? I talk about that blast radius a lot. Third, choose the experience plane. Is this a Copilot license play? Am I going to build a Copilot Studio agent? Is this something I’m going to move into pro-code, like Foundry? Or is it something I can build using local AI or run as a local job?
Fourth, choose the model and the runtime. Do I need a small model? Do I want a frontier model? Is this a place where the router can help make that decision for me? Batch or local are some of the other runtime decisions. Fifth, set the controls before production hits. What I’m talking about here are owners, budgets, and caps on those budgets. Did I incorporate logging? Do I have the right evals in place so that if I want to switch models later, I can measure whether it’s still doing the things I expect? And finally, sixth, optimize after launch, because your first choice is probably not going to be the best choice, and you only figure out the best choice by watching and learning.
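
One way to make the first five steps concrete is to record them per workload and block production until none are blank. The sketch below is an assumption about how such a checklist might be kept; the `WorkloadAssessment` class and its field names are hypothetical and not taken from any tool mentioned in the talk. Step six, optimizing after launch, happens later and is not part of the gate.

```python
from dataclasses import dataclass


@dataclass
class WorkloadAssessment:
    classification: str = ""      # step 1: what the workload does
    value_and_risk: str = ""      # step 2: frequency, worth, cost of being wrong
    experience_plane: str = ""    # step 3: Copilot, Copilot Studio, pro-code, local
    model_and_runtime: str = ""   # step 4: model size, router, batch or local
    owner: str = ""               # step 5 controls start here
    budget_cap_usd: float = 0.0
    logging_enabled: bool = False
    evals_in_place: bool = False

    def gaps(self) -> list[str]:
        """Names of the steps or controls that are still unanswered."""
        checks = {
            "classification": bool(self.classification),
            "value_and_risk": bool(self.value_and_risk),
            "experience_plane": bool(self.experience_plane),
            "model_and_runtime": bool(self.model_and_runtime),
            "owner": bool(self.owner),
            "budget_cap_usd": self.budget_cap_usd > 0,
            "logging_enabled": self.logging_enabled,
            "evals_in_place": self.evals_in_place,
        }
        return [name for name, ok in checks.items() if not ok]

    def ready_for_production(self) -> bool:
        return not self.gaps()


draft = WorkloadAssessment(
    classification="high-volume document extraction",
    value_and_risk="weekly, moderate risk, human review downstream",
    experience_plane="pro-code",
    model_and_runtime="small model, batch",
    owner="finance-ops",        # placeholder owner
    budget_cap_usd=2000.0,      # placeholder cap
    logging_enabled=True,
)
print(draft.gaps())
print(draft.ready_for_production())
```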
The beauty of this is that it’s pretty much plane-agnostic. The framework works whether the right answer turns out to be a $30 license or a custom Foundry agent. I want to map it to a real workload, start to finish. The use case is something we do quite a bit: bulk document processing. Say you’ve got thousands, tens of thousands, or hundreds of thousands of documents that you need to extract data from every week.
Step one is to classify it. By definition, this is very high-volume document extraction. Step two is value and risk. It’s high volume, and it does not need to happen in real time. The risk is moderate, because a human reviews the output downstream. Step three: what’s the experience plane? I’d say this isn’t a Copilot license play. It’s probably a pro-code play, and probably an API-driven job.
Step four is where the money comes into play: the model and the runtime. You don’t need a frontier model to extract fields, so you pick a smaller, more efficient model. And because you don’t need the answers to be instant, you can run them at night or once a week, in batch mode. Step five is the controls you need to put in place before production. You have to assign an owner. You have to set a budget cap. You have to set up your evaluation gates so you can confirm that accuracy still holds when you move down to the cheaper models.
And finally, step six: optimize. Once it’s running in batch, look at the rates, see where they land, and see whether a batch rate or a reserved instance rate will save you some money.
So in one clean pass through this framework, we went from a document problem to a defensible, cost-optimized decision. I can switch to a small model, I can switch between batch and not batch, I have the controls, and I can measure the cost impact. No more guessing, no more surprise bills.
So how can you, the people in this room, take this back and start working on it next week? Here’s the 90-day version. In the first 30 days, the goal is just to stop flying blind. Inventory the AI experiences you already have in play: the Copilot licenses, the Studio agents, how many Foundry projects you have up and running, and any local environments people are experimenting with. Assign an owner and a cost center to each one so you can start tracking it. Confirm the billing policies are actually configured so the caps are in place, and then set some initial budgets and alerts. That’s it. You’re not optimizing anything yet. You’re just turning the lights on.
From day 31 to 60, you start to instrument and govern. Add real telemetry to your custom agents so you can see where the token spend is, which tools are being called, and how often they have to retry. Define an approved model catalog and lock down the models you don’t want used because they’re too expensive. Set the tool-call and retry limits so nothing gets into a forever loop.
And then you start reporting, so the business units can actually see the cost they’re creating.
From day 61 to 90, you can start to optimize. Compare pay-per-token against batch and provisioned for your big workloads. Turn on model routing, because now you have the governance and the instrumentation in place to see where it actually helps. This is also where you start reclaiming the licenses nobody is using. And stand up recurring quarterly portfolio reviews, so this becomes a habit and not just a one-time exercise.
If you take one operating rule from this talk, make it this one. Every production agent gets three things before you ship it: a business owner, somebody accountable for the agent; a technical owner, somebody who knows how that agent behaves; and a cost center. If it doesn’t have all three, don’t let it go into production just yet. That one rule prevents most of the sprawl I see organizations dealing with and a lot of the surprise bills we’ve talked about today, because it means somebody is accountable for its value and somebody is accountable for its cost.
And yeah, thumbs up. I’d rather tell it to you straight and be honest with you than try to sell you something. Standing this up from scratch, with the telemetry, the governance, and the operating model, is really hard to do the first time. It’s muscle memory most people don’t have yet. It is going to slow down the pilots, and I’m just being transparent about that. What we thought would be “I’ll have Claude Code run this and I’ll be up and running in two days” turns into three, four, five, six weeks.
The reason it takes longer is that you’re preventing those surprises from happening in the future. And that’s the kind of work we do at Concurrency: bridging AgentOps and FinOps and building the control plane so you can scale AI without flying blind. That’s the meat of what we do here. If any of that resonates with you, I think Amy dropped a link. Yep, there it is. Connect with us, and give me feedback on whether you liked this talk or not.
So let me bring this home. The single biggest mistake organizations make is that they wait. They wait for the surprise bill to show up, and then they scramble to build all the controls after the fact. Let’s flip that script together. Don’t wait for the bill to build the control plane. Think about the control plane first, and you won’t get surprised by those bills. That’s the mindset shift that needs to happen underneath all of this. Stop treating AI as a bunch of disconnected pilots, each with its own meter and its own blind spot. Start governing these things.
I know organizations are starting to bring in AI product owners or AI enablement roles. Those are all great things to be thinking about. But at the end of the day, you need to know which workloads belong in Copilot, which belong in Studio, and which need Foundry, and having people who can help you think through that process is going to save a ton of money for you and the organization. Govern the portfolio, instrument the agents, route intelligently, and measure against your outcomes. That really is what tokenomics is going to mean for these organizations.
All right, we’ve got a few extra minutes here. I’ll leave you with a thank you to everybody who was here.
And I’ll take a look at some of these questions in the chat, if anybody has anything they want to ask. I’ll stick around for a minute.
“What do you use for the agent catalog to track metadata such as owner attestation, et cetera?” Nate, that was a really good question. Agent 365 has some of those controls baked into it now, and it’s definitely something people are starting to investigate. It’s brand new, and there are some licensing trade-offs.
Matt, you had a question about how we should think about handling technologies that have already been deployed. That’s a great question. I talked about that in the 30-day plan: you have to start by inventorying those things. You’re not going to fix it right out of the gate, and I don’t think you should set that as an expectation for the organization. But you can’t start to govern without an inventory of what those things look like and owners assigned to them.
“We are looking at Agent 365, but it is also another license and cost for each user. And in typical Microsoft fashion, very difficult to understand that licensing.” Nate and Tracy, you’re spot on with that. Honestly, I only know one or two organizations that have really started to roll it out, so my advice right now is still fairly anecdotal as we work through it. But I’d love to connect with you and hear how you think it impacts your users.
Let’s see what else. Expected completion dates: yep, expected completion dates are going to change now that you’re adding some of these controls. All right, great questions. With that, thanks to everybody who joined. Don’t forget to click that link and fill out the survey. Thanks for joining us.

