AgentOps: Running AI Agents in the Real World

Building an AI agent demo is easy. Running agentic systems reliably in production is where the real work begins.

As teams move from prototypes to real applications, they run into a new set of engineering problems: how do you evaluate non-deterministic behavior, trace agent decisions, monitor tool usage, manage prompts and policies over time, and detect when quality starts to drift? This session is a practical look at the operational side of agentic systems, including observability, evaluations, versioning, debugging, feedback loops, and rollout strategies.

Rather than treating AgentOps as yet another layer of hype, we’ll focus on the patterns that actually help teams run AI systems with more confidence and less chaos. You’ll leave with a practical framework for making agents more measurable, supportable, and production-ready.



As organizations move AI agents from flashy demos into real production workflows, many teams hit the same wall: agents that looked great in a pilot start behaving unpredictably the moment real users, real data, and real consequences are involved. Between rising expectations, model and prompt drift, and pressure to “ship agents,” it’s easy to underestimate how much operational discipline production agents actually require.

In this webinar, we break down AgentOps—the practices, patterns, and tooling that close the trust gap between a promising agent and one that runs safely in production. You’ll learn how observation, evaluation, gating, and control come together to keep agents reliable as models change, prompts evolve, tools are added, and data shifts underneath them.

Brian Haydin (Solution Architect at Concurrency, co‑organizer of the local Global AI Community) walks through what AgentOps is, why traditional testing isn’t enough for probabilistic systems, and how teams can adopt it today using the Microsoft Agent Framework, Azure AI Foundry, OpenTelemetry, and the new Agent 365 governance layer. The talk is grounded in real project work—including the Friday‑afternoon deployment that started misrouting tickets while every dashboard stayed green.

WHAT YOU’LL LEARN

  • What AgentOps actually is—the four-part loop of observe, evaluate, gate, and control—and why it’s the missing layer between agent POCs and reliable production deployments.
  • Why agents fail differently from traditional software: drift, silent regressions, wrong tool selection, and confidently wrong final answers that mask broken intermediate reasoning.
  • How OpenTelemetry is becoming the plumbing for agent observability, and how it integrates natively with the Microsoft Agent Framework, Azure AI Foundry, and other ecosystems.
  • The state of the Microsoft agent stack in 2026, including:
    • Agent Framework GA (April 2026) as the unified successor to Semantic Kernel and AutoGen, with long-term support for .NET and Python.
    • Foundry tracing GA, new guardrails (task adherence, prompt optimization), and Foundry MCP servers.
    • Agent 365 GA and what it means for governance, identity, and licensing.
  • What an eval really is—the six components every eval needs (scenario, input, expected behavior, assertion type, scoring, business impact)—and why your best evals come from production failures.
  • Why versioning matters more than teams realize, including what to track: model versions (and hidden provider system prompts), prompts and instructions, MCP tool contracts, evaluator models, policies and guardrails, and RAG indexes/embeddings.
  • The deployment ladder that replaces “demo → production”:
    • Dev and local testing
    • Offline eval gates with golden datasets
    • Shadow mode (the step most teams skip)
    • Canary rollouts with small traffic slices
    • Human‑in‑the‑loop review
    • Scoped production deployment
    • Full autonomy with monitoring and defined rollback triggers
  • The intervention ladder for restraining a misbehaving agent: monitor → review mode → restricted tool calls → capability disable → full kill switch.
  • How Foundry’s control plane and Agent 365 identities let you inventory agents, monitor health, and revoke access the same way you’d offboard an employee.
  • What to actually monitor in production—latency, quality scores, safety flags, token consumption—and why picking 3–4 metrics that matter beats a 100‑metric dashboard.
  • A practical starter kit you can implement this week: add tracing, capture tool call inputs/outputs, define 5–10 golden prompts, add one CI/CD regression check, put a human in the loop on one risky action, document a kill‑switch playbook, and start measuring cost.
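The six eval components listed above (scenario, input, expected behavior, assertion type, scoring, business impact) can be captured as a small record. This is a minimal sketch, not the speaker's implementation: `EvalCase`, `AssertionType`, `score_rule`, and the example ticket are hypothetical names invented for illustration, and only the rule-based assertion type is implemented.

```python
from dataclasses import dataclass
from enum import Enum


class AssertionType(Enum):
    """Ways of judging a run named in the talk."""
    RULE = "rule"
    SEMANTIC_SIMILARITY = "semantic_similarity"
    LLM_JUDGE = "llm_judge"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class EvalCase:
    scenario: str              # real-world situation under test
    input: dict                # prompt, context, tool state
    expected_behavior: dict    # routing, escalation, refusal, output shape
    assertion_type: AssertionType
    pass_threshold: float      # scoring: minimum score that counts as a pass
    business_impact: str       # risk level if this case fails


def score_rule(case: EvalCase, observed: dict) -> float:
    """Fraction of expected key/value pairs that the observed run matched."""
    expected = case.expected_behavior
    hits = sum(1 for key, value in expected.items() if observed.get(key) == value)
    return hits / len(expected)


def passes(case: EvalCase, observed: dict) -> bool:
    """Apply the case's threshold; only rule-based scoring is sketched here."""
    if case.assertion_type is not AssertionType.RULE:
        raise NotImplementedError("only rule-based assertions are sketched")
    return score_rule(case, observed) >= case.pass_threshold


# Hypothetical case modeled on the misrouted billing ticket from the talk.
billing_case = EvalCase(
    scenario="Customer disputes a duplicate charge",
    input={"ticket": "I was charged twice for my subscription"},
    expected_behavior={"routed_to": "billing", "escalated": False},
    assertion_type=AssertionType.RULE,
    pass_threshold=1.0,
    business_impact="medium",
)
```

Keeping business impact on the record lets a gate treat a failed high-risk case differently from a failed low-risk one.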

FREQUENTLY ASKED QUESTIONS

Is this webinar a product demo?

No. Live tooling and patterns from Azure AI Foundry, the Microsoft Agent Framework, and OpenTelemetry are referenced, but the focus is on the operational discipline and decision points behind running agents safely—not feature walkthroughs.

Why can’t we just unit‑test our agents like normal software?

Traditional software fails like a machine—deterministically and repeatably. Agents fail like a confident junior employee: fast, eager, and just wrong enough to be expensive. The same prompt can produce different answers, model swaps cause silent behavior changes, and polished final answers can mask broken intermediate reasoning. AgentOps exists to make those failure modes visible.
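Because the same prompt can produce different answers, a single assertion says little. One common approach is to run the same input many times and assert on the pass rate instead. The sketch below assumes a hypothetical `fake_agent` that stands in for a real, non-deterministic agent call; the 90% figure inside it is made up for the example.

```python
import random
from collections import Counter


def fake_agent(ticket: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an agent: usually right, sometimes not."""
    return "billing" if rng.random() < 0.9 else "support"


def pass_rate(agent, ticket: str, expected: str, runs: int, rng: random.Random) -> float:
    """Run the same input repeatedly and return the share of expected outcomes."""
    outcomes = Counter(agent(ticket, rng) for _ in range(runs))
    return outcomes[expected] / runs


rate = pass_rate(fake_agent, "I was charged twice", "billing", runs=200, rng=random.Random(7))
print(f"pass rate over 200 runs: {rate:.2f}")
```

A regression check would then compare `rate` against a threshold chosen for the scenario rather than expecting an exact match on every run.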

What is “drift” and why is it so hard to catch?

Drift is when the final answer looks correct but the path the agent took to get there was nonsense. Because outputs still appear plausible, drift hides from logs and error rates—dashboards stay green while customers start complaining. Trace‑level evals across each step are how you catch it.
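A trace-level eval checks the path as well as the answer. The following is a rough sketch under stated assumptions: `Step`, `evaluate_trace`, and the tool names are hypothetical, and a real trace would come from your tracing backend rather than a hand-built list.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str    # e.g. "plan", "tool_call", "retrieval", "answer"
    name: str
    output: str


def evaluate_trace(trace, expected_tools, expected_answer):
    """Score the final answer and the tool path separately."""
    tools_called = [step.name for step in trace if step.kind == "tool_call"]
    final = next((s.output for s in reversed(trace) if s.kind == "answer"), None)
    return {
        "answer_ok": final == expected_answer,
        "path_ok": tools_called == expected_tools,
    }


# Hypothetical run: the answer is right, but the agent used keyword matching
# instead of the classifier it was supposed to call.
trace = [
    Step("plan", "decide_route", "classify, then route"),
    Step("tool_call", "keyword_search", "matched 'charge'"),
    Step("answer", "final", "billing"),
]
result = evaluate_trace(trace, expected_tools=["classify_ticket"], expected_answer="billing")
print(result)
```

A run where `answer_ok` is true and `path_ok` is false is the kind of case an answer-only check would miss.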

How is AgentOps different from MLOps or DevOps?

AgentOps borrows the discipline of CI/CD and observability but adapts to probabilistic, multi‑step workflows. It treats every step in an agent run—planning, tool calls, retrieval, synthesis, downstream actions—as a distinct evaluation point, and adds gating and control mechanisms (approval modes, tool permissions, kill switches) that traditional pipelines don’t need.
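The control mechanisms mentioned here can be modeled as an ordered ladder that every tool call is checked against. This is an illustrative sketch only: the rung names follow the intervention ladder described on this page, while `authorize`, `SAFE_TOOLS`, and the assumption that each rung keeps the restrictions of the rungs below it are choices made for the example.

```python
from enum import IntEnum


class Intervention(IntEnum):
    MONITOR = 0
    REVIEW_MODE = 1
    RESTRICTED_TOOLS = 2
    CAPABILITY_DISABLED = 3
    KILL_SWITCH = 4


# Hypothetical allow-list of read-only tools that survive restricted mode.
SAFE_TOOLS = frozenset({"search_docs"})


def authorize(level: Intervention, tool: str, disabled=frozenset()) -> str:
    """Return "allow", "needs_approval", or "deny" for one tool call."""
    if level >= Intervention.KILL_SWITCH:
        return "deny"                      # agent is fully stopped
    if level >= Intervention.CAPABILITY_DISABLED and tool in disabled:
        return "deny"                      # this capability is switched off
    if level >= Intervention.RESTRICTED_TOOLS and tool not in SAFE_TOOLS:
        return "deny"                      # only allow-listed tools remain
    if level >= Intervention.REVIEW_MODE:
        return "needs_approval"            # a human confirms before acting
    return "allow"                         # monitor only


print(authorize(Intervention.REVIEW_MODE, "send_email"))
```

Making the level a single value that can be changed at runtime is what lets an operator step up the ladder without a redeploy.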

Do these patterns only work in the Microsoft/Azure ecosystem?

No. The talk leans on Microsoft Agent Framework, Azure AI Foundry, and Agent 365 for concrete examples, but the underlying patterns—OpenTelemetry tracing, trace‑driven evals, versioning, deployment ladders, intervention ladders—apply to any agent platform.

What’s the biggest takeaway for teams getting agents into production?

You’re not behind—only about 11% of organizations have agentic AI actually running in production. Don’t go straight from demo to production. Build the loop—observe, evaluate, gate, control—pick 5–10 critical evals, define your rollback triggers and kill switch before you need them, and let real production scars feed your test suite over time.
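A gate over a handful of critical evals can be as small as a function that compares a candidate run against the last known-good baseline. The sketch below is an assumption-laden example: the eval names, the `min_score` floor of 0.8, and the `max_drop` tolerance of 0.05 are all hypothetical values, not figures from the talk.

```python
def gate(baseline: dict, candidate: dict, min_score: float = 0.8, max_drop: float = 0.05) -> list:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for name, base in baseline.items():
        score = candidate.get(name)
        if score is None:
            failures.append(f"{name}: missing from candidate run")
        elif score < min_score:
            failures.append(f"{name}: {score:.2f} is below the floor of {min_score:.2f}")
        elif base - score > max_drop:
            failures.append(f"{name}: regressed from {base:.2f} to {score:.2f}")
    return failures


# Hypothetical scores for two critical evals before and after a prompt change.
baseline = {"billing_routing": 0.95, "refund_refusal": 0.90}
candidate = {"billing_routing": 0.85, "refund_refusal": 0.90}

for message in gate(baseline, candidate):
    print("GATE FAILED:", message)
```

In a pipeline, a non-empty result would fail the build, which is the same condition a rollback trigger can watch for after release.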

ABOUT THE SPEAKER

Brian Haydin is a Solution Architect at Concurrency, a Microsoft‑focused systems integrator based in Milwaukee, and a co‑organizer of the local Global AI Community. He works with enterprise teams to take AI agents from pilot to production, focusing on the observability, evaluation, and governance patterns that keep agents trustworthy as they scale. He speaks regularly across the country on AgentOps and the practical realities of running agents in real environments.

TRANSCRIPT

Brian Haydin And welcome everybody. I’ll go slow here on the introductions just to give people a few minutes. I know we had quite a lot of people sign up at the last minute. So a little bit about me. I’m Brian Haydin. If you don’t know me, I’m a solution architect here at Concurrency. 0:0:25.216 –> 0:0:46.976 Brian Haydin We’re a Microsoft-focused systems integrator based out of Milwaukee, Wisconsin, you know, pretty much right in the heart of the Midwest. Around me personally, I also am a co-organizer of the local Global AI Community. And so doing, you know, webinars, getting out and speaking and sharing, you know, things is something that I really enjoy doing. 0:0:47.56 –> 0:1:8.136 Brian Haydin So today, we’re going to talk about a part of AI agents that, you know, a lot of people haven’t really been talking about until recently. And I’ve been doing this talk, this is maybe the 6th or 7th time that I’ve actually done this talk, first in a webinar format, maybe the only one in a webinar. But, you know, a lot of people are really starting to talk about this. And 0:1:8.216 –> 0:1:30.496 Brian Haydin this is where you actually have to take the pilot and run these things in production and make sure that things aren’t breaking. So I just want to be clear that everything that I’m going to talk about today is from real project work that we’ve been doing here at Concurrency. The stories that I’m going to talk about are things that we’ve actually had to work through and 0:1:30.536 –> 0:1:49.296 Brian Haydin built the scars and the patterns in order to, you know, to have this really hold up, you know, when the demo gods aren’t really smiling at you. So there’s a QR code on the screen if you want to follow me on LinkedIn. I do a pretty regular newsletter and I post some of my thoughts. 0:1:49.816 –> 0:2:8.96 Brian Haydin And then if you’re, you know, well, just doesn’t really matter where you are. But I’ll be doing this talk, you know, around the country.
I’ve done it in Orlando, down in Chicago. I’m going to be doing a talk in Chattanooga later this spring. But follow me on LinkedIn. Maybe I’ll be in a spot near you. 0:2:8.576 –> 0:2:27.336 Brian Haydin So getting into the weeds. So, you know, how many people, am I going to do a poll? Yeah, I do have a poll, but like just sort of mentally think how many of you have actually deployed stuff on a Friday afternoon. And I, you know, I used to think to myself, why would people want to do that? And 0:2:28.176 –> 0:2:46.416 Brian Haydin you know, this story is going to be about a Friday, you know, that probably is going to resonate with most of you. But there are some reasons why people want to do this. First off, like if I’m running a website, my e-commerce shows the traffic doesn’t, you know, really hit on Friday night, people are out doing dinners and fun things, you know, that might be a reason why you want to do it. 0:2:46.896 –> 0:3:9.456 Brian Haydin But it’s not really enjoyable for people like me who have to be, you know, in the war room triaging these things. So we had this like support triage agent that was routing customer service tickets, you know, just real simple, you know, kind of route this agent, you know, route the tickets to the right team. And, you know, everything was working pretty much just the way we expected. And one Friday, 0:3:9.536 –> 0:3:29.736 Brian Haydin you know, 4:37, we pushed what we thought was just some minor improvements, right? A small tweak to the prompt, changed the model to like something a little bit newer. And all we were doing was just trying to improve the ticket classification. And at first, everything looked like it was going great. It was faster. The answers were more confident. 0:3:30.416 –> 0:3:49.136 Brian Haydin The routing seemed to be pretty clear, but a little after five, we kind of saw the first like little subtle clues that things weren’t running correctly.
A misroute here, a misroute there, and a billing ticket that landed in the, you know, support queue instead of like where it was supposed to go, but not enough for us to like, you know, 0:3:49.696 –> 0:4:8.896 Brian Haydin hit a pause button or, you know, really, you know, get you concerned. But by 5:20, you know, 15 minutes later, things started to like really aggregate, right? You start to see that clustering logic that something isn’t right. The grouping of unrelated tickets together, just because they had some keywords that were there. 0:4:9.936 –> 0:4:30.496 Brian Haydin Obviously, we’re seeing something behaviorally that was wrong. And by 5:30-ish, you know, we had switched the whole thing off to like a human review mode, and we stopped all the automatic routing. Here’s the kicker, though. The infrastructure was showing everything was green. If you were sitting in the back room saying, how’s everything working, you’d look at your screen. 0:4:30.896 –> 0:4:50.16 Brian Haydin Logs are clean, error rates were normal. The agent was just confidently wrong, right? And so we weren’t able to really catch it until the customer started noticing and the customer started making the complaints. And honestly, this is what makes agents different. So this isn’t really hypothetical things anymore. What we’re seeing is… 0:4:50.96 –> 0:5:8.976 Brian Haydin you know, more public examples of the frontier labs treating capability and deployment risk as like two separate problems. Anthropic recently announced a gated release for Claude, the Mythos, you know, story, and for a broader public release, but they said that… 0:5:9.136 –> 0:5:28.256 Brian Haydin it’s the most capable model yet, but it was being made available only as a gated research preview for people that were working on critical defender issues. Why? Because it was getting out. And so even people building the frontier models are telling us, you know, that the capability is accelerating
0:5:28.656 –> 0:5:49.936 Brian Haydin faster than our comfort level is with these uncontrolled like agent deployments. So the patterns that we’re going to be talking about today are really about promising, how you keep a promising agent from becoming an investment, you know, an expensive incident that you have to go back to your leadership with. So I got a quick little poll here. I want to get like a pulse from the room. 0:5:50.976 –> 0:5:51.536 Brian Haydin Um… 0:5:53.456 –> 0:6:13.456 Brian Haydin How many of you, you know, have read about Agents? How many of you have actually built an Agents? How many of you have actually shipped an agent? And how many of you just, you’re kind of like the users and been, you know, gone through it? This will help me kind of like ground maybe some of the conversations. 0:6:14.96 –> 0:6:33.536 Brian Haydin So I’m seeing a little bit of shift here. We’ve got some responses. People have built some agents. That’s awesome. I’m really happy to see that. But, you know, honestly, this isn’t really too much of A surprise to me. Most people right now are working in the space that you guys are. 0:6:33.856 –> 0:6:44.16 Brian Haydin I’ve either read a little bit about Agents, I started experimenting with them, um, I’ve, you know, maybe built it, but I haven’t really gotten to a point where I’ve shipped it, uh, or, you know… 0:6:45.296 –> 0:7:4.496 Brian Haydin I don’t even see a single response for paging it. Usually you see some. So, but hey, at the end of the day, what the data shows is that we’re in this real awkward kind of teenage phase of agent adoption right now. Everybody’s been excited, the demos have been going really incredibly well, and the pilots are really starting to show promise. 0:7:4.936 –> 0:7:24.416 Brian Haydin And then about half of those projects just quietly stall out someplace between a POC and the production. 
And if you’re looking for the answer why, if you ask enterprise teams what’s keeping them up at night, about 51% of them are going to tell you that it’s managing and monitoring these things at scale. They just don’t trust them yet. And honestly, they’re probably right. 0:7:25.56 –> 0:7:44.816 Brian Haydin Gartner thinks that more than 40% of agentic AI projects are going to get canceled by 2027. And at the same time, they’re predicting around 33% of enterprise software is going to include agentic AI by 2028. And both of those statistics, they can be true if the projects that survive are the ones that figure out how to run safely in production. 0:7:45.696 –> 0:8:5.136 Brian Haydin And here’s the number that’s on this list that should get your attention. 11% of organizations, they have actual agentic AI running in production right now. Not experimenting, not piloting, just actually running. And so the message here is that you’re not 0:8:5.176 –> 0:8:24.856 Brian Haydin really behind. It’s not like 100% of companies, 90%, 80% of companies have already figured this out. It’s not too late for you to start thinking about doing this the right way so that you’re not one of those statistics of 40% of those agent projects being canceled by the end of next year. Let’s talk about the problem though. 0:8:24.976 –> 0:8:44.496 Brian Haydin The problem isn’t that teams are lacking ambition. The problem is that they don’t trust these things to behave consistently. Let’s look at this funnel for a second. I started with a POC, and the POC went great. And I graduated it up over to the pilot, and the pilot shows that it’s delivering value. And then I hit this wall 0:8:44.576 –> 0:9:5.56 Brian Haydin that I’ve started to characterize as a trust gap. I want to understand like, you know, the three questions that are killing the momentum at this point. Is it going to behave consistently?
Unlike traditional software, correct is often a little bit squishier when you start talking about agents. The same prompt might produce a different answer. 0:9:5.456 –> 0:9:27.696 Brian Haydin and all of it is looking plausible. Another question is what happens when you change something? If I swap in a new model, maybe I’m tweaking a prompt now with like tools usage, I start to add a tool. Suddenly the agent might start behaving a little bit differently. And when I go back to talk to my team about it, people are just shrugging their shoulders and saying, I can’t really, I can’t explain why it’s happening. 0:9:27.736 –> 0:9:48.16 Brian Haydin And that leads me to the last part, which is drift. And drift is a little bit weird for people to understand, a little harder for people to understand, because at the end of the day, what it means is that the final answer is looking correct, but the path that it took to get to that answer was complete nonsense. And unless you can answer all three of these questions with something better than I guess, 0:9:48.416 –> 0:10:7.656 Brian Haydin or we’ll keep an eye on it, it’s going to cause this trust gap that I’ve been talking about. So human oversight, you know, human oversight is where we kind of live in that trust gap. And it’s not really a failure state, but what it is, you know, what it’s turning into be is it’s… 0:10:8.336 –> 0:10:19.256 Brian Haydin an expensive like operating model when you have to have people still continue to touch this stuff. And so AgentOps is kind of what we’ve come up with to help close some of that gap. So. 0:10:21.576 –> 0:10:42.616 Brian Haydin Why can’t we just test our way out of this? Traditional software, we use things like unit tests, and that fails pretty much like a machine would. 
It’s deterministic, it’s repeatable, a good, you know, I don’t know if anybody’s like an old C or C coder like me, but 0:10:43.96 –> 0:11:3.496 Brian Haydin you know, you look at a null pointer, you find a null pointer, you fix the null pointer, you move on, right? Pretty easy, you know, pretty easy to solve for problems like that. But agents, they fail a little bit more, like, confidently, like a junior employee would. They’re fast, they’re eager, and they’re just wrong enough to make expensive mistakes. 0:11:3.656 –> 0:11:23.616 Brian Haydin So what do we, you know, what does this look like in practice? Drift after a model or prompt change. When I swap in a new model version and suddenly the agent, you know, starts to interpret ambiguous results, you know, requests a little bit differently, that’s something I want to look at. I’m looking at silent regressions. 0:11:23.736 –> 0:11:43.416 Brian Haydin The answers might look fine, but underneath it, what’s changed? What wrong tool selection happened in a multi-step workflow? Did the agent pick the wrong tool early? Did it skip it? Like what actually happened to make this turn into a bad decision? And you know, one of the things that like is my personal favorite, a polished, 0:11:43.736 –> 0:12:4.776 Brian Haydin confident final answer that completely masks the fact that the intermediate reasoning was complete nonsense. And here’s what feels especially relevant in 2026. Prompt version changes are one of the biggest sources of silent behavior change. And honestly, it’s not just the prompt changes, but it’s the ones that you don’t even know about. System 0:12:4.816 –> 0:12:27.256 Brian Haydin prompts that exist over system prompts. And so how do I create policy and guardrails that, you know, can maintain and detect some of these changes?
What I’m going to introduce next is this, or what I’m going to introduce a little bit later is this concept of evals that you may have heard about. And that’s going to give you the visibility into what agents are actually doing and not just what they say. 0:12:27.896 –> 0:12:46.616 Brian Haydin But let’s take a step back and just define some of the principles of AgentOps. At its core, it’s really just four things. I’m going to observe, evaluate, gate, and control agents, you know, agents that I build. On the observe side, this is actually telling me what’s happened. This is the play-by-play, like, 0:12:46.696 –> 0:13:5.96 Brian Haydin the actual full play-by-play of the agent. I’m looking at traces, I’m looking at exactly what inputs came in, what outputs were put out, which tools got called. I’m measuring things like latency and token consumption. But when I move to evaluate, this tells me whether what happened was actually a good thing. 0:13:5.576 –> 0:13:24.296 Brian Haydin And here I use things like golden prompts and evaluation data sets. I use scoring strategies and regression checks. And this is where I’m going to start to code into my application what the expectations are. And then I have the gate. This decides whether an agent’s ready to ship or… 0:13:24.456 –> 0:13:43.656 Brian Haydin you know, ship to production or not. Did it pass all the quality thresholds that I set up? Did it regress in any of the critical test cases that I set up as part of my gate? And if it did, and I can state that those things happened, did I not deploy this or did I let it fall through 0:13:43.856 –> 0:14:4.616 Brian Haydin the CI/CD gap and start failing in production? The last concept that I have is talking about control modes and approval modes and tool permissions. I have to have the ability to disable some specific functionality or capability, or even better, just stop the whole agent when it starts to behave, you know,
0:14:4.736 –> 0:14:18.376 Brian Haydin when it starts to behave in a way that’s unexpected. And so those four things kind of create this central loop around an agent life cycle. And that’s what this loop is meant to do, is to build the trust to get you over that trust gap. So. 0:14:19.656 –> 0:14:39.936 Brian Haydin A lot of my conversations are focused around the Microsoft sphere, but I will tell you that this is all across the multiple platforms. This is what people are talking about. In fact, the demos that I do when I do this live are using OpenTelemetry, and that’s part that’s 0:14:40.56 –> 0:15:0.456 Brian Haydin open source framework that anybody can be used, they can be using. Foundry uses it, the agent framework uses it natively, and so are, you know, all the other ecosystems that you might be developing in it. So while I’m going to talk more intelligently about the Azure ecosystem, these concepts absolutely apply no matter where you’re building and deploying your agents. 0:15:1.96 –> 0:15:20.136 Brian Haydin So let’s talk about what’s happening when a user asks, you know, a question of an agent and the workflow that it goes through. An agent run isn’t really just a single model call. It’s actually a small workflow that’s wearing like some sort of a chatbot costume, right? A user sends input. 0:15:20.376 –> 0:15:44.496 Brian Haydin and an orchestration layer decides what to do with it, maybe does some planning, maybe breaking the request in some smaller small subtasks, makes additional model calls. That model might decide that it needs to access a tool, search some docs, you know, query your database. Maybe it pulls, you know, memory out of a vector search or other memory and it synthesizes everything into a final answer. 0:15:45.16 –> 0:16:3.816 Brian Haydin And then, and this is what keeps people up at night. Maybe that answer triggers some sort of a downstream action. 
Could be sending an email, updating a record, routing a ticket, and you know, in kind of the worst case scenario, deleting a production environment. So every one of those steps, if you look, you know, on the screen, you get that red. 0:16:3.896 –> 0:16:22.456 Brian Haydin you know, kind of lightning bolt, that’s a failure point. And if you’re only evaluating at the final answer, you’re completely flying blind in the middle. What we want to make sure is that like we’re measuring each one of those steps. And that’s why traces matter. And that’s why evals are needed to check more than just did the output get where it needed to go. 0:16:22.856 –> 0:16:41.896 Brian Haydin I’m going to switch over here to the Q&A. My favorite agent evaluation frameworks. So there’s a great eval framework. Reach out to me, Kyle, afterwards, and I can share some of it. But it’s native inside of the agent framework. Microsoft has had some really great updates on that. 0:16:42.536 –> 0:16:43.176 Brian Haydin Um… 0:16:44.856 –> 0:16:53.896 Brian Haydin So, all right, getting back to the next slide. Let’s take a look at the AgentOps stacks. So, you know, 0:16:57.896 –> 0:17:16.856 Brian Haydin We’ve got, we’ve got a bunch of different tools that we can use in AgentOps. I mentioned a couple of them, you know, right out of the gate. Microsoft Agent Framework, not entirely new. This is kind of a synthesis between a couple of different, you know, a couple of different 0:17:16.936 –> 0:17:36.776 Brian Haydin tools. So we’ve got like Semantic Kernel, we’ve got Autogen, and we brought that together into a more cohesive agent framework. I mentioned already that I’m using OpenTelemetry, you know, for a lot of my projects. And that’s going to map into, you know, app insights and, you know, and there’s new blades in the Azure. 0:17:37.176 –> 0:18:0.376 Brian Haydin that allow you to like really dive deep into it. So we’ve also got like some really great features. 
This last time I did this talk down in Chicago, I walked through AI Foundry, how we can automatically generate golden prompts, we can automatically generate tests, we’ve got the ability to look at the telemetry right inside of 0:18:0.856 –> 0:18:20.296 Brian Haydin Azure Foundry. And then I can hook all that up into my GitHub Actions so that I get safe deployment, you know, through Azure DevOps or my GitHub Actions. So, you know, these are kind of the things that we’re using on a regular basis. And 0:18:20.336 –> 0:18:40.376 Brian Haydin a lot of it, as you can see at the stickers, is new stuff that’s coming out in 2026. So specifically, I want to like dive a little bit deeper into the Microsoft Foundry. Half of what we’re talking about, you know, for this time is stuff that got real, real in the last six months. 0:18:40.896 –> 0:19:0.856 Brian Haydin And so let’s just take a little bit of time to calibrate on some of that. The big one right here is in the middle. And this is, what is it, a month ago, April 3rd this year. Agent Framework hit general release. It’s now production ready and it’s available for .NET and Python, 0:19:1.176 –> 0:19:21.96 Brian Haydin all in kind of a single SDK, and they have, with it being in GA now, it’s got the long-term support commitment from day one. If you’ve been waiting to build things on Semantic Kernel and AutoGen, I’ve been talking about these things in some of my other talks and webinars. Like, you don’t have to like… 0:19:21.136 –> 0:19:42.696 Brian Haydin decide now, this is like which one should I be using? The right framework to be using is the Agent Framework. And around that, the platform has been pretty busy as well. So December was kind of foundation laying. Foundry MCP servers, you know, were in preview. The OpenTelemetry GenAI semantic conventions 0:19:42.856 –> 0:20:3.416 Brian Haydin started to become standardized.
March is when tracing went to like general availability. And I think that’s the one that’s going to matter a little bit like the most, I would say, for what we’re talking about today is the telemetry aspect of it. April, we had some new guardrails in preview, task adherence and prompt optimizations. 0:20:3.816 –> 0:20:21.816 Brian Haydin I would, you know, take a look at both of them. They’re worth it. And then not to what we’re looking at like three or four days ago, Agent 365 hit general availability. So Agent 365, I’m like, I’ve done a lot of reading about it, haven’t been able to do a lot of experimenting with it. 0:20:22.456 –> 0:20:38.536 Brian Haydin But that’s the M365 layer that like the governance story for organizations is really going to hit. It’s going to cause some changes in some of the licensing models and usage stuff. But those are some of the hot topics from the last couple of weeks. 0:20:39.696 –> 0:20:40.136 Brian Haydin So… 0:20:42.216 –> 0:21:0.696 Brian Haydin We’re almost ready to get into the meat of this, but you know, but before I do, I want to give you somewhat of a lay of the land what’s happening in 2026 and what people are talking about. And this is the stuff that’s actually starting to stick and I’m seeing teams start to build pretty consistently with. 0:21:1.656 –> 0:21:22.616 Brian Haydin First, I mentioned it, like OpenTelemetry is becoming the plumbing. Agent Framework now explicitly integrates with it, and OpenAI has been pushing teams towards using it as well. So if you’re building something in a pro code environment, I would say like, just start using the OpenTelemetry framework. It’s, you know, what everybody’s going to be using, 0:21:23.336 –> 0:21:41.776 Brian Haydin and the portability that you’re going to get out of it, you’ll thank yourself later. Second, I’ve been using trace-driven evals.
So what I’m talking about is moving past the question of does the final answer look good, and to, did the agent actually make 0:21:41.896 –> 0:22:0.536 Brian Haydin good decisions at every step along the way. And the industry right now is kind of converging on evaluating workflows with this trace level detail and not looking at just the final answers, but actually all the steps. Production traces, what I’m talking about with the production traces and the eval data sets 0:22:0.896 –> 0:22:20.216 Brian Haydin is that your best test cases really come from the production scars. If you think back to that Friday night deployment problem, what did I do to make sure that that didn’t happen again? We’re going to come back to this a little bit when we talk about feedback loops, but your best test cases are absolutely going to come from things that failed in production. 0:22:21.176 –> 0:22:33.96 Brian Haydin Versioning. So I’ve got some things that I want to talk about around like what do you want to capture when things are versioned or what are some of the things that get versioned? Prompts, tools, 0:22:34.696 –> 0:22:57.496 Brian Haydin internal policies. You know, this is one that I added, you know, for, I really leaned into this talk because if you can’t really diff it at this point, your CI/CD stuff’s not going to be able to pick up on it and you can’t really debug it. The last one is like making sure that we’ve got the CI/CD quality gates combined with everything that we’re doing here. 0:22:57.976 –> 0:23:17.176 Brian Haydin And so we can start to detect, my demos show, we can detect that in a CI/CD gate. So check out one of my live versions of this and we can walk through some of the demos. But you’ll see how this actually gets picked up in the CI/CD gate when I have a build failure. I mentioned 0:23:17.816 –> 0:23:39.176 Brian Haydin evals, and I’m not sure that everybody understands what an eval is.
So I’ve got this slide to talk about the components that go into it. At a high level, it’s like the unit test for probabilistic problems like generative AI. And every eval, in my opinion, needs to have at least these six core points. 0:23:39.576 –> 0:23:58.696 Brian Haydin First, what is the scenario, the situation that we’re testing for? What’s the real-world setup that actually makes this a meaningful eval? Next would be the input. This could be the prompts, it could be the context or the tool state, 0:23:59.176 –> 0:24:18.136 Brian Haydin and what are the conditions that happened around that run? And I would say it’s not just the input from the standpoint of what question somebody typed in; if I’m making tool calls, I need to collect the inputs to those tool calls as well. What is the expected behavior? And this is really the 0:24:18.216 –> 0:24:37.736 Brian Haydin critical one. It’s not just what’s the right answer. It’s which tool should be called? How should it have been escalated? Should this have been a refused request? And what shape was the output supposed to take? Is it a question? Is it JSON? Is it a recommendation versus a classification? 0:24:38.56 –> 0:24:58.376 Brian Haydin These are all different ways that you need to think about that expected behavior. I talk about the assertion type pretty regularly. There are a lot of different ways that I’m going to judge whether the answer was good or not. It might be rules-based. It might be some sort of semantic similarity. 0:24:58.696 –> 0:25:17.536 Brian Haydin I could be using another LLM to judge it. And finally, let’s not discount the fact that some of these just need to have a human reviewing them at some point. So different scenarios are going to have different scoring methods. 
And that leads you to the measuring part. 0:25:17.616 –> 0:25:36.536 Brian Haydin What is the pass threshold? Is this a hard 100% rule? Is it a pass/fail? Is there a max latency that I need to have in there, or some sort of cost constraint that I want to measure against? What’s the similarity score supposed to be? How close does it need to be? Especially when I’m using RAG, that comes into play. 0:25:37.96 –> 0:25:57.336 Brian Haydin And then finally, what’s the impact to the business? What is the risk level that I need to account for? Because not all test cases are going to be equal. I need to have some sort of a judgment aspect to that. If I route this to the wrong place, it’s kind of annoying, but it’s going to get routed back to the right queue. 0:25:57.656 –> 0:26:18.616 Brian Haydin But if I mishandle a safety scenario, I could have data leak out of my data estate, and that could be catastrophic for the organization. And I’ll just repeat this: a good eval has all of these. A great eval is one that’s based on a real production failure. So 0:26:19.256 –> 0:26:33.576 Brian Haydin every time something happens that comes back to the developers or creates a support ticket, ask yourself, is this an opportunity for me to build an eval around this that I can use to capture the results? 0:26:35.496 –> 0:26:42.376 Brian Haydin So we’ve talked about what evals are and what the good ones are, but here’s the thing. 0:26:44.536 –> 0:27:3.496 Brian Haydin None of this can happen if you can’t actually answer one basic question: what changed between the version that you’re working on today and the one that you had deployed yesterday? So this is the versioning slide that I was talking about before, and it’s something that most teams haven’t really thought about yet. 
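The six eval components described above (scenario, input, expected behavior, assertion type, pass threshold, and business risk) can be written down as a small data structure. What follows is only a minimal sketch under stated assumptions: the names `EvalCase`, `AssertionType`, `Risk`, `rule_score`, and `passes`, the tool name `route_ticket`, and the ticket example are all hypothetical and do not come from the talk or from any particular framework. Only the rules-based scorer is sketched; semantic similarity, LLM-as-judge, and human review would plug in where the `NotImplementedError` is raised.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class AssertionType(Enum):
    RULE = "rule"
    SEMANTIC_SIMILARITY = "semantic_similarity"
    LLM_JUDGE = "llm_judge"
    HUMAN_REVIEW = "human_review"


class Risk(Enum):
    LOW = "low"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class EvalCase:
    scenario: str                 # real-world situation being tested
    inputs: Dict[str, Any]        # prompt, context, tool state, tool-call inputs
    expected: Dict[str, Any]      # expected behavior, not only the final answer
    assertion: AssertionType      # how the result is judged
    pass_threshold: float         # minimum score needed to pass
    risk: Risk                    # business impact if this case fails
    max_latency_s: Optional[float] = None


def rule_score(expected: Dict[str, Any], observed: Dict[str, Any]) -> float:
    """Rules-based scorer: fraction of expected fields matched exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if observed.get(key) == value)
    return hits / len(expected)


def passes(case: EvalCase, observed: Dict[str, Any], latency_s: float) -> bool:
    """Apply the case's threshold and optional latency budget."""
    if case.assertion is not AssertionType.RULE:
        raise NotImplementedError("only the rules-based scorer is sketched here")
    within_latency = case.max_latency_s is None or latency_s <= case.max_latency_s
    return rule_score(case.expected, observed) >= case.pass_threshold and within_latency


# Hypothetical case built from a production failure (a misrouted ticket).
misrouted_ticket = EvalCase(
    scenario="Billing dispute that was misrouted in production",
    inputs={"message": "I was charged twice this month"},
    expected={"tool": "route_ticket", "queue": "billing"},
    assertion=AssertionType.RULE,
    pass_threshold=1.0,           # hard 100% rule
    risk=Risk.HIGH,
    max_latency_s=10.0,
)

print(passes(misrouted_ticket, {"tool": "route_ticket", "queue": "billing"}, 2.5))  # True
print(passes(misrouted_ticket, {"tool": "route_ticket", "queue": "sales"}, 2.5))    # False
```

Keeping the risk level on the case itself makes it possible to treat a failed critical case differently from a failed low-risk one when deciding whether a release is blocked.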
So 0:27:3.576 –> 0:27:22.136 Brian Haydin I want you to not skip this, so you don’t regret it later. Ask yourself what needs to be versioned. For me, obviously, the first one is the model version that we deployed, and it’s not just the model number, 0:27:22.616 –> 0:27:43.96 Brian Haydin because the same model family today is not necessarily the same model family it was two months ago. Today’s 4o or 4o mini isn’t last month’s 4o mini. And the reason behind that is that the providers have their own system prompts that sit under the hood, and those change. You can detect those changes, but you have to look for them 0:27:43.456 –> 0:28:4.856 Brian Haydin and understand them so that you’re pinning and tracking those as you go along. We also want to track the system prompts, right? Prompts and instructions are the number one source of behavioral changes. I actually had this one customer that we built an agent for; it was a chatbot that they were using. 0:28:5.256 –> 0:28:23.736 Brian Haydin And as they were making tweaks to the process and the things they were testing with it, they were like, hey, can you give me the ability to just go and edit the system prompts so I don’t have to call you every time? And I’m like, absolutely not. That’s a terrible idea. 0:28:24.136 –> 0:28:47.656 Brian Haydin Because if you change the word “and” to “or”, or if you change it from “the” to “and”, just little tiny contextual changes, you’re changing that prompt. And I need to have an understanding of what it was and what changed at that point in time. And then tool contracts. We’re using MCP quite a bit. I’m sure that a lot of people who are building agents are using MCP all over the place. 0:28:48.16 –> 0:29:8.616 Brian Haydin Those are tools and those tools have contracts. 
So when those contracts change, you need to understand that’s something that’s going to change the behavior as well. What evaluator version are you using? I mentioned one of these possibilities before, which is using LLMs to actually judge and score your evals. 0:29:9.96 –> 0:29:28.96 Brian Haydin It’s a really common practice, and you’ll see in some of my demos in Azure Foundry that when you run these evals, you actually select the model for the evaluation. That’s going to be a source of behavior change on the eval side of it. So make sure that you’re tracking that as well. 0:29:28.776 –> 0:29:47.896 Brian Haydin Policies and guardrails. Internally or otherwise, there might be compliance changes or content policy changes. Maybe there’s a safety classifier, or some sort of Purview classification or labeling that’s happening. 0:29:48.296 –> 0:30:7.896 Brian Haydin These are all things you might not have thought about, but you can actually grab, trace, and version these pieces of information as well. Indexing and embeddings. I know this might be a little bit foreign to some people if they don’t work with RAG a lot, but 0:30:8.376 –> 0:30:27.96 Brian Haydin I’ve got a really good story about this one. We’ve got this manufacturing customer that makes consumer goods, and they have three main personas of people who would actually look at installation manuals or user guides. The end consumer, obviously, is one. 0:30:28.216 –> 0:30:48.136 Brian Haydin Contractors or suppliers are another, and maybe architects are another type of persona. And we started with just one. We started with the contractors. They’re the ones that we care about the most. If we get it wrong, they’re the ones that are going to notice. So we wanted to start with that persona. 
0:30:48.176 –> 0:31:7.336 Brian Haydin That group of documents. But as we started to layer the consumer-grade documents into our RAG pipelines, the behavior started to change, and once we started to look at it, it was because the information in there was inconsistent. So versioning the data that you’re feeding into your vector indexes 0:31:7.696 –> 0:31:22.856 Brian Haydin is just as important as the system prompt, because it’s going to change the outputs very dramatically when you start having conflicting data. The agent doesn’t know which one is the most important or the right one, and you have to learn how to adjust your prompts accordingly. 0:31:24.136 –> 0:31:24.616 Brian Haydin So… 0:31:26.216 –> 0:31:45.616 Brian Haydin We built our evals, we versioned things, and now we have also demonstrated that we can trace the decisions that are being made inside of our agents. What’s the next question? How do we actually get this thing in front of real users without losing sleep? 0:31:46.56 –> 0:32:4.776 Brian Haydin So the answer is that you’re going to climb through certain steps. Most of the teams that I work with have two modes, demo and production. That’s it. They go from A to Z instantly, and they have nothing in between. This slide is really about how you build the things in between. I would start with 0:32:4.816 –> 0:32:23.656 Brian Haydin dev and local testing. Like, hey, it works on my machine. That’s great. Congratulations. That was the easy part. But let’s start moving this into an offline eval gate. Can you run the golden data sets? Does the agent pass? If it doesn’t, 0:32:23.736 –> 0:32:43.896 Brian Haydin we’re not going to go up to the next step. That next step is a shadow mode, and it’s one that I like a lot. This is one that most teams have been skipping and haven’t really thought about, and it can save you a lot of time and effort. 
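To make the versioning inventory from earlier concrete (model, system prompt, tool contracts, evaluator model, policies, and index data), here is a minimal sketch of a release manifest that can be diffed between two deployments. Everything in it is an assumption for illustration: `AgentManifest`, `fingerprint`, `diff_manifests`, and all the version strings are hypothetical and are not taken from the talk or from any specific product.

```python
import hashlib
from dataclasses import asdict, dataclass, replace
from typing import Dict, Tuple


def fingerprint(text: str) -> str:
    """Short, stable hash so a one-word prompt edit shows up as a change."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class AgentManifest:
    model_snapshot: str        # pinned model version, not just the family name
    system_prompt_hash: str    # hash of the full prompt text
    tool_contract_hash: str    # hash of the tool schemas the agent can call
    evaluator_model: str       # model used as the LLM judge
    policy_version: str        # guardrail / content policy revision
    index_version: str         # snapshot of the documents behind the vector index


def diff_manifests(old: AgentManifest, new: AgentManifest) -> Dict[str, Tuple[str, str]]:
    """Return only the fields that differ, as (old, new) pairs."""
    before, after = asdict(old), asdict(new)
    return {key: (before[key], after[key]) for key in before if before[key] != after[key]}


# Hypothetical values throughout.
yesterday = AgentManifest(
    model_snapshot="example-model-2026-01-15",
    system_prompt_hash=fingerprint("Escalate if the ticket is urgent and unpaid."),
    tool_contract_hash=fingerprint('{"route_ticket": {"queue": "string"}}'),
    evaluator_model="example-judge-v1",
    policy_version="policy-7",
    index_version="contractor-docs-v3",
)

# Changing a single word ("and" to "or") produces a different prompt hash.
today = replace(
    yesterday,
    system_prompt_hash=fingerprint("Escalate if the ticket is urgent or unpaid."),
)

print(sorted(diff_manifests(yesterday, today)))  # ['system_prompt_hash']
```

Storing a manifest like this with every deployment is one way to answer the question of what changed between yesterday and today without relying on anyone's memory.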
If you have an agent that’s running alongside your production system, it’s seeing the inputs, it’s generating real outputs 0:32:44.16 –> 0:33:3.216 Brian Haydin someplace, not to the user yet, but it gives you the ability to evaluate the real production scenarios. Nothing’s actually getting executed. You’re just watching all the instruments before you hand over the wheel to an agent. But when you’re ready to hand over that wheel, then you move into a canary mode. 0:33:3.656 –> 0:33:24.456 Brian Haydin So maybe what I’m doing is just taking a small subset of that traffic, one out of every 20, that I’m going to route to the agent, have it perform as if it was running autonomously, and I can look at those traffic signals. If it does make a mistake, the blast radius of those mistakes is really small, 0:33:25.416 –> 0:33:46.56 Brian Haydin and I have a much easier time controlling that rollback. But say I’ve done that canary test and things are working really well. Maybe as part of my canary test, I’ve got a human review mode that’s still there. The first iteration of an agent that we deploy here at Concurrency typically has a human-in-the-loop mode first. 0:33:46.456 –> 0:34:9.576 Brian Haydin A human approves every single one of the actions before we actually pass that off to a tool for automation. And it’s expensive still, right? We’ve built all this automation. We’ve got all the things checked, but I’m still having the person do the work. Why? It’s how we get past that trust gap. It’s how we build confidence. We demonstrate to the users that it’s doing 0:34:9.856 –> 0:34:29.336 Brian Haydin exactly what they would have done if they were in that situation. Then I can scope out a production deployment. What specific use cases, what specific user groups do I want to roll this out to? 
Not the full, broad production, but somewhere I’m able to measure, and expand that envelope gradually. 0:34:30.536 –> 0:34:51.416 Brian Haydin Finally, once I have that confidence, I can route it the full traffic. I can have full autonomy. But I have to have monitoring built in at this point, because at some point something’s going to change and it’s going to start misbehaving. We have all seen this before. 0:34:51.896 –> 0:35:11.976 Brian Haydin And so that’s where I want to make sure that we really have that end-to-end definition. What is my rollback trigger? How do I know when and what to roll back? What score drop triggers it? What error rate? What safety flag? Get agreement on that before you actually get to this point, because 0:35:12.216 –> 0:35:30.976 Brian Haydin Friday at 5:30, when everything’s blown up, is not the right time to be having a meeting about what qualifies as enough to roll it back. The other thing that I wanted to call out is that every stage in this ladder is a point where you can roll back, to the beginning or just back 0:35:31.56 –> 0:35:41.976 Brian Haydin one more step. Have that built out as a strategy before you start climbing so that you’re not just hoping along the way that it’s all going to work out. And then. 0:35:43.976 –> 0:35:51.496 Brian Haydin So I want to talk about this intervention ladder as well. Ladders are great analogies, but 0:35:52.856 –> 0:36:12.216 Brian Haydin let’s say that things start to fail in production. How are we going to figure out what we’re going to do and how we’re going to control and restrain an agent? This applies in a lot of scenarios. So here’s the way that I look at it. First, level one: something’s looking weird. I’ve got some non-critical evals that are alerting me 
0:36:12.696 –> 0:36:34.56 Brian Haydin that something’s amiss. I’m getting notifications, the agent keeps running, and I’m going to start keeping an eye on it. But at some point, I’ve noticed a little clustering pattern in some of the problems, and I might want to go into a review mode. What I’m saying is that the agent can still make recommendations, 0:36:34.336 –> 0:36:54.296 Brian Haydin but at this point I’ve escalated it to a human to approve any of the actions before it executes. I’m not asking the human to make the decisions, just saying, hey, do you approve this or not? This is great for customer communications or financial transactions, 0:36:54.936 –> 0:37:16.936 Brian Haydin things that could be kind of sensitive, where you want to have a human review. The next level up might be that I’ve noticed it’s having these specific problems. Maybe the agent can still answer some of the questions automatically and give some of the feedback to the user. But what I don’t want it to do is write that record to my ERP system. I don’t want it to close out that invoice. 0:37:17.296 –> 0:37:36.936 Brian Haydin I don’t want it to process this payment. I’m going to hold back on the tool calls. That’s the restricted mode. I’m just disabling specific capabilities. Level 4 is disabling the capability entirely. I might be turning off a specific agent, or maybe just a specific sub-agent within that agent workflow, 0:37:37.576 –> 0:37:57.656 Brian Haydin because that’s what’s causing the problem, and I want to do a root cause analysis. And finally, make sure you have a nuclear option: full shutdown, stop the agent, revoke access. Right now, the tools that are coming out are allowing you to do this much more effectively. 
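The five intervention levels just described can be sketched as a small gate that every tool call passes through. This is only an illustration under assumptions: `Level`, `decide`, `RESTRICTED_TOOLS`, and the tool names are hypothetical, and the choice to let unrestricted actions run automatically in restricted mode is one reading of the description above, not a rule stated in the talk. Revoking credentials at the shutdown level would happen outside this function.

```python
from enum import IntEnum
from typing import FrozenSet


class Level(IntEnum):
    NORMAL = 0
    ALERT = 1        # notifications only, agent keeps running
    REVIEW = 2       # a human approves every action before it executes
    RESTRICTED = 3   # specific capabilities are switched off
    DISABLED = 4     # the agent or sub-agent is turned off
    SHUTDOWN = 5     # full stop; access is revoked elsewhere


# Hypothetical names for the high-impact tools held back in restricted mode.
RESTRICTED_TOOLS: FrozenSet[str] = frozenset(
    {"write_erp_record", "close_invoice", "process_payment"}
)


def decide(level: Level, tool: str,
           restricted: FrozenSet[str] = RESTRICTED_TOOLS) -> str:
    """Return 'execute', 'needs_approval', or 'blocked' for a requested tool call."""
    if level >= Level.DISABLED:
        return "blocked"
    if level == Level.RESTRICTED:
        return "blocked" if tool in restricted else "execute"
    if level == Level.REVIEW:
        return "needs_approval"
    return "execute"  # NORMAL and ALERT: the agent runs, humans just watch


print(decide(Level.ALERT, "process_payment"))       # execute
print(decide(Level.REVIEW, "process_payment"))      # needs_approval
print(decide(Level.RESTRICTED, "process_payment"))  # blocked
print(decide(Level.RESTRICTED, "answer_question"))  # execute
print(decide(Level.SHUTDOWN, "answer_question"))    # blocked
```

Routing every tool call through one function like this means moving between levels is a configuration change rather than a code change, which matters when the decision has to be made quickly.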
In Foundry, for example, 0:37:58.56 –> 0:38:19.216 Brian Haydin the control plane actually gives you a centralized life cycle. You’re able to see what agents you have in inventory. You can monitor their health. You can perform actions, all from that control plane. There are other ways that you can do this inside of Microsoft’s Agent 365. You start to assign an actual identity to the agent, 0:38:19.536 –> 0:38:36.776 Brian Haydin so you can revoke access just like you would when letting an employee go and terminating their access. But the key is you need to have all of these controls designed and thought through before you start deploying your agent into production and before you actually need to use them. 0:38:38.536 –> 0:38:57.736 Brian Haydin So, getting to the end here a little bit. If you have any other questions, feel free to drop some of them in the chat. But what are some of the things you actually want to watch when your agent goes live? I would say if your only metric is did you get the right answer, you’re really bringing a pool noodle to a marlin fight. 0:38:58.216 –> 0:39:18.936 Brian Haydin So what do I actually mean? Latency. How long is this agent taking? Are the tool calls slow? Are the models slow? Are you hitting the rate limits? What’s happening behind the scenes? And having latency isn’t necessarily bad. One of the examples that I can give is we were having some challenges with 0:39:19.416 –> 0:39:38.296 Brian Haydin how well the agent was reasoning through the various amounts of data that it had access to. And so we changed from GPT 5.3 to 5.4, and instead of latency being 7 to 8 seconds, it started turning into 20,
0:39:38.336 –> 0:39:57.976 Brian Haydin 30 seconds, which really disrupted the user experience. But we were also measuring things like how the evaluator was scoring this, and we saw a really huge uptick. So we rolled back temporarily, but we noticed that 0:39:58.136 –> 0:40:17.416 Brian Haydin those evaluator averages went back down, and we made the conscious decision to go back to the 5.4 model because the trade-off was actually worth it. We got better answers. We just fixed what the user saw from a feedback mechanism so that they didn’t feel like they were waiting forever. 0:40:17.976 –> 0:40:36.456 Brian Haydin But we got the right output. What about quality scores? We talked about that. Safety flags, that’s a big one. Make sure that you’re putting some of these things in place. But not all of these are going to apply to you and your organization. 0:40:37.416 –> 0:40:51.496 Brian Haydin What I would do is pick three or four of these that matter and start watching them, not boil the ocean and say, I’m going to capture everything and build a dashboard that has 100 different metrics that I have to score. So. 0:40:53.416 –> 0:41:13.576 Brian Haydin Bring this home. What are you going to do tomorrow, or later today if you’re really ambitious? I would say, here’s your starter kit. Think about this as a pre-departure checklist. First one, add tracing. Go back, implement OpenTelemetry if you can. 0:41:13.936 –> 0:41:37.176 Brian Haydin It doesn’t have to be fancy. Just wire up the harness and start capturing something. That’s going to give you a little bit of a flavor of how to apply this in your projects going forward. Capture the tool calls, specifically the inputs and outputs. 
Every parameter that you can, start to incorporate that into your evals. And then 0:41:37.336 –> 0:41:57.376 Brian Haydin I would define somewhere around 5 to 10 golden prompts and expected behaviors. Don’t boil the ocean. Don’t say, I’m going to implement 100 different evals. You’re going to get stuck. And definitely don’t say, well, we’re just going to start with one. Go back to your product team and ask them what are 5 to 10 0:41:57.456 –> 0:42:17.816 Brian Haydin critical test cases that you need, and use that as your baseline moving forward. And add just one regression check to your CI/CD pipeline. Make sure that you have something tight, so that if it fails, the build’s going to fail. Demonstrate that you can actually stop it going into production, and prove that that pattern is working first. 0:42:18.536 –> 0:42:37.136 Brian Haydin Step 5, I would choose one risky action that requires some sort of a human review. Maybe that is sending an email. Maybe it’s sending an update on the sales order. Maybe it’s sending an invoice out. Whatever scares you the most, put a human in the loop for that at first. And then 0:42:38.136 –> 0:42:56.56 Brian Haydin definitely define that kill switch playbook that I talked about. Document how you’re going to move the agent into a review mode, how you want to disable it, and who has the authority to do it. Don’t wait until things have started to fail to have that mapped out. That needs to start at the beginning. And then finally, monitor the costs. 0:42:56.456 –> 0:43:15.496 Brian Haydin Agents, and I have a lot more to say about this later, not today, but in some of my newsletter posts. The economics are changing. I’m noticing some of the tools that I use on a regular basis have completely changed their token consumption, their cost models. 0:43:16.56 –> 0:43:36.456 Brian Haydin Everybody’s going to start doing it. 
The era of subsidized AI is over, and we’re all going to start using consumption as the metric, and it’s going to get expensive, and it’s going to get expensive fast. And that includes these models that are being hosted as well. So start putting in the token consumption, start measuring some of the costs. 0:43:37.16 –> 0:43:56.136 Brian Haydin This stuff can get out of hand really, really quick. And so these are basically my steps to success. And with that, I will say, if there are any questions, drop them in the chat. I’ll stay on here for a little bit. But if you like this content, 0:43:57.496 –> 0:44:15.656 Brian Haydin follow me on LinkedIn, follow Concurrency, and Amy’s dropped a little bit of a form in here as well. Fill that out. Give me some feedback. Did this resonate with you and your team? Did you want to set up 20 minutes to meet with me? 0:44:15.976 –> 0:44:22.856 Brian Haydin Fill out that form, or if you connect with me on LinkedIn, I’m happy to meet with you. With that being said, thanks for joining me today.
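The single regression check from the starter kit above can be sketched as a function that turns golden-prompt results into a process exit code, since a non-zero exit code is what fails a typical CI step. This is a minimal sketch under assumptions: `gate_exit_code` and the case names are hypothetical, and how the pass/fail results are produced (running the agent against the golden prompts) is left out.

```python
from typing import Dict


def gate_exit_code(results: Dict[str, bool], min_pass_rate: float = 1.0) -> int:
    """Return 0 when the golden set passes, 1 otherwise."""
    if not results:
        # An empty run should never count as a pass.
        print("no golden cases were run")
        return 1
    pass_rate = sum(results.values()) / len(results)
    for name in sorted(name for name, ok in results.items() if not ok):
        print(f"FAILED golden case: {name}")
    print(f"pass rate {pass_rate:.0%} (required {min_pass_rate:.0%})")
    return 0 if pass_rate >= min_pass_rate else 1


# Hypothetical results from a golden-prompt run.
example_results = {
    "billing_dispute_routes_to_billing": True,
    "refund_request_is_escalated": False,
}
print(gate_exit_code(example_results))
```

In a pipeline, a small script would call `sys.exit(gate_exit_code(results))` so that one failed golden case stops the build before it reaches production.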