Lightning Garbage and the Journey to Build Agentic Search in Production

Enterprise search is broken. Search engines are great at giving you what you ask for, but terrible at giving you what you want. One of my favorite customers runs a content management platform called Engineering Unleashed that helps engineering professors create and find educational content for their students. Part content discovery, part social network, part curriculum management, their platform connects people to content and people to people.

But their users had a problem that we’ve all had:

They were overwhelmed with search results.

The search experience was familiar, fast, and brittle. It relied on keyword matching and a labyrinth of more than 20 search criteria for users to hone their queries. In theory, this enabled users to find really specific content. In practice, it flooded users with irrelevant results, really fast.

Lightning garbage.

Their users worked around lightning garbage the way we’ve all learned to when using advertising-funded search engines: skim and click relevant-looking links until you find what you want. Or revise the search using caveman language if the computer is confused by what we mean.

Our customer wanted more for their users.

They needed a search engine that could infer what users want and find the relevant content, even when users aren’t very clear about what exactly they want.

So they called us.

Our Challenge:

  • Make time-to-discovery faster
  • Reduce manual effort
  • Return results based on user intent, not keywords and Boolean flags
  • Can’t be too slow
  • Scale with more users and content
  • Oh, and make it “agentic”

Standard search methods, like semantic indexing and inverted indexes from textbook information retrieval, are well-known and can handle all of this. Except that last requirement. What does it mean to make search agentic?

That’s a good question. I want to share my answer and the lessons my team learned when we built a production-grade agentic search engine to help thousands of users find what they’re looking for.

What We Mean When We Talk About Agentic Behavior

With no one agreeing on what an agent is, I want to get really clear about this topic. Why? Because foggy definitions of what we’re building are the clearest way to light a pile of money on fire. There are some dimensions, like autonomy and goal-seeking, that the industry is converging on to characterize agents. But I think it’s smarter to think about systems in terms of degrees of agentic behavior rather than whether something is an agent or not.

A simple but robust way to describe an agentic system is:

The degree to which a system uses an LLM to decide its control flow.[1]

There are two important observations that have practical implications I want to highlight:

  1. Agentic behavior is a spectrum, and where a system lies on it matters for business stakeholders. This may seem like irrelevant technobabble, but not explicitly calling out how agentic a system needs to be in design and implementation is the difference between a solution that works and one that almost works. More on that later. There are non-agentic systems (think legacy if-then software) and maximally agentic systems. But between those extremes are degrees of agentic systems. If only some of an application’s control flow is decided by an LLM, say, to route user intent, it’s less agentic than a system where most of the control flow is handled by an LLM. Either way, it’s agentic.
  2. Using an LLM is necessary but insufficient to make a system agentic. For example, an application may use a simple, single-shot prompt to do something like generate text used somewhere in the application. But if the LLM is not used to decide control flow, it’s not an agentic application. That distinction has implications for the business (risk, user satisfaction) and for engineering teams (unit testing, guardrails, maintenance), which is why I highlight it.

Here are concrete examples of how the same feature can be agentic or not.
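
As a rough illustration (a minimal, hypothetical sketch, not the customer’s actual code), here is the same “should we ask a clarifying question?” feature written two ways. The helpers ask_clarifying_question and run_search are placeholder stubs, and the prompt is simplified.

```python
# Minimal hypothetical sketch: the same feature, non-agentic vs. agentic.
# ask_clarifying_question() and run_search() are placeholder stubs.
from openai import OpenAI

client = OpenAI()

def ask_clarifying_question(query: str) -> str:
    return f"Can you tell me more about what you mean by '{query}'?"

def run_search(query: str) -> str:
    return f"[search results for: {query}]"

# Non-agentic: a hard-coded rule decides the control flow.
def handle_query_rule_based(query: str) -> str:
    if len(query.split()) < 3:  # fixed heuristic chooses the branch
        return ask_clarifying_question(query)
    return run_search(query)

# Agentic: the LLM decides which branch runs.
def handle_query_agentic(query: str) -> str:
    decision = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Reply with exactly CLARIFY or SEARCH, depending on whether "
                        "the query is specific enough to search on."},
            {"role": "user", "content": query},
        ],
    ).choices[0].message.content.strip()
    if decision == "CLARIFY":  # control flow now depends on the LLM's judgment
        return ask_clarifying_question(query)
    return run_search(query)
```

The first version could still call an LLM elsewhere (say, to summarize results) and remain non-agentic; the second is agentic because the LLM picks the branch.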

It’s worth asking if we even need the concept of agentic. Is it worth our time? What does it help our customer and their users with?

I think, yes, absolutely, it’s a concept worth dwelling on. It’s critical to enterprise product engineering in four ways: designing it, building it, funding it, and setting expectations with users about it.

  1. Designing: the more agentic your system is, the more the architecture will need an orchestration framework and asynchronous calls to manage communication overhead, and the more the design will need the right abstractions to manage complexity.
  2. Building: the more agentic your system is, the less you’ll know about what’s happening inside it, so you’ll want to monitor, evaluate, and phase-gate internal steps.
  3. Funding: the more agentic your system is, the more engineering time that’s needed for this orchestration, abstraction, and observability work.
  4. Expectations: the more agentic your system is, the more you’ll need to set expectations with users about its capabilities and limitations.

Let’s get practical. How does this fit with what our customer wanted? Let’s walk through agentic search, why it makes things better, and how that helps the business. In their vision of agentic search, they want the AI to:

  1. Analyze a user query: “hands on curriculum for 2nd year fluid mechanics students”
  2. Understand user intent: “I want to connect with other professors who have experience with fluid mechanics labs, so I can gather common challenges before designing my own”
  3. And respond with either clarifying questions: “Are you interested in examples of curricula to design your own, or in learning from others who have?”
  4. Or with relevant search results: people or content
  5. And provide a brief explanation of why each item it found is relevant to the user’s query, adapting its future responses to thumbs-up/down feedback on the prior results.

These requirements make the search agentic because what the app does next depends on the LLM inferring user intent. We knew this capability would require new tooling that plays nicely with the existing architecture, or an architecture redesign would be in order.
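
One way to make that intent-driven branching explicit (a sketch with illustrative field names, not the exact schema we shipped) is to have the LLM fill in a structured decision object that the application then branches on:

```python
# Illustrative decision schema; field names are examples, not the production schema.
from typing import Literal, Optional
from pydantic import BaseModel, Field

class SearchDecision(BaseModel):
    inferred_intent: str = Field(description="What the user is actually trying to accomplish")
    action: Literal["clarify", "return_results"]
    clarifying_question: Optional[str] = None                      # set when action == "clarify"
    result_explanations: list[str] = Field(default_factory=list)   # one short note per returned item
```

The application still owns what each action value does; the LLM only chooses which one applies, which is exactly the control-flow decision that makes the feature agentic.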

The Legacy Architecture

The legacy system was a Blazor app written in C# that used Azure Search as its core search service. It was designed to handle keyword-based searches and was connected to several stateful content types that could be returned to users: Cards, Users, Groups, Topics, and Articles. The content was stored across Azure Blob Storage and Azure SQL Database.

In this legacy app, an Azure Function ran every 15 minutes to collect and index the latest content from various storage locations into Azure Search. Here’s how it worked:

Abstracting away the details, it’s a standard search engine pattern: a user enters a query, optionally using advanced search filters (over 20 options) to refine their search by attributes like entity type, domain, and institution. When they submit the query, the Blazor application forwards it to Azure Search, where the system uses the previously indexed data to find matching content. The search results are then ranked using the system’s token analyzer, which attempts to prioritize the top 20 most relevant items. But it was never fully optimized, so the results were often less relevant than what the user wanted.
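
For reference, the legacy retrieval path boils down to a keyword query with an OData filter against the index. Here’s a rough sketch in Python using the azure-search-documents SDK (the real app is C#; the index name, filter field, and credentials are placeholders):

```python
# Rough sketch of the legacy keyword search path; names and filter fields are illustrative.
from typing import Optional
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="content-index",                    # placeholder index name
    credential=AzureKeyCredential("<api-key>"),
)

def keyword_search(query: str, entity_type: Optional[str] = None) -> list[dict]:
    # Advanced-search options become an OData filter expression.
    odata_filter = f"entityType eq '{entity_type}'" if entity_type else None
    results = search_client.search(
        search_text=query,   # keyword matching against the token-analyzed index
        filter=odata_filter,
        top=20,              # the legacy system surfaced the top 20 ranked items
    )
    return list(results)
```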

The main question for us was where to make this system agentic. The existing system resembled a basic information retrieval system with four standard components, which I isolate in the diagram below.

How can we combine both pictures, so it looks more like this?

We learned a lot of lessons figuring that out:

Lesson 1: More Agents = More Problems

Our first instinct was to build a multi-agent architecture. Each agent would handle one of the four specific information retrieval components above: one to rewrite the user query, one to retrieve documents, one to re-rank, and one to interpret user intent. We used Autogen’s GroupChatManager to coordinate conversations among five agents. It looked good on paper. But in practice, latency exploded to 20+ seconds per query, debugging was a nightmare, and the system struggled to perform under load. Using Autogen also created compatibility and method-binding issues we didn’t expect. Each agent was defined as a Python class with methods to handle agent-specific tasks. Autogen’s framework expected these methods to behave like standalone (unbound) functions, but bound methods carry metadata tied to their parent class or instance. So when Autogen tried to call a method or pull an attribute like f.__name__ off a callable, it could fail because those attributes aren’t always directly accessible on bound methods the way they are on plain functions.

So we migrated off the multi-agent architecture, which allowed the team to shift to a functional programming style, where standalone functions (not tied to a class or instance) were used. This eliminated the issue because standalone functions are simpler and don’t have the extra layers of context added by class bindings.
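
The refactor itself was mundane. Roughly, it looked like this, with a hypothetical register_tool hook standing in for the framework’s registration call:

```python
# Simplified sketch of the refactor; register_tool is a hypothetical registration hook.

# Before: behavior lives on a class, so the framework receives bound methods.
class QueryRewriterAgent:
    def __init__(self, style: str):
        self.style = style

    def rewrite(self, query: str) -> str:
        return f"({self.style}) {query}"

# register_tool(QueryRewriterAgent("concise").rewrite)  # bound method, carries instance context

# After: a plain standalone function with no hidden instance state.
def rewrite_query(query: str, style: str = "concise") -> str:
    return f"({style}) {query}"

# register_tool(rewrite_query)  # plain function, straightforward for a framework to introspect
```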

In all, we reduced the number of agents from 4 to 1, relied on one-shot prompting, and made things deterministic where possible by using Pydantic and Instructor. This simplified the architecture dramatically. Instead of having separate agents in the query re-writer component for query mapping and refinement, we used Pydantic and Instructor to handle both tasks in a single step. Retrieval and re-ranking were combined into one process as well. This eliminated extra steps and reduced latency.
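
Here’s roughly what that single step looks like with Instructor and Pydantic (a sketch: the response model, field names, and prompt are simplified, not the production schema):

```python
# Sketch of query mapping + refinement handled in one structured call; names are illustrative.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

client = instructor.from_openai(OpenAI())

class RewrittenQuery(BaseModel):
    refined_query: str = Field(description="Cleaned-up query to embed and search with")
    entity_types: list[str] = Field(description="Which indexes to hit, e.g. Cards, Users")
    filters: dict[str, str] = Field(default_factory=dict,
                                    description="Structured filters inferred from the query")

def rewrite(user_query: str) -> RewrittenQuery:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=RewrittenQuery,  # Instructor validates the LLM output against the model
        messages=[
            {"role": "system",
             "content": "Map the user's search into a refined query, target entity types, and filters."},
            {"role": "user", "content": user_query},
        ],
    )
```

Because the output is validated against the Pydantic model, downstream retrieval code can treat it like any other typed input instead of parsing free-form text.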

Another simplification was removing middleware. We used Azure App Service and a simple FastAPI implementation. To make workload balancing easier, we used a simple hash-key approach where each user session was assigned to one of four specific GPT-4o model instances based on the user’s ID. This ensured that one model instance would handle the entire session and that concurrent users wouldn’t slow each other down. It kept things predictable to maintain and efficient for users.
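
The routing itself is nothing fancy. A minimal sketch, with placeholder deployment names for the four GPT-4o instances:

```python
# Sketch of the hash-key routing: a user ID always maps to the same model instance.
import hashlib

# Placeholder deployment names for the four GPT-4o instances.
MODEL_DEPLOYMENTS = ["gpt4o-a", "gpt4o-b", "gpt4o-c", "gpt4o-d"]

def deployment_for_user(user_id: str) -> str:
    # Use a stable hash (Python's built-in hash() varies between processes).
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return MODEL_DEPLOYMENTS[int(digest, 16) % len(MODEL_DEPLOYMENTS)]
```

Because the mapping is stable across processes, every request in a user’s session lands on the same deployment, and sessions spread roughly evenly across the four instances.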

Takeaway: More agents don’t mean better; they mean slower and harder to maintain. Every agent you add should justify its existence by solving a problem that can’t be handled more efficiently or more deterministically another way. At the time of this writing, the tech hasn’t been around long enough for best-practice patterns to emerge and be adopted.


Lesson 2: Your Index is the MVP

Traditional search systems optimize their indexes for speed and relevance. Most modern search engines have been solving for relevance by capturing semantic information with pre-trained models like BERT and Word2Vec for years now. But our customer’s system wasn’t. So our first order of business was indexing content as vector embeddings to capture semantics. On its own, this would improve search relevance. But our agentic search needs went beyond that: we needed the index to support dynamic interactions, real-time re-ranking, and content refinement.

While investigating our options, we discovered that the original indexing strategy would need to be scrapped altogether. The Azure Search index stored everything as nested entities, such as Cards, Users, Groups, Topics, and Articles. While this let the system quickly pull results based on keywords, it couldn’t capture the relationship between a Card and its associated Topics or Users. Azure Search doesn’t allow vector fields for embeddings in nested data structures, so we had to rethink how data was stored and retrieved.

The fix required decoupling the nested indexes so that each content type, like Cards and Users, had its own top-level index. Each index could now support semantic search and relevance scoring because it had its own vector field.
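
Concretely, a Card now lives in its own top-level index as a flat document with a vector field, instead of as a nested entity. The field names below are illustrative:

```python
# Illustrative document shape for the decoupled "cards" index; field names are examples.
card_document = {
    "id": "card-1234",
    "title": "Fluid Mechanics Lab: Open-Channel Flow",
    "summary": "Hands-on lab exploring open-channel flow for 2nd-year students.",
    "topic_ids": ["fluid-mechanics", "labs"],   # references to other indexes instead of nesting
    "owner_user_id": "user-5678",
    "content_vector": [0.013, -0.942],          # truncated; the real embedding has thousands of dimensions
}
```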

This adjustment also simplified the Azure Function that ran every 15 minutes to re-index data. The new indexing process is designed around an incremental pipeline, using Python-based Azure Functions to handle:

  • Content Extraction: A .NET indexer pulls data from SQL databases and Azure Blob Storage, similar to the legacy system.
  • Vectorization: Instead of just tokenizing content, the new system calls a Python function to vectorize queries and content. We evaluated a number of open-source foundation models to optimize for cost and quality, but text-embedding-3-large ended up being the best fit.
  • Indexing: Embeddings and summaries are stored in specialized indexes for Cards, Users, and more (a rough sketch of these last two steps follows below).
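
A stripped-down version of the vectorize-and-index steps might look like this (index, field, and credential values are placeholders; batching and error handling are omitted):

```python
# Sketch of the incremental vectorize-and-index step (names illustrative, error handling omitted).
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import OpenAI

openai_client = OpenAI()
cards_index = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="cards",
    credential=AzureKeyCredential("<api-key>"),
)

def embed(text: str) -> list[float]:
    response = openai_client.embeddings.create(model="text-embedding-3-large", input=text)
    return response.data[0].embedding

def index_cards(cards: list[dict]) -> None:
    documents = [
        {
            "id": card["id"],
            "title": card["title"],
            "summary": card["summary"],
            "content_vector": embed(f"{card['title']}\n{card['summary']}"),
        }
        for card in cards
    ]
    cards_index.upload_documents(documents=documents)
```

Queries go through the same embed() call at search time, so they land in the same vector space as the indexed content.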

Takeaway: Your index is more than a lookup tool; it’s the backbone of context-aware search. Unnesting and decoupling indexes optimizes for extensibility, so you can add agents to handle specialized tasks in the future without prematurely over-optimizing now. Including embedding vectors positions your system to improve as embedding models improve.


Lesson 3: Optimize Coordination, not Capacity

In traditional information retrieval systems, scaling is fairly straightforward: add more compute or pick your favorite partitioning strategy, and you’re good to go. In agentic search, it’s more complicated because you need to consider how the agents will coordinate under load. Since we pivoted away from the multi-agent architecture, I’ll leave this lesson high level:

Takeaway: Manage relationships, not just capacity. Scaling agentic search is like scaling a team; adding more resources isn’t enough. You have to make sure agents are working together efficiently under load.

Think about the most frustrating thing about search for you. I bet it’s fundamentally about bad search results. In my opinion, search engines that can’t find things should get lost. That’s why enterprise search is broken. It’s one of the last places where people still tolerate tools that don’t work the way they should, and that tolerance is running out. Technology has always moved fast, but with LLMs the message is amplified: adapt, or your users will leave you behind.

Yes, agentic search won’t fix everything. But it’s a powerful framework to start from. My recommendation is to think big but start small. Begin with a focused MVP, like adding an embedding index. This will improve document retrieval and re-ranking without UI changes. Once that’s live, you can layer in advanced components like AI rewriters and readers.

If you’re building something in this space, or just want to chat about it, I’d love to talk. DM me or shoot me an email at bdey@concurrency.com.


[1] Credit to Harrison Chase at LangChain for inspiration.