Letting AI agents loose on Examine
I've been maintaining Examine for a very long time now. It's the search and indexing library that sits underneath every Umbraco site, and like any project that's been around long enough, it has plenty of hot paths that could be faster, a steady trickle of issues that need triaging, and a backlog of "I'll get to that eventually" performance work that, well… I never actually get to.
So a couple of months ago I thought I'd try an experiment - what if I stopped doing all of that myself and instead let a bunch of AI agents do it for me, on a schedule, while I got on with everything else? This post is about how that went, and honestly it went a fair bit better than I expected. We've shipped a few releases off the back of it and some of the hot-path improvements are genuinely a bit ridiculous.
So what are these "agentic workflows"?
The thing I installed is GitHub Next's Agentic Workflows - a set of markdown-defined workflows that run in GitHub Actions and drive an AI agent to do actual maintenance work on your repo. They're not chatbots. They run on a schedule, they have persistent memory, they open real pull requests, and they leave the "should this ship?" decision to me.
I ended up with a few of them running on Examine:
- Perf Improver - runs daily, hunts for performance bottlenecks, writes benchmarks to actually prove the improvement, and opens a draft PR with the before/after numbers.
- Efficiency Improver - its scrappier sibling, focused on the smaller allocation-and-LINQ-state-machine wins that add up over time.
- Daily Issue Triage - goes through untriaged issues, sets types, applies labels, spots duplicates, and leaves a tidy triage report for me.
- Agentic Maintenance - keeps the whole setup ticking along.
Each one is just a markdown file in .github/workflows/ with a description, a schedule, some safe-outputs limits (things like "you may open at most 4 PRs per run, and they must be drafts") and a big prompt describing how to behave. That's it. The nice part is that the guardrails are declarative, so the agent can't merge its own PRs, it can't touch protected files, and it can only comment so many times per run. All of that is baked in.
How it actually works day to day
The bit that makes this more than a gimmick is the persistent memory. Every run, the Perf Improver reads its own notes - which build/test/benchmark commands it validated, what's on its optimisation backlog, what it worked on last time, and which suggestions I've already ticked off. Then it does a couple of tasks in a round-robin fashion so it's not endlessly poking at the same corner of the codebase.
It also keeps a single rolling "Monthly Activity" issue open with a checklist of what needs my attention. So my side of it is pretty simple: the agent opens a draft PR with measured before/after numbers, I read it and run CI, and if I'm happy with it I merge. If I'm not, I close it and it learns for next time. That's the whole loop. I'm the reviewer, and the agent is the workhorse doing the grind I never had time for.
So what did they actually get done?
Over roughly the last two months, here's what these workflows actually got up to on Examine:
- 124 successful Perf Improver runs and 9 successful Efficiency Improver runs
- 20 successful Daily Issue Triage runs quietly keeping the issue tracker tidy
- 24 performance/efficiency PRs reviewed and merged (13 from Perf Improver, 11 from Efficiency Improver) between late May and the end of June
- Another 13 bug-fix PRs from the Copilot coding agent (more on those below)
- And the bit that actually matters - three releases shipped off this work: v3.8.0, v3.9.0 and the v4.0.0-beta.7 pre-release
That last point is really the whole thing. This isn't a pile of speculative branches rotting in a fork somewhere. This is code that went through my review, passed CI, and is now sitting in NuGet packages that real Umbraco sites are running.
Show me the numbers
Right, this is the part I actually get excited about. One of my favourite things about the Perf Improver is that it isn't allowed to just claim an improvement - it's flat-out forbidden from doing that. Every change has to come with a benchmark. So along the way it built out a proper BenchmarkDotNet suite that compares the current source against the published NuGet packages (3.0.1 through 3.3.0), which means I can show you real, reproducible deltas instead of hand-waving.
The full-text search hot path
ManagedQuery is the primary full-text search entry point in Examine - it's what runs on basically every search. The agent noticed it had no dedicated benchmark, wrote one, and then we stacked up a series of small, individually measured changes: a volatile factory cache in SearchContext.GetFieldValueType, an early return in the extract-terms check, and killing off some redundant ConcurrentDictionary lookups in AddDocument.
Here's the current source vs the most recent 3.3.0 release, on a 1,000-document index:
| Method | Version | Mean | Allocated |
|---|---|---:|---:|
| ManagedQueryAllFields | 3.3.0 | 11.42 ms | 1,323 KB |
| ManagedQueryAllFields | Source | 2.17 ms | 371 KB |
That's about 5.3x faster, with roughly 3.6x less memory allocated, on the single most travelled code path in the whole library: the busiest thing Examine does. I'll happily take that.
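To make the volatile cache and the redundant ConcurrentDictionary lookups mentioned above a bit more concrete, here's a minimal sketch of the general pattern. This is not the code from the actual PRs. FieldFactoryCache, Get and DictionaryLookups are made-up names for illustration, and the sketch simply assumes the cache remembers the last factory it resolved in a volatile field and uses a single TryGetValue rather than a ContainsKey check followed by an indexer read.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical illustration only: none of these names come from Examine.
public sealed class FieldFactoryCache
{
    // Immutable pair, so one volatile reference write publishes the name
    // and the factory together.
    private sealed class Entry
    {
        public Entry(string fieldName, Func<string> factory)
        {
            FieldName = fieldName;
            Factory = factory;
        }

        public string FieldName { get; }
        public Func<string> Factory { get; }
    }

    private readonly ConcurrentDictionary<string, Func<string>> _factories;
    private volatile Entry? _last;
    private int _dictionaryLookups;

    public FieldFactoryCache(IEnumerable<KeyValuePair<string, Func<string>>> factories)
        => _factories = new ConcurrentDictionary<string, Func<string>>(factories);

    // Exposed only so the effect of the cache is observable in this sketch.
    public int DictionaryLookups => Volatile.Read(ref _dictionaryLookups);

    public Func<string>? Get(string fieldName)
    {
        // Fast path: one volatile read, no dictionary access.
        var last = _last;
        if (last != null && last.FieldName == fieldName)
        {
            return last.Factory;
        }

        Interlocked.Increment(ref _dictionaryLookups);

        // A single TryGetValue instead of ContainsKey followed by an indexer read.
        if (!_factories.TryGetValue(fieldName, out var factory))
        {
            return null;
        }

        _last = new Entry(fieldName, factory);
        return factory;
    }
}
```

Asking this sketch for the same field twice in a row only touches the dictionary once, because the second call is served straight from the volatile field. That is the sort of saving that only shows up when a method runs on every single search.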
Building queries
GroupedAnd / GroupedOr / GroupedNot are the workhorses of the query builder. A little string[] fast-path (skipping a defensive .ToArray() copy when the caller already handed us a string[]) plus some allocation trimming got us this:
| Method | Version | Mean | Allocated |
|---|---|---:|---:|
| CreateQueryOnly | 3.3.0 | 3,995 ns | 8.34 KB |
| CreateQueryOnly | Source | 319 ns | 2.20 KB |
| GroupedAndStringArray | 3.3.0 | 21,377 ns | 21.10 KB |
| GroupedAndStringArray | Source | 16,659 ns | 14.34 KB |
The CreateQuery() baseline is about 12.5x faster and drops from 8.34 KB to 2.20 KB allocated, which is a lovely little win, and the grouped clauses take about 22% less time with roughly a third of the allocation shaved off.
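The string[] fast-path is simple enough to show in a few lines. This is a sketch of the idea rather than the code that was merged: FieldArguments and AsArray are hypothetical names, and the only thing it assumes is that the method receives its fields as an IEnumerable<string>.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not Examine's actual API.
public static class FieldArguments
{
    public static string[] AsArray(IEnumerable<string> fields)
    {
        if (fields == null)
        {
            throw new ArgumentNullException(nameof(fields));
        }

        // Fast path: the caller already handed us a string[], so reuse it
        // instead of paying for a defensive copy.
        if (fields is string[] array)
        {
            return array;
        }

        // Slow path: materialise any other sequence exactly once.
        return fields.ToArray();
    }
}
```

The trade-off is that the returned array is the caller's own instance, so this only makes sense when the receiving code never mutates it and doesn't rely on it staying unchanged after the call returns.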
Constructing a ValueSet
Every single document you index goes through a ValueSet constructor. The old path allocated an intermediate dictionary and a generator state machine per field, which is exactly the kind of thing you don't notice until you're bulk indexing a big site. The agent got rid of both:
| Method | Version | Mean | Allocated |
|---|---|---:|---:|
| FromDictionary5Fields | 3.3.0 | 1,183 ns | 2,200 B |
| FromDictionary5Fields | Source | 226 ns | 592 B |
| FromDictionary20Fields | 3.3.0 | 4,007 ns | 6,544 B |
| FromDictionary20Fields | Source | 654 ns | 1,520 B |
That's roughly 5-6x faster and about 4x less allocated on indexing, and when you're rebuilding the index on a large Umbraco site that adds up fast.
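If you're wondering what "an intermediate dictionary and a generator state machine per field" looks like, here's a rough before/after sketch. It isn't the real ValueSet constructor: ValueSetSketch, Before, After and Yield are made-up names, and the "before" shape is just my illustration of the general pattern, not a copy of the old code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of the pattern; not Examine's ValueSet code.
public static class ValueSetSketch
{
    // An iterator method: the compiler generates a state machine object
    // every time this is called, i.e. once per field.
    private static IEnumerable<object> Yield(object value)
    {
        yield return value;
    }

    // "Before" shape: an intermediate dictionary of lazy sequences,
    // then a second dictionary with the materialised values.
    public static Dictionary<string, IReadOnlyList<object>> Before(IDictionary<string, object> source)
    {
        return source
            .ToDictionary(kv => kv.Key, kv => Yield(kv.Value))
            .ToDictionary(kv => kv.Key, kv => (IReadOnlyList<object>)kv.Value.ToList());
    }

    // "After" shape: one pre-sized dictionary and one small array per field.
    public static Dictionary<string, IReadOnlyList<object>> After(IDictionary<string, object> source)
    {
        var result = new Dictionary<string, IReadOnlyList<object>>(source.Count);
        foreach (var kv in source)
        {
            result[kv.Key] = new[] { kv.Value };
        }

        return result;
    }
}
```

Both methods produce the same keys and values. The second one just gets there without the throwaway dictionary, the per-field iterator objects and the LINQ delegates, which is where the allocation savings in a change like this come from.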
The really nice touch is that these benchmark result tables now live in <remarks> doc-comments right next to the benchmark code, so the numbers are versioned in the repo alongside the thing they measure. The agent did that bit too.
It's not just perf, it fixes bugs as well
Alongside the scheduled perf work, I leaned on the Copilot coding agent for the gnarlier stuff - the kind of bug that needs actual investigation. A big one was the SyncedFileSystemDirectoryFactory (the default directory factory for Umbraco) getting into a permanent index outage when a temp index file was transiently locked, along with taxonomy replication getting out of sync. Those are exactly the sort of intermittent, hard-to-reproduce issues that can eat a whole weekend. The agent investigated, reproduced it, and proposed fixes that made the searcher lazily resettable and hardened the replication, and those shipped in the v3.x and v4 beta releases.
Was it actually worth it?
Honestly, yes. And I want to be straight about it, because there's a lot of AI hype flying around at the moment and most of it deserves a healthy dose of scepticism.
The bit that sold me wasn't that the agents are magic, because they're not. It's the measurement. Everything comes to me as a small, focused, draft PR with the numbers attached, so I can look at it, sanity check it, and decide in a couple of minutes. When the agent isn't sure about something, the prompt tells it to do nothing rather than post noise, and for the most part it actually behaves. The stuff that isn't good enough I just close, and it moves on.
It's not zero effort - I still read every change and I've closed plenty that weren't worth it. But the ratio is fantastic. I'm getting a steady stream of well-measured, single-purpose performance PRs on a library I care about, on paths I'd genuinely never have found the time to optimise by hand, and I've shipped real releases because of it. For a project that I maintain around everything else, that's a pretty great deal.
If you maintain a repo with a backlog you never get to, especially performance work that needs benchmarks to justify it, I'd genuinely recommend giving GitHub Next's Agentic Workflows a go.
You can see all of it out in the open on the Examine repo - the [perf-improver] and [efficiency-improver] PRs, the benchmark suite, and the releases they fed into. And there's a nice bonus here for me too: ExamineX, my managed, cloud-hosted Examine search offering, runs on this exact same internal plumbing. So all of this work the agents have been doing to tighten up Examine's query and indexing hot paths feeds straight through into ExamineX as well - the underlying engine gets faster and leaner, and every ExamineX site gets those wins for free without changing a thing. If you'd rather have that search running as a managed service instead of hosting Lucene indexes on your own servers, that's what ExamineX is there for. 🙂