Letting AI agents loose on Examine

I've been maintaining Examine for a very long time now. It's the search and indexing library that sits underneath every Umbraco site, and like any project that's been around long enough, it has plenty of hot paths that could be faster, a steady trickle of issues that need triaging, and a backlog of "I'll get to that eventually" performance work that, well… I never actually get to.

So a couple of months ago I thought I'd try an experiment - what if I stopped doing all of that myself and instead let a bunch of AI agents do it for me, on a schedule, while I got on with everything else? This post is about how that went. I've shipped three releases off the back of it and some of the hot-path improvements are genuinely a bit ridiculous: the main full-text search path is now about 5x faster.

So what are these "agentic workflows"?

The thing I installed is GitHub Next's Agentic Workflows - a set of markdown-defined workflows that run in GitHub Actions and drive an AI agent to do real maintenance work on your repo. They're not chatbots. They run on a schedule, they have persistent memory, they open real pull requests, and they leave the "should this ship?" decision to me.

I ended up with a few of them running on Examine:

Perf Improver - runs daily, hunts for performance bottlenecks, writes benchmarks to actually prove the improvement, and opens a draft PR with the before/after numbers.
Efficiency Improver - its scrappier sibling, focused on the smaller allocation-and-LINQ-state-machine wins that add up over time.
Daily Issue Triage - goes through untriaged issues, sets types, applies labels, spots duplicates, and leaves a tidy triage report for me.
Agentic Maintenance - keeps the whole setup ticking along.

Each one is just a markdown file in .github/workflows/ with a description, a schedule, some safe-outputs limits (things like "you may open at most 4 PRs per run, and they must be drafts") and a big prompt describing how to behave. That's it. The nice part is that the guardrails are declarative, so the agent can't merge its own PRs, it can't touch protected files, and it can only comment so many times per run. All of that is baked in.

How it actually works day to day

The bit that makes this more than a gimmick is the persistent memory. Every run, the Perf Improver reads its own notes - which build/test/benchmark commands it validated, what's on its optimisation backlog, what it worked on last time, and which suggestions I've already ticked off. Then it does a couple of tasks in a round-robin fashion so it's not endlessly poking at the same corner of the codebase.

It also keeps a single rolling "Monthly Activity" issue open with a checklist of what needs my attention. So my side of it is pretty simple: the agent opens a draft PR with measured before/after numbers, I read it and run CI, and if I'm happy with it I merge. That's the whole loop. I'm the reviewer, the agent is the workhorse doing the grind I never had time for.

So what did they actually get done?

Over roughly the last two months, here's what these workflows got up to on Examine:

124 successful Perf Improver runs and 9 successful Efficiency Improver runs
20 successful Daily Issue Triage runs quietly keeping the issue tracker tidy
24 performance/efficiency PRs reviewed and merged (13 from Perf Improver, 11 from Efficiency Improver) between late May and the end of June
And the bit that actually matters - three releases shipped off this work: v3.8.0, v3.9.0 and the v4.0.0-beta.7 pre-release

That last point is really the whole thing. This isn't a pile of speculative branches rotting in a fork somewhere - it's code that went through my review, passed CI, and is now sitting in NuGet packages that real Umbraco sites are running.

Show me the numbers

Right, this is the part I get excited about. One of my favourite things about the Perf Improver is that its prompt tells it to only attempt improvements it can measure: establish a baseline first, make the change, then measure again and document both numbers. For the algorithmic hot-path stuff that means benchmarks, so along the way it built out a proper BenchmarkDotNet suite that compares the current source against the published NuGet packages (3.0.1 through 3.3.0). That means I can show you real, reproducible deltas instead of hand-waving.
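To give a feel for what a measured comparison looks like, here's a minimal BenchmarkDotNet sketch. It isn't taken from the Examine suite: CountEvensBenchmarks and both of its methods are made-up stand-ins, and it compares two implementations inside one class rather than the current source against a published NuGet package. It assumes a console project that references the BenchmarkDotNet package.

```csharp
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Benchmarks should be run from a Release build: dotnet run -c Release
BenchmarkRunner.Run<CountEvensBenchmarks>();

// Hypothetical benchmark class, purely for illustration.
[MemoryDiagnoser] // adds the "Allocated" column to the results table
public class CountEvensBenchmarks
{
    private int[] _values = Array.Empty<int>();

    // Runs once before measuring, so setup cost stays out of the numbers.
    [GlobalSetup]
    public void Setup() => _values = Enumerable.Range(0, 1_000).ToArray();

    // Baseline = true marks the "before" that other methods are compared to.
    [Benchmark(Baseline = true)]
    public int LinqCount() => _values.Where(v => v % 2 == 0).Count();

    // The "after": same result, computed with a plain loop.
    [Benchmark]
    public int LoopCount()
    {
        var count = 0;
        foreach (var v in _values)
        {
            if (v % 2 == 0) count++;
        }
        return count;
    }
}
```

The point is the shape rather than the workload: one baseline, one candidate, and a report that shows time and allocation for both.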

The full-text search hot path

ManagedQuery is the primary full-text search entry point in Examine - it's what runs on basically every search. The agent noticed it had no dedicated benchmark, wrote one, and then stacked up a series of small, individually measured changes: a volatile factory cache in SearchContext.GetFieldValueType, an early return in the extract-terms check, and killing off some redundant ConcurrentDictionary lookups in AddDocument.

Here's the current source vs the most recent 3.3.0 release, on a 1,000-document index:

Method	Version	Mean	Allocated
ManagedQueryAllFields	3.3.0	11.42 ms	1,323 KB
ManagedQueryAllFields	Source	2.17 ms	371 KB

That's about 5.3x faster and roughly 3.6x less memory allocated on the single most travelled code path in the whole library. I'll happily take that.
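For anyone wondering what "redundant ConcurrentDictionary lookups" tends to mean in practice, here's a minimal sketch of the general pattern. This is not the Examine code: FieldLookup, its two methods and the "raw" default are hypothetical, and I'm assuming the common check-then-read shape rather than quoting what AddDocument actually did.

```csharp
using System;
using System.Collections.Concurrent;

// Usage: both helpers return the same answer for the same input.
var fieldTypes = new ConcurrentDictionary<string, string>();
fieldTypes["title"] = "fulltext";

Console.WriteLine(FieldLookup.TwoLookups(fieldTypes, "title"));
Console.WriteLine(FieldLookup.OneLookup(fieldTypes, "title"));
Console.WriteLine(FieldLookup.OneLookup(fieldTypes, "missing"));

// Hypothetical helper for illustration only.
static class FieldLookup
{
    public const string DefaultType = "raw";

    // Before: ContainsKey and the indexer each hash the key and probe the
    // dictionary. If another thread removes the key between the two calls,
    // the indexer throws KeyNotFoundException.
    public static string TwoLookups(ConcurrentDictionary<string, string> map, string field)
        => map.ContainsKey(field) ? map[field] : DefaultType;

    // After: TryGetValue does the existence check and the read in one probe.
    public static string OneLookup(ConcurrentDictionary<string, string> map, string field)
        => map.TryGetValue(field, out var type) ? type : DefaultType;
}
```

On its own that saves very little, but on a path that runs for every field of every document it's the kind of small change that shows up in a benchmark.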

Building queries

GroupedAnd / GroupedOr / GroupedNot are the workhorses of the query builder. A little string[] fast-path (skipping a defensive .ToArray() copy when the caller already handed it a string[]) plus some allocation trimming got this:

Method	Version	Mean	Allocated
CreateQueryOnly	3.3.0	3,995 ns	8.34 KB
CreateQueryOnly	Source	319 ns	2.20 KB
GroupedAndStringArray	3.3.0	21,377 ns	21.10 KB
GroupedAndStringArray	Source	16,659 ns	14.34 KB

The CreateQuery() baseline dropping from 8.34 KB down to 2.2 KB is a lovely little win, and the grouped clauses take about 22% less time with roughly a third of the allocation shaved off.
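The string[] fast-path is simple enough to sketch in a few lines. Again, this is an illustration under my own assumptions rather than the code that was merged: FieldArgs and ToFieldArray are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] asArray = { "title", "body" };
var asList = new List<string> { "title", "body" };

// A string[] comes straight back: same instance, no copy.
Console.WriteLine(object.ReferenceEquals(asArray, FieldArgs.ToFieldArray(asArray)));
// Any other IEnumerable<string> is still materialised into a new array.
Console.WriteLine(FieldArgs.ToFieldArray(asList).Length);

// Hypothetical helper for illustration only.
static class FieldArgs
{
    public static string[] ToFieldArray(IEnumerable<string> fields)
    {
        if (fields is null) throw new ArgumentNullException(nameof(fields));

        // Fast path: reuse the caller's array when that is what was passed in.
        // Slow path: fall back to ToArray() for lists, iterators and so on.
        return fields as string[] ?? fields.ToArray();
    }
}
```

The trade-off is that the fast path shares the caller's array instead of copying it, so it only makes sense where the callee doesn't hold on to the array or expect it to stay unchanged.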

Constructing a ValueSet

Every single document you index goes through a ValueSet constructor. The old path allocated an intermediate dictionary and a generator state machine per field, which is exactly the kind of thing you don't notice until you're bulk indexing a big site. The agent got rid of both:

Method	Version	Mean	Allocated
FromDictionary5Fields	3.3.0	1,183 ns	2,200 B
FromDictionary5Fields	Source	226 ns	592 B
FromDictionary20Fields	3.3.0	4,007 ns	6,544 B
FromDictionary20Fields	Source	654 ns	1,520 B

That's roughly 5-6x faster and about 4x less allocated on indexing, and when you're rebuilding the index on a large Umbraco site that adds up fast.
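Here's a rough sketch of what "an intermediate dictionary and a generator state machine per field" can look like, next to a direct version. It's a reconstruction from that description only: ValueSetSketch and its methods are hypothetical and don't mirror the real ValueSet constructor.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var source = new Dictionary<string, object> { ["title"] = "Hello", ["id"] = 42 };

var before = ValueSetSketch.WithIntermediates(source);
var after = ValueSetSketch.Direct(source);

// Both shapes end up holding the same values per field.
Console.WriteLine(before["title"][0]);
Console.WriteLine(after["id"][0]);

// Hypothetical stand-in for illustration only.
static class ValueSetSketch
{
    // An iterator method: each call allocates a compiler-generated state machine.
    private static IEnumerable<object> Yield(object value)
    {
        yield return value;
    }

    // "Before" shape: build a dictionary of lazy sequences, then walk it
    // again to materialise every sequence into a list.
    public static Dictionary<string, IReadOnlyList<object>> WithIntermediates(
        IDictionary<string, object> values)
    {
        var lazy = values.ToDictionary(kv => kv.Key, kv => Yield(kv.Value));
        return lazy.ToDictionary(kv => kv.Key, kv => (IReadOnlyList<object>)kv.Value.ToList());
    }

    // "After" shape: one pre-sized dictionary and one single-element array
    // per field, with no intermediate collection and no iterator.
    public static Dictionary<string, IReadOnlyList<object>> Direct(
        IDictionary<string, object> values)
    {
        var result = new Dictionary<string, IReadOnlyList<object>>(values.Count);
        foreach (var kv in values)
        {
            result[kv.Key] = new[] { kv.Value };
        }
        return result;
    }
}
```

Pre-sizing the dictionary with values.Count also avoids the resizes that would otherwise happen as fields are added.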

The really nice touch is that these benchmark result tables now live in <remarks> doc-comments right next to the benchmark code, so the numbers are versioned in the repo alongside the thing they measure. The agent did that bit too.

Was it actually worth it?

For me the thing that makes it work is the measurement. Everything comes to me as a small, focused, draft PR with the numbers attached, so I can look at it, sanity check it, run CI, and decide in a couple of minutes. It's not zero effort - I still read every change before it goes anywhere near main - but the ratio is fantastic. I'm getting a steady stream of well-measured, single-purpose performance PRs on a library I care about, on paths I'd genuinely never have found the time to optimise by hand, and I've shipped real releases because of it. For a project I maintain around everything else, that's a pretty great deal.

If you maintain a repo with a backlog you never get to, especially performance work that needs benchmarks to justify it, I'd recommend giving GitHub Next's Agentic Workflows a go.

You can see all of it out in the open on the Examine repo - the [perf-improver] and [efficiency-improver] PRs, the benchmark suite, and the releases they fed into. And there's a nice bonus here for me too: ExamineX, my managed, cloud-hosted Examine search offering, runs on this exact same internal plumbing. So all this work the agents have been doing to tighten up Examine's query and indexing hot paths feeds straight through into ExamineX - the underlying engine gets faster and leaner, and every ExamineX site gets those wins for free without changing a thing. If you'd rather have your search running as a managed service instead of hosting Lucene indexes on your own servers, that's what ExamineX is there for. 🙂

Shazwazza

My blog which is pretty much just all about coding