Examine and Azure Blob Storage

Quite some time ago - probably close to 2 years - I created an alpha version of an extension library for Examine called Examine.AzureDirectory, which allows storing Lucene indexes in Blob Storage. This idea isn’t new at all and in fact there’s been a library to do this for many years called AzureDirectory, but it previously had issues and it wasn’t clear exactly what its limitations were. The Examine.AzureDirectory implementation was built using a lot of the original AzureDirectory code but has a bunch of fixes (which I contributed back to that project) and different ways of working with the data. And since Examine 0.1.90 still works with Lucene 2.x, this library is also compatible with that older Lucene version.

… And 2 years later, I’ve actually released a real version

Why is this needed?

There are a couple of reasons. Firstly, Azure web app storage runs on a network share and Lucene absolutely does not like its files hosted on a network share; this brings all sorts of strange performance issues among other things. The way AzureDirectory works is to store the ‘master’ index in Blob Storage and then sync the required Lucene files to the local ‘fast drive’. In Azure web apps there are two drives: the ‘slow drive’ (the network share) and the ‘fast drive’, which is the local server’s temp storage with limited space. By syncing the Lucene files to the local fast drive, Lucene is no longer operating over a network share. When writes occur, they go to the local fast drive first and those changes are then pushed back to the master index in Blob Storage. This isn’t the only way to overcome this limitation of Lucene; in fact Examine has shipped a workaround for many years which uses something called SyncDirectory that does more or less the same thing, except that the master index is stored on the ‘slow drive’ instead of in Blob Storage. Someone has actually taken this code and made a separate standalone project with this logic, also called SyncDirectory, which is pretty cool!
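
To make that concrete, here’s a minimal sketch of the general idea: the master index lives in a Blob Storage container, gets copied down to a local ‘fast drive’ folder, and local writes are pushed back up. This is a hypothetical helper written against the classic WindowsAzure.Storage SDK, not Examine.AzureDirectory’s actual code, and all names here are made up.

// Simplified sketch of the master-index-in-Blob-Storage idea (hypothetical helper,
// not Examine.AzureDirectory's actual implementation). Assumes the classic
// WindowsAzure.Storage SDK.
using System.IO;
using System.Linq;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class IndexSyncSketch
{
    // Pull the master index files from Blob Storage down to the local 'fast drive'.
    public static void SyncMasterToLocal(string connString, string containerName, string localPath)
    {
        var account = CloudStorageAccount.Parse(connString);
        var container = account.CreateCloudBlobClient().GetContainerReference(containerName);
        container.CreateIfNotExists();
        Directory.CreateDirectory(localPath);

        foreach (var blob in container.ListBlobs().OfType<CloudBlockBlob>())
        {
            // Only copy files we don't already have locally (a real implementation
            // would also compare timestamps/checksums before skipping).
            var localFile = Path.Combine(localPath, blob.Name);
            if (!File.Exists(localFile))
                blob.DownloadToFile(localFile, FileMode.Create);
        }
    }

    // After a local write, push a changed Lucene file back to the master index.
    public static void PushLocalToMaster(CloudBlobContainer container, string localFile)
    {
        var blob = container.GetBlockBlobReference(Path.GetFileName(localFile));
        using (var stream = File.OpenRead(localFile))
        {
            blob.UploadFromStream(stream);
        }
    }
}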

Load balancing/Scaling

There are a couple of ways to work around the network share storage in Azure web apps (as above), but in my opinion the main reason this is important is load balancing and being able to scale out. Since Lucene doesn’t work well over a network share, the Lucene files must exist locally to the process that’s using them. That means that when you are load balancing or scaling out, each server that is handling requests has its own local Lucene index. So what happens when you scale out further and another new worker goes online? That really depends on the hosting application… for example in Umbraco, the new worker will create its own local indexes by rebuilding the indexes from the source data (i.e. the database). This isn’t an ideal scenario, especially in Umbraco v7 where requests won’t be served until the index is built and ready. A better scenario is that the new worker comes online and then syncs an existing index from master storage that is shared between all workers …. yes! like Blob Storage.

Read/Write vs Read only

A Lucene index can’t be written to concurrently by multiple processes. There are some workarounds here and there that try to achieve this by synchronizing processes with named mutex/semaphore locks, and even AzureSearch tries to handle some of this by utilizing Blob Storage leases, but it’s not a seamless experience. This is one of the reasons why Umbraco requires a ‘master’ web app for writing and a separate web app for scaling, which guarantees that only one process writes to the indexes. This is the setup that Examine.AzureDirectory supports too: on the front-end/replica/slave web app that scales out, you configure the provider to be read-only, which guarantees it will never try to write back to the (probably locked) index in Blob Storage.
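
As background, a Blob Storage lease can act as a crude cross-process write lock: whichever process holds the lease on a designated lock blob is the only one allowed to write, and everyone else fails fast. Here’s an illustrative sketch of that idea with hypothetical names (classic WindowsAzure.Storage SDK assumed); it’s not necessarily what any of these libraries do internally.

// Rough sketch of a Blob Storage lease used as a cross-process write lock
// (illustrative only; hypothetical names, classic WindowsAzure.Storage SDK assumed).
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class WriteLockSketch
{
    public static void WithWriteLock(CloudBlobContainer container, Action doWrite)
    {
        var lockBlob = container.GetBlockBlobReference("write.lock");
        if (!lockBlob.Exists())
            lockBlob.UploadText(string.Empty); // the lock blob just needs to exist

        // AcquireLease throws if another process already holds the lease.
        // A real implementation would renew the lease for long-running writes.
        var leaseId = lockBlob.AcquireLease(TimeSpan.FromSeconds(60), null);
        try
        {
            doWrite();
        }
        finally
        {
            lockBlob.ReleaseLease(AccessCondition.GenerateLeaseCondition(leaseId));
        }
    }
}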

With this in place, when a new front-end worker goes online it doesn’t need to rebuild its own local indexes; it just checks that the indexes exist, which is done by making sure the master index is present, and then continues booting. At this stage there’s actually almost no performance overhead. Nothing happens with the local indexes until an index is referenced by this worker, and when that happens Examine will lazily sync just the Lucene files that it needs locally.
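
In other words, the read-only worker only needs a cheap existence check at boot, and individual files are pulled down the first time Lucene asks for them. A hypothetical sketch of that behaviour (again, not the library’s actual code):

// Hypothetical sketch of the read-only worker's behaviour: cheap existence check
// at boot, then individual Lucene files are only downloaded on first access.
using System.IO;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Blob;

public static class ReadOnlySyncSketch
{
    // Boot-time check: is there a master index in the container at all?
    // (Lucene 2.x writes 'segments.gen'/'segments_N' files for every index.)
    public static bool MasterIndexExists(CloudBlobContainer container)
    {
        return container.ListBlobs().OfType<CloudBlockBlob>()
                        .Any(b => b.Name.StartsWith("segments"));
    }

    // First-access sync: download a single index file only when Lucene needs it.
    public static string EnsureLocalFile(CloudBlobContainer container, string localPath, string fileName)
    {
        Directory.CreateDirectory(localPath);
        var localFile = Path.Combine(localPath, fileName);
        if (!File.Exists(localFile))
            container.GetBlockBlobReference(fileName).DownloadToFile(localFile, FileMode.Create);
        return localFile;
    }
}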

How do I get it?

The first thing to point out is that this first release is only for Examine 0.1.90, which is for Umbraco v7. Support for Examine 1.x and Umbraco 8.x will come out very soon with some slightly different install instructions.

The release notes are here, the install docs are here, and the NuGet package can be found here.

PM> Install-Package Examine.AzureDirectory -Version 0.1.90

To activate it, you need to add these settings to your web.config:

<add key="examine:AzureStorageConnString" value="YOUR-STORAGE-CONNECTION-STRING" />
<add key="examine:AzureStorageContainer" value="YOUR-CONTAINER-NAME" />

Then for your master server/web app you’ll want to add a directoryFactory attribute to each of your indexers in ExamineSettings.config, for example:

<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
      supportUnpublished="true"
      supportProtected="true"
      directoryFactory="Examine.AzureDirectory.AzureDirectoryFactory, Examine.AzureDirectory"
      analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

For your front-end/replica/slave server you’ll want to use the read-only directoryFactory instead, like:

<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
      supportUnpublished="true"
      supportProtected="true"
      directoryFactory="Examine.AzureDirectory.ReadOnlyAzureDirectoryFactory, Examine.AzureDirectory"
      analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

Does it work?

Great question :) With the testing that I’ve done it works, and I’ve had this running on this site for all of last year without issue, but I haven’t rigorously tested this at scale with high traffic sites, etc… I’ve decided to release a real version of this because having it as an alpha/proof of concept means that nobody will test or use it. So now hopefully a few of you will give this a whirl and let everyone know how it goes. Any bugs can be submitted to the Examine repo.

Author

Shannon Thompson

I'm a Senior Software Engineer working full time at Microsoft. Previously, I was working at Umbraco HQ for about 10 years. I maintain several open source projects (many related to Umbraco) such as Articulate, Examine and Smidge, and I also have a commercial software offering called ExamineX. Welcome to my blog :)
