Shazwazza

Can I disable Examine indexes on Umbraco front-end servers?

Thu, 23 Mar 2023 15:10:30 Z

In Umbraco v8, Examine and Lucene are only used for the back office searches, unless you specifically use those APIs for your front-end pages. I was recently asked whether it’s possible to disable Examine/Lucene for front-end servers since they didn’t use Examine/Lucene APIs at all on their front-end pages… here’s the answer.

Why would you want this?

If you are running a Load Balancing setup in Azure App Service then you have the option to scale out (and perhaps you do!). In this case, you need to have the Examine configuration option of:

<add key="Umbraco.Examine.LuceneDirectoryFactory" 
          value="Examine.LuceneEngine.Directories.TempEnvDirectoryFactory, Examine" />

This is because each scaled out worker is running from the same network share file system. Without this setting (or with the SyncTempEnvDirectoryFactory setting), each worker would try to write Lucene file based indexes to the same location, which will result in corrupt indexes and locked files. Using the TempEnvDirectoryFactory means that the indexes will only be stored on the worker's local 'fast drive', which is in its %temp% folder on the local (non-network share) hard disk.

When a site is moved or provisioned on a new worker the local %temp% location will be empty so Lucene indexes will be rebuilt on startup for that worker. This will occur when Azure moves a site or when a new worker comes online from a scale out action. When indexes are rebuilt, the worker will query the database for the data and depending on how much data you have in your Umbraco installation, this could take a few minutes which can be problematic. Why? Because Umbraco v8 uses distributed SQL locks to ensure data integrity and during these queries a content lock will be created which means other back office operations on content will need to wait. This can end up with SQL Lock timeout issues. An important thing to realize is that these rebuild queries will occur for all new workers, so if you scaled out from 1 to 10, that is 9 new workers coming online at the same time.

How to avoid this problem?

If you use Examine APIs on your front-end, then you cannot just disable Examine/Lucene, so the only reasonable solution is to use an Examine implementation that uses a hosted search service like ExamineX.

If you don't use Examine APIs on your front-ends then it is a reasonable solution to disable Examine/Lucene on the front-ends to avoid this issue. To do that, you would change the default Umbraco indexes to use an in-memory only store and prohibit data from being put into the indexes. Then disable the queries that execute when Umbraco tries to re-populate the indexes.

Show me the code

First thing is to replace the default index factory. This new one will change the underlying Lucene directory for each index to be a RAMDirectory and will also disable the default Umbraco event handling that populates the indexes. This means Umbraco will not try to update the index based on content, media or member changes.

public class InMemoryExamineIndexFactory : UmbracoIndexesCreator
{
    public InMemoryExamineIndexFactory(
        IProfilingLogger profilingLogger,
        ILocalizationService languageService,
        IPublicAccessService publicAccessService,
        IMemberService memberService,
        IUmbracoIndexConfig umbracoIndexConfig)
        : base(profilingLogger, languageService, publicAccessService, memberService, umbracoIndexConfig)
    {
    }

    public override IEnumerable<IIndex> Create()
    {
        return new[]
        {
            CreateInternalIndex(),
            CreateExternalIndex(),
            CreateMemberIndex()
        };
    }

    // all of the below is the same as Umbraco defaults, except
    // we are using an in-memory Lucene directory.

    private IIndex CreateInternalIndex()
        => new UmbracoContentIndex(
            Constants.UmbracoIndexes.InternalIndexName,
            new RandomIdRamDirectory(), // in-memory dir
            new UmbracoFieldDefinitionCollection(),
            new CultureInvariantWhitespaceAnalyzer(),
            ProfilingLogger,
            LanguageService,
            UmbracoIndexConfig.GetContentValueSetValidator())
        {
            EnableDefaultEventHandler = false
        };

    private IIndex CreateExternalIndex()
        => new UmbracoContentIndex(
            Constants.UmbracoIndexes.ExternalIndexName,
            new RandomIdRamDirectory(), // in-memory dir
            new UmbracoFieldDefinitionCollection(),
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
            ProfilingLogger,
            LanguageService,
            UmbracoIndexConfig.GetPublishedContentValueSetValidator())
        {
            EnableDefaultEventHandler = false
        };

    private IIndex CreateMemberIndex()
        => new UmbracoMemberIndex(
            Constants.UmbracoIndexes.MembersIndexName,
            new UmbracoFieldDefinitionCollection(),
            new RandomIdRamDirectory(), // in-memory dir
            new CultureInvariantWhitespaceAnalyzer(),
            ProfilingLogger,
            UmbracoIndexConfig.GetMemberValueSetValidator())
        {
            EnableDefaultEventHandler = false
        };

    // required so that each ram dir has a different ID
    private class RandomIdRamDirectory : RAMDirectory
    {
        private readonly string _lockId = Guid.NewGuid().ToString();
        public override string GetLockId()
        {
            return _lockId;
        }
    }
}

The next thing to do is to create no-op index populators to replace the Umbraco default ones. All these do is ensure they are not associated with any index and then, just to be sure, they do not execute any logic for population.

public class DisabledMemberIndexPopulator : MemberIndexPopulator
{
    public DisabledMemberIndexPopulator(
        IMemberService memberService,
        IValueSetBuilder<IMember> valueSetBuilder)
        : base(memberService, valueSetBuilder)
    {
    }

    public override bool IsRegistered(IIndex index) => false;
    public override bool IsRegistered(IUmbracoMemberIndex index) => false;
    protected override void PopulateIndexes(IReadOnlyList<IIndex> indexes) { }
}

public class DisabledContentIndexPopulator : ContentIndexPopulator
{
    public DisabledContentIndexPopulator(
        IContentService contentService,
        ISqlContext sqlContext,
        IContentValueSetBuilder contentValueSetBuilder)
        : base(contentService, sqlContext, contentValueSetBuilder)
    {
    }

    public override bool IsRegistered(IIndex index) => false;
    public override bool IsRegistered(IUmbracoContentIndex2 index) => false;
    protected override void PopulateIndexes(IReadOnlyList<IIndex> indexes) { }
}

public class DisabledPublishedContentIndexPopulator : PublishedContentIndexPopulator
{
    public DisabledPublishedContentIndexPopulator(
        IContentService contentService,
        ISqlContext sqlContext,
        IPublishedContentValueSetBuilder contentValueSetBuilder)
        : base(contentService, sqlContext, contentValueSetBuilder)
    {
    }

    public override bool IsRegistered(IIndex index) => false;
    public override bool IsRegistered(IUmbracoContentIndex2 index) => false;
    protected override void PopulateIndexes(IReadOnlyList<IIndex> indexes) { }
}

public class DisabledMediaIndexPopulator : MediaIndexPopulator
{
    public DisabledMediaIndexPopulator(
        IMediaService mediaService,
        IValueSetBuilder<IMedia> mediaValueSetBuilder) : base(mediaService, mediaValueSetBuilder)
    {
    }

    public override bool IsRegistered(IIndex index) => false;
    public override bool IsRegistered(IUmbracoContentIndex index) => false;
    protected override void PopulateIndexes(IReadOnlyList<IIndex> indexes) { }
}

Lastly, we just need to enable these services:

public class DisabledExamineComposer : IUserComposer
{
    public void Compose(Composition composition)
    {
        // replace the default
        composition.RegisterUnique<IUmbracoIndexesCreator, InMemoryExamineIndexFactory>();

        // replace the default populators
        composition.Register<MemberIndexPopulator, DisabledMemberIndexPopulator>(Lifetime.Singleton);
        composition.Register<ContentIndexPopulator, DisabledContentIndexPopulator>(Lifetime.Singleton);
        composition.Register<PublishedContentIndexPopulator, DisabledPublishedContentIndexPopulator>(Lifetime.Singleton);
        composition.Register<MediaIndexPopulator, DisabledMediaIndexPopulator>(Lifetime.Singleton);
    }
}

With that all in place, it means that no data will ever be looked up to rebuild indexes and Umbraco will not send data to be indexed. There is nothing here preventing data from being indexed though. For example, if you use the Examine APIs to update the index directly, that data will be indexed in memory. If you wanted to absolutely make sure no data ever went into the index, you would have to override some methods on the RAMDirectory.

Can I run Examine with RAMDirectory with data?

You might have realized with the above that if you don't replace the populators, you will essentially have Examine indexes in Umbraco running from RAMDirectory. Is this ok? Yes absolutely, but that entirely depends on your data set. If you have a large index, that means it will consume large amounts of memory which is typically not a good idea. But if you have a small data set, or you filter the index so that it remains small enough, then yes! You can certainly run Examine with an in-memory directory but this would still only be advised on your front-end/replica servers.

Searching with IPublishedContentQuery in Umbraco

Thu, 23 Mar 2023 15:10:02 Z

I recently realized that I don’t think Umbraco’s APIs on IPublishedContentQuery are documented so hopefully this post may inspire some docs to be written or at least guide some folks on some functionality they may not know about.

A long while back, even in Umbraco v7, UmbracoHelper was split into different components and UmbracoHelper just wrapped these. One of these components was called ITypedPublishedContentQuery, which in v8 is now called IPublishedContentQuery, and this component is responsible for executing queries for content and media on the front-end in razor templates. In v8 a lot of methods were removed or obsoleted from UmbracoHelper so that it wasn’t one gigantic object, which steers developers towards using these sub components directly instead. For example, if you try to access UmbracoHelper.ContentQuery you’ll see that it has been deprecated with the message:

Inject and use an instance of IPublishedContentQuery in the constructor for using it in classes or get it from Current.PublishedContentQuery in views

and the UmbracoHelper.Search methods from v7 have been removed and now only exist on IPublishedContentQuery.

There are API docs for IPublishedContentQuery which are a bit helpful; at least they will tell you what all the available methods and parameters are. The main ones I wanted to point out are the Search methods.

Strongly typed search responses

When you use Examine directly to search you will get an Examine ISearchResults object back which is more or less raw data. It’s possible to work with that data but most people want to work with some strongly typed data and at the very least in Umbraco with IPublishedContent. That is pretty much what IPublishedContentQuery.Search methods are solving. Each of these methods will return an IEnumerable<PublishedSearchResult> and each PublishedSearchResult contains an IPublishedContent instance along with a Score value. A quick example in razor:

@inherits Umbraco.Web.Mvc.UmbracoViewPage
@using Current = Umbraco.Web.Composing.Current;
@{
    var search = Current.PublishedContentQuery.Search(Request.QueryString["query"]);
}

<div>
    <h3>Search Results</h3>
    <ul>
        @foreach (var result in search)
        {
            <li>
                Id: @result.Content.Id
                <br/>
                Name: @result.Content.Name
                <br />
                Score: @result.Score
            </li>
        }
    </ul>
</div>

The ordering of this search is by Score so the highest score is first. This makes searching very easy while the underlying mechanism is still Examine. The IPublishedContentQuery.Search methods make working with the results a bit nicer.

Paging results

You may have noticed that there are a few overloads and optional parameters to these search methods too. Two of the overloads support paging parameters and these take care of all of the quirks with Lucene paging for you. I wrote a previous post about paging with Examine and you need to make sure you do that correctly, else you’ll end up iterating over possibly tons of search results, which can cause performance problems. Expanding on the above example with paging is super easy:

@inherits Umbraco.Web.Mvc.UmbracoViewPage
@using Current = Umbraco.Web.Composing.Current;
@{
    var pageSize = 10;
    int.TryParse(Request.QueryString["page"], out var pageIndex); // 0 when missing or invalid
    var search = Current.PublishedContentQuery.Search(
        Request.QueryString["query"],
        pageIndex * pageSize,   // skip
        pageSize,               // take
        out var totalRecords);
}

<div>
    <h3>Search Results</h3>
    <ul>
        @foreach (var result in search)
        {
            <li>
                Id: @result.Content.Id
                <br/>
                Name: @result.Content.Name
                <br />
                Score: @result.Score
            </li>
        }
    </ul>
</div>
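
The page value comes straight from the query string, so it may be missing, non-numeric or negative, and the totalRecords value is what you would use to render pager links. Here is a minimal sketch of that arithmetic, assuming a hypothetical SearchPaging helper (it is not part of Umbraco or Examine) and a zero-based page index as in the example above:

```csharp
using System;

// Hypothetical helper for illustration; not part of Umbraco or Examine.
public static class SearchPaging
{
    // Parses a zero-based page index from a raw query-string value,
    // falling back to the first page for missing, invalid or negative input.
    public static int ParsePageIndex(string raw)
    {
        return int.TryParse(raw, out var page) && page >= 0 ? page : 0;
    }

    // Number of records to skip for a zero-based page index.
    public static int Skip(int pageIndex, int pageSize)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));
        return pageIndex * pageSize;
    }

    // Total number of pages for the totalRecords value returned by Search.
    public static long TotalPages(long totalRecords, int pageSize)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));
        return (totalRecords + pageSize - 1) / pageSize;
    }
}
```

In the view, SearchPaging.Skip(pageIndex, pageSize) would be passed as the skip argument, and SearchPaging.TotalPages(totalRecords, pageSize) tells you how many page links to show.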

Simple search with cultures

Another optional parameter you might have noticed is the culture parameter. The docs state this about the culture parameter:

When the culture is not specified or is *, all cultures are searched. To search for only invariant documents and fields use null. When searching on a specific culture, all culture specific fields are searched for the provided culture and all invariant fields for all documents. While enumerating results, the ambient culture is changed to be the searched culture.

What this is saying is that if you aren’t using culture variants in Umbraco then don’t worry about it. But if you are, you will also generally not have to worry about it either! What?! By default the simple Search method will use the “ambient” (aka ‘Current’) culture to search and return data. So if you are currently browsing your “fr-FR” culture site this method will automatically only search for your data in your French culture but will also search on any invariant (non-culture) data. And as a bonus, the IPublishedContent returned also uses this ambient culture so any values you retrieve from the content item without specifying the culture will just be the ambient/default culture.

So why is there a “culture” parameter? It’s just there in case you want to search on a specific culture instead of relying on the ambient/current one.
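
As a small sketch of passing the culture explicitly, the first razor example can be narrowed to one language. This assumes the parameter is named culture, as the overload list in the API docs suggests, and that “fr-FR” is a culture configured on your site; adjust both to your own setup.

```cshtml
@inherits Umbraco.Web.Mvc.UmbracoViewPage
@using Current = Umbraco.Web.Composing.Current;
@{
    // Assumption: "fr-FR" is a culture configured on this site.
    // Passing it explicitly searches that culture instead of
    // relying on the ambient/current one.
    var search = Current.PublishedContentQuery.Search(
        Request.QueryString["query"],
        culture: "fr-FR");
}
```

The results can then be rendered with the same foreach loop as in the earlier examples.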

Search with IQueryExecutor

IQueryExecutor is the resulting object created when creating a query with the Examine fluent API. This means you can build up any complex Examine query you want, even with raw Lucene, and then pass this query to one of the IPublishedContentQuery.Search overloads and you’ll get all the goodness of the above queries. There’s also paging overloads with IQueryExecutor too. To further expand on the above example:

@inherits Umbraco.Web.Mvc.UmbracoViewPage
@using Current = Umbraco.Web.Composing.Current;
@{
    // Get the external index with error checking
    if (!ExamineManager.Instance.TryGetIndex(
        Constants.UmbracoIndexes.ExternalIndexName, out var index))
    {
        throw new InvalidOperationException(
            $"No index found with name {Constants.UmbracoIndexes.ExternalIndexName}");
    }

    // build an Examine query
    var query = index.GetSearcher().CreateQuery()
        .GroupedOr(new [] { "pageTitle", "pageContent"},
            Request.QueryString["query"].MultipleCharacterWildcard());


    var pageSize = 10;
    int.TryParse(Request.QueryString["page"], out var pageIndex); // 0 when missing or invalid
    var search = Current.PublishedContentQuery.Search(
        query,                  // pass the examine query in!
        pageIndex * pageSize,   // skip
        pageSize,               // take
        out var totalRecords);
}

<div>
    <h3>Search Results</h3>
    <ul>
        @foreach (var result in search)
        {
            <li>
                Id: @result.Content.Id
                <br/>
                Name: @result.Content.Name
                <br />
                Score: @result.Score
            </li>
        }
    </ul>
</div>

The base interface of the fluent parts of Examine’s queries is IQueryExecutor, so you can just pass in your query to the method and it will work.

Recap

The IPublishedContentQuery.Search overloads are listed in the API docs, they are:

Search(String term, String culture, String indexName)
Search(String term, Int32 skip, Int32 take, out Int64 totalRecords, String culture, String indexName)
Search(IQueryExecutor query)
Search(IQueryExecutor query, Int32 skip, Int32 take, out Int64 totalRecords)

Should you always use this instead of using Examine directly? As always it just depends on what you are doing. If you need a ton of flexibility with your search results then maybe you want to use Examine’s search results directly, but if you want simple and quick access to IPublishedContent results, then these methods will work great.

Does this all work with ExamineX? Absolutely!! One of the best parts of ExamineX is that it’s completely seamless. ExamineX is just an index implementation of Examine itself so all Examine APIs and therefore all Umbraco APIs that use Examine will ‘just work’.

Filtering fields dynamically with Examine

Thu, 23 Mar 2023 15:09:58 Z

The index fields created by Umbraco in Examine by default can lead to quite a substantial number of fields. This is primarily due to how Umbraco handles variant/culture data, because it will create a different field per culture, but there are other factors as well. Umbraco will create a “__Raw_” field for each rich text field and if you use the grid, it will create different fields for each grid row type. There are good reasons for all of these fields and this allows you by default to have the most flexibility when querying and retrieving your data from the Examine indexes. But in some cases these default fields can be problematic. Examine by default uses Lucene as its indexing engine and Lucene itself doesn’t have any hard limits on field count (as far as I know), however if you swap the indexing engine in Examine to something else like Azure Search with ExamineX then you may find your indexes are exceeding Azure Search’s limits.

Azure Search field count limits

Azure Search has varying limits for field counts based on the tier service level you have (strangely the Free tier allows more fields than the Basic tier). The absolute maximum however is 1000 fields and although that might seem like quite a lot, when you take into account all of the fields created by Umbraco you might realize it’s not that difficult to exceed this limit. As an example, let’s say you have an Umbraco site using language variants and you have 20 languages in use. Then let’s say you have 15 document types each with 5 fields (all with unique aliases) and each field is variant and you have content for each of these document types and languages created. This immediately means you are exceeding the field count limits: 20 x 15 x 5 = 1500 fields! And that’s not including the “__Raw_” fields or the extra grid fields or the required system fields like “id” and “nodeName”. I’m unsure why Azure Search even has this restriction in place.
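
To make that arithmetic concrete, here is a minimal sketch that enumerates one field per property alias per culture and counts the distinct names. The “alias_culture” naming, the made-up aliases and the made-up culture codes are assumptions for illustration only; the FieldCountEstimate class is hypothetical and not part of Umbraco or Examine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of Umbraco or Examine.
public static class FieldCountEstimate
{
    // One index field per (property alias, culture) pair.
    // The "alias_culture" format is an assumption for illustration.
    public static IEnumerable<string> VariantFieldNames(
        IEnumerable<string> propertyAliases, IEnumerable<string> cultures)
    {
        return from alias in propertyAliases
               from culture in cultures
               select alias + "_" + culture.ToLowerInvariant();
    }

    // Builds the scenario from the text and counts the distinct fields.
    public static int CountExampleFields(int languages, int documentTypes, int fieldsPerType)
    {
        // made-up culture codes and aliases, unique per document type
        var cultures = Enumerable.Range(1, languages).Select(i => "lang-" + i);
        var aliases = from d in Enumerable.Range(1, documentTypes)
                      from f in Enumerable.Range(1, fieldsPerType)
                      select "docType" + d + "Field" + f;
        return VariantFieldNames(aliases, cultures).Distinct().Count();
    }
}
```

Calling FieldCountEstimate.CountExampleFields(20, 15, 5) gives 1500 for the scenario above, before any “__Raw_”, grid or system fields are counted.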

Why is Umbraco creating a field per culture?

When v8 was being developed a choice had to be made about how to handle multi-lingual data in Examine/Lucene. There are a couple of factors to consider when making this decision, which mostly boil down to how Lucene’s analyzers work. The choice is either: language per field or language per index. Some folks might think, can’t we ‘just’ have a language per document? Unfortunately the answer is no because that would require you to apply a specific language analyzer for that document and then scoring would no longer work between documents. Elastic Search has a good write up about this. So either language per field or different indexes per language. Each has pros/cons but Umbraco went with language per field since it’s quite easy to set up, supports different analyzers per language and doesn’t require a ton of indexes, which would also incur a lot more overhead and configuration.

Do I need all of these fields?

That really depends on what you are searching on but the answer is most likely ‘no’. You probably aren’t going to be searching on over 1000 fields, but who knows, every site’s requirements are different. Umbraco Examine has something called an IValueSetValidator which you can configure to include/exclude certain fields or document types. This is synonymous with part of the old XML configuration in Examine. This is one of those things where configuration can make sense for Examine and @callumwhyte has done exactly that with his package “Umbraco Examine Config”. But the IValueSetValidator isn’t all that flexible and works based on exact naming, which will work great for filtering content types but perhaps not field names. (Side note: I’m unsure if the Umbraco Examine Config package will work alongside ExamineX, need to test that out).

Since Umbraco creates fields with the same prefixed names for all languages it’s relatively easy to filter the fields based on a matching prefix for the fields you want to keep.

Here’s some code!

The following code is relatively straightforward, with inline comments: a custom class “IndexFieldFilter” that does the filtering and can be applied differently for any index by name, a Component to apply the filtering, and a Composer to register services. This code will also ensure that all Umbraco required fields are retained so anything that Umbraco is reliant upon will still work.

/// <summary>
/// Register services
/// </summary>
public class MyComposer : ComponentComposer<MyComponent>
{
    public override void Compose(Composition composition)
    {
        base.Compose(composition);
        composition.RegisterUnique<IndexFieldFilter>();
    }
}

public class MyComponent : IComponent
{
    private readonly IndexFieldFilter _indexFieldFilter;

    public MyComponent(IndexFieldFilter indexFieldFilter)
    {
        _indexFieldFilter = indexFieldFilter;
    }

    public void Initialize()
    {
        // Apply an index field filter to an index
        _indexFieldFilter.ApplyFilter(
            // Filter the external index 
            Umbraco.Core.Constants.UmbracoIndexes.ExternalIndexName, 
            // Ensure fields with this prefix are retained
            new[] { "description", "title" },
            // optional: only keep data for these content types, else keep all
            new[] { "home" });
    }

    public void Terminate() => _indexFieldFilter.Dispose();
}

/// <summary>
/// Used to filter out fields from an index
/// </summary>
public class IndexFieldFilter : IDisposable
{
    private readonly IExamineManager _examineManager;
    private readonly IUmbracoTreeSearcherFields _umbracoTreeSearcherFields;
    private ConcurrentDictionary<string, (string[] internalFields, string[] fieldPrefixes, string[] contentTypes)> _fieldNames
        = new ConcurrentDictionary<string, (string[], string[], string[])>();
    private bool disposedValue;

    /// <summary>
    /// Constructor
    /// </summary>
    /// <param name="examineManager"></param>
    /// <param name="umbracoTreeSearcherFields"></param>
    public IndexFieldFilter(
        IExamineManager examineManager,
        IUmbracoTreeSearcherFields umbracoTreeSearcherFields)
    {
        _examineManager = examineManager;
        _umbracoTreeSearcherFields = umbracoTreeSearcherFields;
    }

    /// <summary>
    /// Apply a filter to the specified index
    /// </summary>
    /// <param name="indexName"></param>
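    /// <param name="includefieldNamePrefixes">
    /// Retain all fields prefixed with these names
    /// </param>
    /// <param name="includeContentTypes">
    /// Optional: only index items of these content types, else index all
    /// </param>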
    public void ApplyFilter(
        string indexName,
        string[] includefieldNamePrefixes,
        string[] includeContentTypes = null)
    {
        if (_examineManager.TryGetIndex(indexName, out var e)
            && e is BaseIndexProvider index)
        {
            // gather all internal index names used by Umbraco 
            // to ensure they are retained
            var internalFields = new[]
                {
                LuceneIndex.CategoryFieldName,
                LuceneIndex.ItemIdFieldName,
                LuceneIndex.ItemTypeFieldName,
                UmbracoExamineIndex.IconFieldName,
                UmbracoExamineIndex.IndexPathFieldName,
                UmbracoExamineIndex.NodeKeyFieldName,
                UmbracoExamineIndex.PublishedFieldName,
                UmbracoExamineIndex.UmbracoFileFieldName,
                "nodeName"
            }
                .Union(_umbracoTreeSearcherFields.GetBackOfficeFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeDocumentFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeMediaFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeMembersFields())
                .ToArray();

            _fieldNames.TryAdd(indexName, (internalFields, includefieldNamePrefixes, includeContentTypes ?? Array.Empty<string>()));

            // Bind to the event to filter the fields
            index.TransformingIndexValues += TransformingIndexValues;
        }
        else
        {
            throw new InvalidOperationException(
                $"No index with name {indexName} found that is of type {typeof(BaseIndexProvider)}");
        }
    }

    private void TransformingIndexValues(object sender, IndexingItemEventArgs e)
    {
        if (_fieldNames.TryGetValue(e.Index.Name, out var fields))
        {
            // check if we should ignore this doc by content type
            if (fields.contentTypes.Length > 0 && !fields.contentTypes.Contains(e.ValueSet.ItemType))
            {
                e.Cancel = true;
            }
            else
            {
                // filter the fields
                e.ValueSet.Values.RemoveAll(x =>
                {
                    if (fields.internalFields.Contains(x.Key)) return false;
                    if (fields.fieldPrefixes.Any(f => x.Key.StartsWith(f))) return false;
                    return true;
                });
            }
        }
    }

    protected virtual void Dispose(bool disposing)
    {
        if (!disposedValue)
        {
            if (disposing)
            {
                // Unbind from the event for any bound indexes
                foreach (var keys in _fieldNames.Keys)
                {
                    if (_examineManager.TryGetIndex(keys, out var e)
                        && e is BaseIndexProvider index)
                    {
                        index.TransformingIndexValues -= TransformingIndexValues;
                    }
                }
            }
            disposedValue = true;
        }
    }

    public void Dispose()
    {
        Dispose(disposing: true);
        GC.SuppressFinalize(this);
    }
}

That should give you the tools you need to dynamically filter your index based on fields and content types if you need to get your field counts down. This would also be handy even if you aren’t using ExamineX and Azure Search since keeping an index size down and storing less data means fewer IO operations and less storage.
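
If you want to try the keep/remove rule in isolation before wiring it into an index, here is a minimal sketch over a plain dictionary. The FieldFilterSketch class and the field names in the usage note are hypothetical; unlike the event handler above it returns a filtered copy rather than mutating the value set, and it uses an ordinal prefix comparison, which is a choice made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of Umbraco or Examine.
public static class FieldFilterSketch
{
    // Returns only the entries whose key is a required field
    // or starts with one of the allowed prefixes.
    public static Dictionary<string, object> Filter(
        IDictionary<string, object> values,
        ICollection<string> requiredFields,
        ICollection<string> prefixes)
    {
        return values
            .Where(x => requiredFields.Contains(x.Key)
                || prefixes.Any(p => x.Key.StartsWith(p, StringComparison.Ordinal)))
            .ToDictionary(x => x.Key, x => x.Value);
    }
}
```

For example, filtering the keys “id”, “title_en-us”, “title_fr-fr”, “bodyText_en-us” and “__Raw_bodyText” with the required field “id” and the prefix “title” keeps “id” and the two title fields and drops the rest.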

Examine and Azure Blob Storage

Thu, 23 Mar 2023 15:09:45 Z

Quite some time ago - probably close to 2 years - I created an alpha version of an extension library to Examine to allow storing Lucene indexes in Blob Storage called Examine.AzureDirectory. This idea isn’t new at all and in fact there’s been a library to do this for many years called AzureDirectory, but it previously had issues and it wasn’t clear exactly what its limitations are. The Examine.AzureDirectory implementation was built using a lot of the original code of AzureDirectory but has a bunch of fixes (which I contributed back to the project) and different ways of working with the data. Also, since Examine 0.1.90 still worked with Lucene 2.x, this made it compatible with the older Lucene version.

… And 2 years later, I’ve actually released a real version

Why is this needed?

There are a couple of reasons. Firstly, Azure web apps storage runs on a network share and Lucene absolutely does not like its files hosted on a network share; this will bring all sorts of strange performance issues among other things. The way AzureDirectory works is to store the ‘master’ index in Blob Storage and then sync the required Lucene files to the local ‘fast drive’. In Azure web apps there are two drives: the ‘slow drive’ (the network share) and the ‘fast drive’, which is the local server’s temp files on local storage with limited space. By syncing the Lucene files to the local fast drive it means that Lucene is no longer operating over a network share. When writes occur, it writes back to the local fast drive and then pushes those changes back to the master index in Blob Storage. This isn’t the only way to overcome this limitation of Lucene; in fact Examine has shipped a workaround for many years which uses something called SyncDirectory which does more or less the same thing, but instead of storing the master index in Blob Storage, the master index is just stored on the ‘slow drive’. Someone has actually taken this code and made a separate standalone project with this logic called SyncDirectory which is pretty cool!

Load balancing/Scaling

There are a couple of ways to work around the network share storage in Azure web apps (as above), but in my opinion the main reason why this is important is for load balancing and being able to scale out. Since Lucene doesn’t work well over a network share, it means that Lucene files must exist local to the process it’s running in. That means that when you are load balancing or scaling out, each server that is handling requests will have its own local Lucene index. So what happens when you scale out further and another new worker goes online? This really depends on the hosting application… for example in Umbraco, this would mean that the new worker will create its own local indexes by rebuilding the indexes from the source data (i.e. database). This isn’t an ideal scenario, especially in Umbraco v7 where requests won’t be served until the index is built and ready. A better scenario is that the new worker comes online and then syncs an existing index from master storage that is shared between all workers …. yes! like Blob Storage.

Read/Write vs Read only

Lucene can’t be written to concurrently by multiple processes. There are some workarounds here and there to try to achieve this by synchronizing processes with named mutex/semaphore locks and even AzureSearch tries to handle some of this by utilizing Blob Storage leases, but it’s not a seamless experience. This is one of the reasons why Umbraco requires a ‘master’ web app for writing and a separate web app for scaling, which guarantees that only one process writes to the indexes. This is the setup that Examine.AzureDirectory supports too, and on the front-end/replica/slave web app that scales you will configure the provider to be readonly, which guarantees it will never try to write back to the (probably locked) Blob Storage.

With this in place, when a new front-end worker goes online it doesn’t need to rebuild its own local indexes; it will just check if indexes exist, and to do that it will make sure the master index is there and then continue booting. At this stage there’s actually almost no performance overhead. Nothing actually happens with the local indexes until the index is referenced by this worker, and when that happens Examine will lazily sync just the Lucene files that it needs locally.

How do I get it?

First thing to point out is that this first release is only for Examine 0.1.90 which is for Umbraco v7. Support for Examine 1.x and Umbraco 8.x will come out very soon with some slightly different install instructions.

The release notes of this are here, the install docs are here, and the Nuget package for this can be found here.

PM> Install-Package Examine.AzureDirectory -Version 0.1.90

To activate it, you need to add these settings to your web.config:

<add key="examine:AzureStorageConnString" value="YOUR-STORAGE-CONNECTION-STRING" />
<add key="examine:AzureStorageContainer" value="YOUR-CONTAINER-NAME" />

Then for your master server/web app you’ll want to add a directoryFactory attribute to each of your indexers in ExamineSettings.config, for example:

<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
      supportUnpublished="true"
      supportProtected="true"
      directoryFactory="Examine.AzureDirectory.AzureDirectoryFactory, Examine.AzureDirectory"
      analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

For your front-end/replica/slave server you’ll want a different, readonly value for the directoryFactory, like:

<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
      supportUnpublished="true"
      supportProtected="true"
      directoryFactory="Examine.AzureDirectory.ReadOnlyAzureDirectoryFactory, Examine.AzureDirectory"
      analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

Does it work?

Great question :) With the testing that I’ve done it works, and I’ve had this running on this site for all of last year without issue, but I haven’t rigorously tested this at scale with high traffic sites, etc… I’ve decided to release a real version of this because having this as an alpha/proof of concept means that nobody will test or use it. So now hopefully a few of you will give this a whirl and let everyone know how it goes. Any bugs can be submitted to the Examine repo.

How to lazily set the multi-term rewrite method on queries in Lucene

Thu, 23 Mar 2023 15:08:40 Z

If you want the results of wildcard queries in Lucene ordered by Score, there’s a trick that you need to apply, otherwise all of your scores will come back the same. The reason is that wildcard queries use CONSTANT_SCORE_AUTO_REWRITE_DEFAULT by default which, as the name describes, is going to give a constant score. The code comments describe why this is the default:

a) Runs faster
b) Does not have the scarcity of terms unduly influence score
c) Avoids any "TooManyBooleanClauses" exceptions

Without fully understanding Lucene that doesn’t really mean a whole lot, but the Lucene docs give a little more info:

NOTE: if setRewriteMethod(org.apache.lucene.search.MultiTermQuery.RewriteMethod) is either CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE or SCORING_BOOLEAN_QUERY_REWRITE, you may encounter a BooleanQuery.TooManyClauses exception during searching, which happens when the number of terms to be searched exceeds BooleanQuery.getMaxClauseCount(). Setting setRewriteMethod(org.apache.lucene.search.MultiTermQuery.RewriteMethod) to CONSTANT_SCORE_FILTER_REWRITE prevents this.
The recommended rewrite method is CONSTANT_SCORE_AUTO_REWRITE_DEFAULT: it doesn't spend CPU computing unhelpful scores, and it tries to pick the most performant rewrite method given the query. If you need scoring (like FuzzyQuery), use MultiTermQuery.TopTermsScoringBooleanQueryRewrite which uses a priority queue to only collect competitive terms and not hit this limitation. Note that org.apache.lucene.queryparser.classic.QueryParser produces MultiTermQueries using CONSTANT_SCORE_AUTO_REWRITE_DEFAULT by default.

So the gist is, unless you are ordering by Score this shouldn’t be changed because it will consume more CPU and depending on how many terms you are querying against you might get an exception (though I think that is rare).

So how do you change the default?

That’s super easy, it’s just this line of code:

QueryParser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

But there’s a catch! You must set this flag before you parse any queries with the query parser otherwise it won’t work. All this really does is instruct the query parser to apply this scoring method to any MultiTermQuery or FuzzyQuery implementations it creates. So what if you don’t know if this change should be made before you use the query parser? One scenario might be: At the time of using the query parser, you are unsure if the user constructing the query is going to be sorting by score. In this case you want to change the scoring mechanism just before executing the search but after creating your query.

Setting the value lazily

The good news is that you can set this value lazily just before you execute the search, even after you’ve used the query parser to create parts of your query. There’s only 1 class type that has this API and that we need to check for: MultiTermQuery. However, not all implementations of it support rewriting, so we have to check for that too. Given an instance of a Query, we can recursively update every query contained within it and manually apply the rewrite method like:

protected void SetScoringBooleanQueryRewriteMethod(Query query)
{
	if (query is MultiTermQuery mtq)
	{
		try
		{
			mtq.SetRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
		}
		catch (NotSupportedException)
		{
			//swallow this, some implementations of MultiTermQuery don't support this like FuzzyQuery
		}
	}
	if (query is BooleanQuery bq)
	{
		foreach (BooleanClause clause in bq.Clauses())
		{
			var q = clause.GetQuery();
			//recurse
			SetScoringBooleanQueryRewriteMethod(q);
		}
	}
}

So you can call this method just before you execute your search and it will still work without having to eagerly use QueryParser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE); before you use the query parser methods.

Happy searching!

Paging with Examine

Thu, 23 Mar 2023 15:08:13 Z

Paging with Lucene and Examine requires some specific API usage. It's very easy to get wrong by using Linq's Skip/Take methods. When you do this you'll inadvertently end up loading all search results from Lucene and then filtering them in memory, when what you really want is for Lucene to only create the minimal search result objects that you are interested in.

There are 2 important parts to this:

The Skip method on the ISearchResults object
The Search overload on the BaseSearchProvider where you can specify maxResults

ISearchResults.Skip

This is very different from the Linq Skip method so you need to be sure you are using the Skip method on the ISearchResults object. This tells Lucene to skip over a specific number of results without allocating the result objects. If you use Linq’s Skip method on the underlying IEnumerable<SearchResult> of ISearchResults, this will allocate all of the result objects and then filter them in memory which is what you don’t want to do.
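Which Skip you end up calling depends on the compile-time type of the variable: C# prefers an instance method over an extension method, so the instance Skip is only chosen while the variable is still typed as the results interface. The sketch below shows this with stand-in types only. IFakeSearchResults and FakeSearchResults are hypothetical and are not Examine types, and the allocation counting is a simplified model of what a real result enumerator does.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for ISearchResults: a sequence that also exposes
// its own instance Skip method.
public interface IFakeSearchResults : IEnumerable<int>
{
    IEnumerable<int> Skip(int skip);
}

public class FakeSearchResults : IFakeSearchResults
{
    private readonly int _total;

    // Counts how many "result objects" were created.
    public int Allocated { get; private set; }

    public FakeSearchResults(int total) { _total = total; }

    // A result is only created when it is enumerated.
    private IEnumerable<int> Create(int start)
    {
        for (var i = start; i < _total; i++)
        {
            Allocated++;
            yield return i;
        }
    }

    // Instance Skip: starts at the offset, nothing before it is created.
    public IEnumerable<int> Skip(int skip) => Create(skip);

    public IEnumerator<int> GetEnumerator() => Create(0).GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

public static class SkipDemo
{
    // Typed as the results interface: the instance Skip is chosen.
    public static int AllocatedViaInstanceSkip()
    {
        var results = new FakeSearchResults(100);
        IFakeSearchResults typed = results;
        var page = typed.Skip(30).Take(10).ToList();
        return results.Allocated;
    }

    // Typed as IEnumerable<int>: Linq's Skip is chosen and has to enumerate
    // (and therefore create) the skipped items before reaching the page.
    public static int AllocatedViaLinqSkip()
    {
        var results = new FakeSearchResults(100);
        IEnumerable<int> typed = results;
        var page = typed.Skip(30).Take(10).ToList();
        return results.Allocated;
    }
}
```

In this model the instance Skip creates only the 10 items of the page, while the Linq version also creates the 30 skipped ones. The practical takeaway is to call Skip directly on the ISearchResults variable before applying any Linq operators or casting it to IEnumerable<SearchResult>.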

Search with maxResults

Lucene isn’t perfect for paging because it doesn’t natively support the Linq equivalent of “Skip/Take”. It understands Skip (as above) but doesn’t understand Take. Instead, it only knows how to limit the max results so that it doesn’t allocate every result, most of which you would probably not need when paging.

With the combination of ISearchResults.Skip and maxResults, we can tell Lucene to:

Skip over a certain number of results without allocating them
Only allocate a certain number of results after skipping

Show me the code

//for example purposes, we want to show page #4 (which is pageIndex of 3)
var pageIndex = 3;   
//for this example, the page size is 10 items
var pageSize = 10;
var searchResult = searchProvider.Search(criteria, 
   //don't return more results than we need for the paging
   //this is the 'trick' - we need to load enough search results to fill
   //all pages from 1 to the current page of 4
   maxResults: pageSize*(pageIndex + 1));
//then we use the Skip method to tell Lucene to not allocate search results
//for the first 3 pages
var pagedResults = searchResult.Skip(pageIndex*pageSize);
var totalResults = searchResult.TotalItemCount;

So that is the correct way to do paging with Examine and Lucene which ensures max performance and minimal object allocations.
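The arithmetic above is easy to get off by one, so it can help to keep it in one small helper. This is only a sketch: ExaminePaging and its method names are hypothetical and not part of Examine, and it assumes a zero-based pageIndex and an int total count.

```csharp
using System;

// Hypothetical helper (not part of Examine) that wraps the paging arithmetic.
public static class ExaminePaging
{
    // maxResults has to cover every page up to and including the requested one.
    public static int MaxResults(int pageIndex, int pageSize)
    {
        Validate(pageIndex, pageSize);
        return pageSize * (pageIndex + 1);
    }

    // Number of results to skip to land on the first item of the requested page.
    public static int SkipCount(int pageIndex, int pageSize)
    {
        Validate(pageIndex, pageSize);
        return pageIndex * pageSize;
    }

    // Total number of pages, rounded up so a partial last page still counts.
    public static int TotalPages(int totalItemCount, int pageSize)
    {
        Validate(0, pageSize);
        if (totalItemCount < 0) throw new ArgumentOutOfRangeException(nameof(totalItemCount));
        return (totalItemCount + pageSize - 1) / pageSize;
    }

    private static void Validate(int pageIndex, int pageSize)
    {
        if (pageIndex < 0) throw new ArgumentOutOfRangeException(nameof(pageIndex));
        if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));
    }
}
```

For page #4 with 10 items per page, MaxResults(3, 10) gives the 40 passed as maxResults and SkipCount(3, 10) gives the 30 passed to Skip. TotalPages can be fed the TotalItemCount of the search results to render a pager.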

Examine 1.5.1 released

Thu, 23 Mar 2023 15:08:08 Z

I’ve created a new release of Examine today, version 1.5.1. There’s nothing really new in this release, just a bunch of bug fixes. The other cool thing is that I’ve finally got Examine on Nuget now. The v1.5.1 release page is here on CodePlex with upgrade instructions… which is really just replacing the DLLs.

It’s important to note that if you have installed Umbraco 6.0.1+ or 4.11.5+ then you already have Examine 1.5.0 installed (which isn’t an official release on the CodePlex page), which has 8 of these 10 bugs fixed already.

Bugs fixed

Here’s the full list of bugs fixed in this release:

UmbracoExamine

You may already know this but we’ve moved the UmbracoExamine libraries into the core of Umbraco so that the Umbraco core team can better support the implementation. That means that only the basic Examine libraries will continue to exist @ examine.codeplex.com. The release of 1.5.1 only relates to the base Examine libraries, not the UmbracoExamine libraries, but that’s ok, you can still upgrade these base libraries without issue.

Nuget

There are 2 Examine projects up on Nuget: the basic Examine package, and the Azure package if you wish to use Azure directory for your indexes.

Standard package:

PM> Install-Package Examine

Azure package:

PM> Install-Package Examine.Azure

Happy searching!

New Examine updates and features for Umbraco

Thu, 23 Mar 2023 15:08:08 Z

It’s been a long while since Examine got some much needed attention and I’m pleased to say it is now happening. If you didn’t know already, we’ve moved the Umbraco Examine source into the core of Umbraco. The underlying Examine (Examine.dll) core will remain on CodePlex but all the Umbraco bits and pieces which are found in UmbracoExamine.dll are in the Umbraco core from version 6.1+. This is great news because now we can all better support the implementation of Examine for Umbraco. More good news is that even versions prior to Umbraco 6.1 will have some bugs fixed (http://issues.umbraco.org/issue/U4-1768)! Niels Kuhnel has also jumped aboard the Examine train and is helping out a ton by adding his amazing ‘facet’ features, which will probably make it into an Umbraco release around version 6.2 (maybe 6.1, but we still need to do some review, etc… to make sure it’s 100% backwards compatible).

One other bit of cool news is that we’re adding an official Examine Management dashboard to Umbraco 6.1. In its present state it supports optimizing indexes, rebuilding indexes and searching them. I’ve created a quick video showing its features :)

Examine management dashboard for Umbraco