Filtering fields dynamically with Examine

Filtering fields dynamically with Examine

The index fields created by Umbraco in Examine by default can lead to quite a substantial amount of fields. This is primarily due in part by how Umbraco handles variant/culture data because it will create a different field per culture but there are other factors as well. Umbraco will create a “__Raw_” field for each rich text field and if you use the grid, it will create different fields for each grid row type. There are good reasons for all of these fields and this allows you by default to have the most flexibility when querying and retrieving your data from the Examine indexes. But in some cases these default fields can be problematic. Examine by default uses Lucene as it’s indexing engine and Lucene itself doesn’t have any hard limits on field count (as far as I know), however if you swap the indexing engine in Examine to something else like Azure Search with ExamineX then you may find your indexes are exceeding Azure Search’s limits.

Azure Search field count limits

Azure Search has varying limits for field counts based on the tier service level you have (strangely the Free tier allows more fields than the Basic tier). The absolute maximum however is 1000 fields and although that might seem like quite a lot when you take into account all of the fields created by Umbraco you might realize it’s not that difficult to exceed this limit. As an example, lets say you have an Umbraco site using language variants and you have 20 languages in use. Then let’s say you have 15 document types each with 5 fields (all with unique aliases) and each field is variant and you have content for each of these document types and languages created. This immediately means you are exceeding the field count limits: 20 x 15 x 10 = 1500 fields! And that’s not including the “__Raw_” fields or the extra grid fields or the required system fields like “id” and “nodeName”. I’m unsure why Azure Search even has this restriction in place

Why is Umbraco creating a field per culture?

When v8 was being developed a choice had to be made about how to handle multi-lingual data in Examine/Lucene. There’s a couple factors to consider with making this decision which mostly boils down to how Lucene’s analyzers work. The choice is either: language per field or language per index. Some folks might think, can’t we ‘just’ have a language per document? Unfortunately the answer is no because that would require you to apply a specific language analyzer for that document and then scoring would no longer work between documents. Elastic Search has a good write up about this. So either language per field or different indexes per language. Each has pros/cons but Umbraco went with language per field since it’s quite easy to setup, supports different analyzers per language and doesn’t require a ton of indexes which also incurs a lot more overhead and configuration.

Do I need all of these fields?

That really depends on what you are searching on but the answer is most likely ‘no’. You probably aren’t going to be searching on over 1000s fields, but who knows every site’s requirements are different. Umbraco Examine has something called an IValueSetValidator which you can configure to include/exclude certain fields or document types. This is synonymous with part of the old XML configuration in Examine. This is one of those things where configuration can make sense for Examine and @callumwhyte has done exactly that with his package “Umbraco Examine Config”. But the IValueSetValidator isn’t all that flexible and works based on exact naming which will work great for filtering content types but perhaps not field names. (Side note – I’m unsure if the Umbraco Examine Config package will work alongside ExamineX, need to test that out).

Since Umbraco creates fields with the same prefixed names for all languages it’s relatively easy to filter the fields based on a matching prefix for the fields you want to keep.

Here’s some code!

The following code is relatively straight forward with inline comments: A custom class “IndexFieldFilter” that does the filtering and can be applied different for any index by name, a Component to apply the filtering, a Composer to register services. This code will also ensure that all Umbraco required fields are retained so anything that Umbraco is reliant upon will still work.

/// <summary>
/// Register services
/// </summary>
public class MyComposer : ComponentComposer<MyComponent>
{
    public override void Compose(Composition composition)
    {
        base.Compose(composition);
        composition.RegisterUnique<IndexFieldFilter>();
    }
}

public class MyComponent : IComponent
{
    private readonly IndexFieldFilter _indexFieldFilter;

    public MyComponent(IndexFieldFilter indexFieldFilter)
    {
        _indexFieldFilter = indexFieldFilter;
    }

    public void Initialize()
    {
        // Apply an index field filter to an index
        _indexFieldFilter.ApplyFilter(
            // Filter the external index 
            Umbraco.Core.Constants.UmbracoIndexes.ExternalIndexName, 
            // Ensure fields with this prefix are retained
            new[] { "description", "title" },
            // optional: only keep data for these content types, else keep all
            new[] { "home" });
    }

    public void Terminate() => _indexFieldFilter.Dispose();
}

/// <summary>
/// Used to filter out fields from an index
/// </summary>
public class IndexFieldFilter : IDisposable
{
    private readonly IExamineManager _examineManager;
    private readonly IUmbracoTreeSearcherFields _umbracoTreeSearcherFields;
    private ConcurrentDictionary<string, (string[] internalFields, string[] fieldPrefixes, string[] contentTypes)> _fieldNames
        = new ConcurrentDictionary<string, (string[], string[], string[])>();
    private bool disposedValue;

    /// <summary>
    /// Constructor
    /// </summary>
    /// <param name="examineManager"></param>
    /// <param name="umbracoTreeSearcherFields"></param>
    public IndexFieldFilter(
        IExamineManager examineManager,
        IUmbracoTreeSearcherFields umbracoTreeSearcherFields)
    {
        _examineManager = examineManager;
        _umbracoTreeSearcherFields = umbracoTreeSearcherFields;
    }

    /// <summary>
    /// Apply a filter to the specified index
    /// </summary>
    /// <param name="indexName"></param>
    /// <param name="includefieldNamePrefixes">
    /// Retain all fields prefixed with these names
    /// </param>
    public void ApplyFilter(
        string indexName,
        string[] includefieldNamePrefixes,
        string[] includeContentTypes = null)
    {
        if (_examineManager.TryGetIndex(indexName, out var e)
            && e is BaseIndexProvider index)
        {
            // gather all internal index names used by Umbraco 
            // to ensure they are retained
            var internalFields = new[]
                {
                LuceneIndex.CategoryFieldName,
                LuceneIndex.ItemIdFieldName,
                LuceneIndex.ItemTypeFieldName,
                UmbracoExamineIndex.IconFieldName,
                UmbracoExamineIndex.IndexPathFieldName,
                UmbracoExamineIndex.NodeKeyFieldName,
                UmbracoExamineIndex.PublishedFieldName,
                UmbracoExamineIndex.UmbracoFileFieldName,
                "nodeName"
            }
                .Union(_umbracoTreeSearcherFields.GetBackOfficeFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeDocumentFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeMediaFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeMembersFields())
                .ToArray();

            _fieldNames.TryAdd(indexName, (internalFields, includefieldNamePrefixes, includeContentTypes ?? Array.Empty<string>()));

            // Bind to the event to filter the fields
            index.TransformingIndexValues += TransformingIndexValues;
        }
        else
        {
            throw new InvalidOperationException(
                $"No index with name {indexName} found that is of type {typeof(BaseIndexProvider)}");
        }
    }

    private void TransformingIndexValues(object sender, IndexingItemEventArgs e)
    {
        if (_fieldNames.TryGetValue(e.Index.Name, out var fields))
        {
            // check if we should ignore this doc by content type
            if (fields.contentTypes.Length > 0 && !fields.contentTypes.Contains(e.ValueSet.ItemType))
            {
                e.Cancel = true;
            }
            else
            {
                // filter the fields
                e.ValueSet.Values.RemoveAll(x =>
                {
                    if (fields.internalFields.Contains(x.Key)) return false;
                    if (fields.fieldPrefixes.Any(f => x.Key.StartsWith(f))) return false;
                    return true;
                });
            }
        }
    }

    protected virtual void Dispose(bool disposing)
    {
        if (!disposedValue)
        {
            if (disposing)
            {
                // Unbind from the event for any bound indexes
                foreach (var keys in _fieldNames.Keys)
                {
                    if (_examineManager.TryGetIndex(keys, out var e)
                        && e is BaseIndexProvider index)
                    {
                        index.TransformingIndexValues -= TransformingIndexValues;
                    }
                }
            }
            disposedValue = true;
        }
    }

    public void Dispose()
    {
        Dispose(disposing: true);
        GC.SuppressFinalize(this);
    }
}

That should give you the tools you need to dynamically filter your index based on fields and content type’s if you need to get your field counts down. This would also be handy even if you aren’t using ExamineX and Azure Search since keeping an index size down and storing less data means less IO operations and storage size.

Author

Shannon Thompson

I'm a Senior Software Engineer working full time at Microsoft. Previously, I was working at Umbraco HQ for about 10 years. I maintain several open source projects (many related to Umbraco) such as Articulate, Examine and Smidge, and I also have a commercial software offering called ExamineX. Welcome to my blog :)

comments powered by Disqus