Shazwazza

An Examine fix for Umbraco index corruption

Wed, 31 Jul 2024 17:49:54 Z

A new Examine version 3.3.0 has been released to address a long awaited bug fix for Umbraco websites that use the SyncedFileSystemDirectoryFactory which is the default setting for Umbraco CMS.

The bug typically means that indexes cannot be used and log entries such as Lucene.Net.Index.CorruptIndexException: invalid deletion count: 2 vs docCount=1 are present.

Understanding the problem:

The SyncedFileSystemDirectoryFactory directory is there to avoid performance implications of rebuilding indexes on startup when a site is moved to another worker in Azure. The reason an index rebuild would occur is because in Azure, Lucene files need to work off of the local 'fast drive' (C:\%temp%), not the default/shared network 'slow drive' (D:), and whenever a site is moved, or spawned on a new worker in Azure, the local 'fast drive' is empty, meaning no indexes exist. The SyncedFileSystemDirectoryFactory attempts to work around this challenge by continually synchronizing a copy of the indexes from the 'fast drive' to the 'slow drive' so that when a site is moved to another worker, it can sync (restore) from the 'slow drive' back to the 'fast drive' in order to avoid the index rebuild overhead.

The problem with SyncedFileSystemDirectoryFactory is that this implementation doesn't take into account what happens if the index files in your main storage ('slow drive') become corrupted which can happen for a number of reasons - misconfiguration, network latency, process termination, etc...

Understanding the solution:

The SyncedFileSystemDirectoryFactory has been updated to:

Check the health of the main index if it exists ('slow drive').
Check the health of the local index if it exists ('fast drive').
If the main index is unhealthy or doesn't exist and the local index is healthy, it will synchronize the local index to the main index. This can occur only if a site hasn't moved to a new worker.
If the main index is unhealthy and the local index doesn't exist or is unhealthy, then it will delete the main (corrupted) index.
Once health checks are done, the index from main is always synced to local. If the main index was deleted due to corruption, this will mean that the local index is empty and an index rebuild will occur.

This change will attempt to keep any healthy index that is available (main vs local), but if nothing can be read, the indexes will be deleted and an index rebuild will occur.

There's also a new option to fix a corrupted index but this is not enabled by default since it can mean a loss of documents.

Understanding the rebuilding overhead

The performance overhead of index rebuilding is due to the Umbraco database queries that need to be executed in order to populate the indexes. The only reason SyncedFileSystemDirectoryFactory exists is to prevent this overhead when hosting in Azure App Service (which is what Umbraco Cloud uses), and it can only be used on your Umbraco primary node. It does not prevent index rebuilding overhead for non-primary nodes when load balancing or scaling out because the main network 'slow drive' is shared between all workers and an index can only be read/written to be a single process.

This means that it's only useful if you are hosting in Azure App Service without any load balancing while keeping in mind that it does not always prevent index rebuilds (see above).

The index rebuilding overhead can be dramatic when load balancing or scaling out, for example: If you scale out to +5 nodes in a load balancing setup, that means that 5x nodes will be performing index rebuilds around the same time, this means that your DB is going to get hammered by queries to build all of those new indexes. The performance hit isn't the index building - it is the DB queries and this can lead to DB locks and lead to the dreaded SQL Lock Timeout issue in the Umbraco back office. Plus, if search is critical to your front-end, than for a while after your site has started up, there won't be any index which means there won't be any search until the background processing is done.

... Many of these reasons is why ExamineX was created:

Reliably and centrally persisted indexes means nothing is out of sync between nodes.
No index rebuilding when your site is moved or scaled = no rebuilding overhead.
Prevents SQL Timeout locks due to DB rebuilding queries.
Very easy to setup and seamlessly changes your Lucene based indexes to Azure/Elastic search indexes.
Ideal when hosting Umbraco on Azure Web Apps (or Umbraco Cloud) and a perfect solution for load balancing and scaling.
Automatically index Umbraco media file content without the need for additional indexes with support for PDFs, Microsoft Office documents and more.
(coming soon) Automatically generate Umbraco media image descriptions, tags, locations, and more using AI allowing your editors to quickly find the media/images they need.

Configuring a Suggester with ExamineX and Azure AI Search

Wed, 01 May 2024 14:51:52 Z

We recently had a customer ask about integrating the Azure AI Search suggester API with ExamineX and this turns out to be quite straight forward :)

First thing is to ensure that the fields you want to use the Suggester for are configured correctly:

cs // Configuration for the External Index for the City and Country fields. // The Azure AI Search Suggester requires fields to be 'string fields only' // and using the Standard Analyzer (default on the External Index) services.Configure<AzureSearchIndexOptions>( Constants.UmbracoIndexes.ExternalIndexName, options => { options.FieldDefinitions = new FieldDefinitionCollection( new FieldDefinition("City", AzureSearchFieldDefinitionTypes.FullText), new FieldDefinition("Country", AzureSearchFieldDefinitionTypes.FulText)); }); Next up, is to create the Suggester which is done as part of the CreatingOrUpdatingIndex event:

```cs externalIndex.CreatingOrUpdatingIndex += AzureIndexCreatingOrUpdatingIndexScoringProfile;

private void AzureIndexCreatingOrUpdatingIndexSuggester(object sender, CreatingOrUpdatingIndexEventArgs e) { switch (e.EventType) { case IndexModifiedEventType.Creating: case IndexModifiedEventType.Rebuilding: var index = e.AzureSearchIndexDefinition;

        var suggester = new SearchSuggester(
            "sg",
            new[] { "Country", "City" });

        index.Suggesters.Add(suggester);

        break;
}

} ``` When data has been indexed for these fields, using the Suggester API is simple. For example, lets assume that the following ValueSets were indexed:

cs externalIndex.IndexItems(new[] { ValueSet.FromObject(id1, new { City = "Calgary", Country = "Canada" }), ValueSet.FromObject(id2, new { City = "Sydney", Country = "Australia" }), ValueSet.FromObject(id3, new { City = "Copenhagen", Country = "Denmark" }), ValueSet.FromObject(id4, new { City = "Vienna", Country = "Austria" }), }); Then to get suggestions for "Au":

cs // cast to AzureSearchIndex to expose underlying Azure Search APIs var azureSearchIndex = (AzureSearchIndex)externalIndex; var result = await azureSearchIndex.IndexClient.SuggestAsync<ExamineDocument>( "Au", "sg");

The result will contain 2x entries, one for the Australia and one for the Austria documents. Super easy!

One thing to watch out for is if your index is not configured to use the Standard Analyzer by default (i.e. like the Internal Index in Umbraco). In that case, you'll need to define a custom field type with that indexer:

```cs services.Configure( Constants.UmbracoIndexes.ExternalIndexName, options => { // This adds a custom field type that uses the Standard Analyzer options.IndexValueTypesFactory = new Dictionary { ["standard"] = new AzureSearchFieldValueTypeFactory(s => new AzureSearchFieldValueType(s, SearchFieldDataType.String, LexicalAnalyzerName.StandardLucene.ToString())) };

    // Set these field types to the one defined above
    options.FieldDefinitions = new FieldDefinitionCollection(
        new FieldDefinition("City", "standard"),
        new FieldDefinition("Country", "standard"));
});

``` Happy Searching!

Can I disable Examine indexes on Umbraco front-end servers?

Thu, 23 Mar 2023 15:10:30 Z

In Umbraco v8, Examine and Lucene are only used for the back office searches, unless you specifically use those APIs for your front-end pages. I recently had a request to know if it’s possible to disable Examine/Lucene for front-end servers since they didn’t use Examine/Lucene APIs at all on their front-end pages… here’s the answer

Why would you want this?

If you are running a Load Balancing setup in Azure App Service then you have the option to scale out (and perhaps you do!). In this case, you need to have the Examine configuration option of:

<add key="Umbraco.Examine.LuceneDirectoryFactory" 
          value="Examine.LuceneEngine.Directories.TempEnvDirectoryFactory, Examine" />

This is because each scaled out worker is running from the same network share file system. Without this setting (or with the SyncTempEnvDirectoryFactory setting) it would mean that each worker will be trying to write Lucene file based indexes to the same location which will result in corrupt indexes and locked files. Using the TempEnvDirectoryFactory means that the indexes will only be stored on the worker's local 'fast drive' which is in it's %temp% folder on the local (non-network share) hard disk.

When a site is moved or provisioned on a new worker the local %temp% location will be empty so Lucene indexes will be rebuilt on startup for that worker. This will occur when Azure moves a site or when a new worker comes online from a scale out action. When indexes are rebuilt, the worker will query the database for the data and depending on how much data you have in your Umbraco installation, this could take a few minutes which can be problematic. Why? Because Umbraco v8 uses distributed SQL locks to ensure data integrity and during these queries a content lock will be created which means other back office operations on content will need to wait. This can end up with SQL Lock timeout issues. An important thing to realize is that these rebuild queries will occur for all new workers, so if you scaled out from 1 to 10, that is 9 new workers coming online at the same time.

How to avoid this problem?

If you use Examine APIs on your front-end, then you cannot just disable Examine/Lucene so the only reasonable solution is to use an Examine implementation that uses a hosted search service like ExamineX

If you don't use Examine APIs on your front-ends then it is a reasonable solution to disable Examine/Lucene on the front-ends to avoid this issue. To do that, you would change the default Umbraco indexes to use an in-memory only store and prohibit data from being put into the indexes. Then disable the queries that execute when Umbraco tries to re-populate the indexes.

Show me the code

First thing is to replace the default index factory. This new one will change the underlying Lucene directory for each index to be a RAMDirectory and will also disable the default Umbraco event handling that populates the indexes. This means Umbraco will not try to update the index based on content, media or member changes.

public class InMemoryExamineIndexFactory : UmbracoIndexesCreator
{
    public InMemoryExamineIndexFactory(
        IProfilingLogger profilingLogger,
        ILocalizationService languageService,
        IPublicAccessService publicAccessService,
        IMemberService memberService,
        IUmbracoIndexConfig umbracoIndexConfig)
        : base(profilingLogger, languageService, publicAccessService, memberService, umbracoIndexConfig)
    {
    }

    public override IEnumerable<IIndex> Create()
    {
        return new[]
        {
            CreateInternalIndex(),
            CreateExternalIndex(),
            CreateMemberIndex()
        };
    }

    // all of the below is the same as Umbraco defaults, except
    // we are using an in-memory Lucene directory.

    private IIndex CreateInternalIndex()
        => new UmbracoContentIndex(
            Constants.UmbracoIndexes.InternalIndexName,
            new RandomIdRamDirectory(), // in-memory dir
            new UmbracoFieldDefinitionCollection(),
            new CultureInvariantWhitespaceAnalyzer(),
            ProfilingLogger,
            LanguageService,
            UmbracoIndexConfig.GetContentValueSetValidator())
        {
            EnableDefaultEventHandler = false
        };

    private IIndex CreateExternalIndex()
        => new UmbracoContentIndex(
            Constants.UmbracoIndexes.ExternalIndexName,
            new RandomIdRamDirectory(), // in-memory dir
            new UmbracoFieldDefinitionCollection(),
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
            ProfilingLogger,
            LanguageService,
            UmbracoIndexConfig.GetPublishedContentValueSetValidator())
        {
            EnableDefaultEventHandler = false
        };

    private IIndex CreateMemberIndex()
        => new UmbracoMemberIndex(
            Constants.UmbracoIndexes.MembersIndexName,
            new UmbracoFieldDefinitionCollection(),
            new RandomIdRamDirectory(), // in-memory dir
            new CultureInvariantWhitespaceAnalyzer(),
            ProfilingLogger,
            UmbracoIndexConfig.GetMemberValueSetValidator())
        {
            EnableDefaultEventHandler = false
        };

    // required so that each ram dir has a different ID
    private class RandomIdRamDirectory : RAMDirectory
    {
        private readonly string _lockId = Guid.NewGuid().ToString();
        public override string GetLockId()
        {
            return _lockId;
        }
    }
}

The next thing to do is to create no-op index populators to replace the Umbraco default ones. All these do is ensure they are not associated with any index and then just to be sure, does not execute any logic for population.

public class DisabledMemberIndexPopulator : MemberIndexPopulator
{
    public DisabledMemberIndexPopulator(
        IMemberService memberService,
        IValueSetBuilder<IMember> valueSetBuilder)
        : base(memberService, valueSetBuilder)
    {
    }

    public override bool IsRegistered(IIndex index) => false;
    public override bool IsRegistered(IUmbracoMemberIndex index) => false;
    protected override void PopulateIndexes(IReadOnlyList<IIndex> indexes) { }
}

public class DisabledContentIndexPopulator : ContentIndexPopulator
{
    public DisabledContentIndexPopulator(
        IContentService contentService,
        ISqlContext sqlContext,
        IContentValueSetBuilder contentValueSetBuilder)
        : base(contentService, sqlContext, contentValueSetBuilder)
    {
    }

    public override bool IsRegistered(IIndex index) => false;
    public override bool IsRegistered(IUmbracoContentIndex2 index) => false;
    protected override void PopulateIndexes(IReadOnlyList<IIndex> indexes) { }
}

public class DisabledPublishedContentIndexPopulator : PublishedContentIndexPopulator
{
    public DisabledPublishedContentIndexPopulator(
        IContentService contentService,
        ISqlContext sqlContext,
        IPublishedContentValueSetBuilder contentValueSetBuilder)
        : base(contentService, sqlContext, contentValueSetBuilder)
    {
    }

    public override bool IsRegistered(IIndex index) => false;
    public override bool IsRegistered(IUmbracoContentIndex2 index) => false;
    protected override void PopulateIndexes(IReadOnlyList<IIndex> indexes) { }
}

public class DisabledMediaIndexPopulator : MediaIndexPopulator
{
    public DisabledMediaIndexPopulator(
        IMediaService mediaService,
        IValueSetBuilder<IMedia> mediaValueSetBuilder) : base(mediaService, mediaValueSetBuilder)
    {
    }

    public override bool IsRegistered(IIndex index) => false;
    public override bool IsRegistered(IUmbracoContentIndex index) => false;
    protected override void PopulateIndexes(IReadOnlyList<IIndex> indexes) { }
}

Lastly, we just need to enable these services:

public class DisabledExamineComposer : IUserComposer
{
    public void Compose(Composition composition)
    {
        // replace the default
        composition.RegisterUnique<IUmbracoIndexesCreator, InMemoryExamineIndexFactory>();

        // replace the default populators
        composition.Register<MemberIndexPopulator, DisabledMemberIndexPopulator>(Lifetime.Singleton);
        composition.Register<ContentIndexPopulator, DisabledContentIndexPopulator>(Lifetime.Singleton);
        composition.Register<PublishedContentIndexPopulator, DisabledPublishedContentIndexPopulator>(Lifetime.Singleton);
        composition.Register<MediaIndexPopulator, DisabledMediaIndexPopulator>(Lifetime.Singleton);
    }
}

With that all in place, it means that no data will ever be looked up to rebuild indexes and Umbraco will not send data to be indexed. There is nothing here preventing data from being indexed though. For example, if you use the Examine APIs to update the index directly, that data will be indexed in memory. If you wanted to absolutely make sure no data ever went into the index, you would have to override some methods on the RAMDirectory.

Can I run Examine with RAMDirectory with data?

You might have realized with the above that if you don't replace the populators, you will essentially have Examine indexes in Umbraco running from RAMDirectory. Is this ok? Yes absolutely, but that entirely depends on your data set. If you have a large index, that means it will consume large amounts of memory which is typically not a good idea. But if you have a small data set, or you filter the index so that it remains small enough, then yes! You can certainly run Examine with an in-memory directory but this would still only be advised on your front-end/replica servers.

Spatial Search with Examine and Lucene

Thu, 23 Mar 2023 15:10:06 Z

I was asked about how to do Spatial search with Examine recently which sparked my interest on how that should be done so here’s how it goes…

Examine’s default implementation is Lucene so by default whatever you can do in Lucene you can achieve in Examine by exposing the underlying Lucene bits. If you want to jump straight to code, I’ve created a couple of unit tests in the Examine project.

Source code as documentation

Lucene.Net and Lucene (Java) are more or less the same. There’s a few API and naming conventions differences but at the end of the day Lucene.Net is just a .NET port of Lucene. So pretty much any of the documentation you’ll find for Lucene will work with Lucene.Net just with a bit of tweaking. Same goes for code snippets in the source code and Lucene and Lucene.Net have tons of examples of how to do things. In fact for Spatial search there’s a specific test example for that.

So we ‘just’ need to take that example and go with it.

Firstly we’ll need the Lucene.Net.Contrib package:

Install-Package Lucene.Net.Contrib -Version 3.0.3

Indexing

The indexing part doesn't really need to do anything out of the ordinary from what you would normally do. You just need to get either latitude/longitude or x/y (numerical) values into your index. This can be done directly using a ValueSet when you index and having your field types set as numeric or it could be done with the DocumentWriting event which gives you direct access to the underlying Lucene document.

Strategies

For this example I’m just going to stick with simple Geo Spatial searching with simple x/y coordinates. There’s different “stategies” and you can configure these to handle different types of spatial search when it’s not just as simple as an x/y distance calculation. I was shown an example of a Spatial search that used the “PointVectorStrategy” but after looking into that it seems like this is a semi deprecated strategy and even one of it’s methods says: “//TODO this is basically old code that hasn't been verified well and should probably be removed” and then I found an SO article stating that “RecursivePrefixTreeStrategy” was what should be used instead anyways and as it turns out that’s exactly what the java example uses too.

If you need some more advanced Spatial searching then I’d suggest researching some of the strategies available, reading the docs and looking at the source examples. There’s unit tests for pretty much everything in Lucene and Lucene.Net.

Get the underlying Lucene Searcher instance

If you need to do some interesting Lucene things with Examine you need to gain access to the underlying Lucene bits. Namely you’d normally only need access to the IndexWriter which you can get from LuceneIndex.GetIndexWriter() and the Lucene Searcher which you can get from LuceneSearcher.GetSearcher().

// Get an index from the IExamineManager
if (!examineMgr.TryGetIndex("MyIndex", out var index))
    throw new InvalidOperationException("No index found with name MyIndex");
            
// We are expecting this to be a LuceneIndex
if (!(index is LuceneIndex luceneIndex))
    throw new InvalidOperationException("Index MyIndex is not a LuceneIndex");

// If you wanted a LuceneWriter, here's how:
//var luceneWriter = luceneIndex.GetIndexWriter();

// Need to cast in order to expose the Lucene bits
var searcher = (LuceneSearcher)luceneIndex.GetSearcher();

// Get the underlying Lucene Searcher instance
var luceneSearcher = searcher.GetLuceneSearcher();

Do the search

Important! Latitude/Longitude != X/Y

The Lucene GEO Spatial APIs take an X/Y coordinates, not latitude/longitude and a common mistake is to just use them in place but that’s incorrect and they are actually opposite so be sure you swap tham. Latitude = Y, Longitude = X. Here’s a simple function to swap them:

private void GetXYFromCoords(double lat, double lng, out double x, out double y)
{
    // change to x/y coords, longitude = x, latitude = y
    x = lng;
    y = lat;
}

Now that we have the underlying Lucene Searcher instance we can search however we want:

// Create the Geo Spatial lucene objects
SpatialContext ctx = SpatialContext.GEO;
int maxLevels = 11; //results in sub-meter precision for geohash
SpatialPrefixTree grid = new GeohashPrefixTree(ctx, maxLevels);
RecursivePrefixTreeStrategy strategy = new RecursivePrefixTreeStrategy(grid, GeoLocationFieldName);

// lat/lng of Sydney Australia
var latitudeSydney = -33.8688;
var longitudeSydney = 151.2093;
            
// search within 100 KM
var searchRadiusInKm = 100;

// convert to X/Y
GetXYFromCoords(latitudeSydney, longitudeSydney, out var x, out var y);

// Make a circle around the search point
var args = new SpatialArgs(
    SpatialOperation.Intersects,
    ctx.MakeCircle(x, y, DistanceUtils.Dist2Degrees(searchRadiusInKm, DistanceUtils.EARTH_MEAN_RADIUS_KM)));

// Create the Lucene Filter
var filter = strategy.MakeFilter(args);

// Create the Lucene Query
var query = strategy.MakeQuery(args);

// sort on ID
Sort idSort = new Sort(new SortField(LuceneIndex.ItemIdFieldName, SortField.INT));
TopDocs docs = luceneSearcher.Search(query, filter, MaxResultDocs, idSort);

// iterate raw lucene results
foreach(var doc in docs.ScoreDocs)
{
    // TODO: Do something with result
}

Filter vs Query?

The above code creates both a Filter and a Query that is being used to get the results but the SpatialExample just uses a “MatchAllDocsQuery” instead of what is done above. Both return the same results so what is happening with “strategy.MakeQuery”? It’s creating a ConstantScoreQuery which means that the resulting document “Score” will be empty/same for all results. That’s really all this does so it’s optional but really when searching on only locations with no other data Score doesn’t make a ton of sense. It is possible however to mix Spatial search filters with real queries.

Next steps

You’ll see above that the ordering is by Id but probably in a lot of cases you’ll want to sort by distance. There’s examples of this in the Lucene SpatialExample linked above and there’s a reference to that in this SO article too, the only problem is those examples are for later Lucene versions than the current Lucene.Net 3.x. But if there’s a will there’s a way and I’m sure with some Googling, code researching and testing you’ll be able to figure it out :)

The Examine docs pages need a little love and should probably include this info. The docs pages are just built in Jekyll and located in the /docs folder of the Examine repository. I would love any help with Examine’s docs if you’ve got a bit of time :)

As far as Examine goes though, there’s actually custom search method called “LuceneQuery” on the “LuceneSearchQueryBase” which is the object created when creating normal Examine queries with CreateQuery(). Using this method you can pass in a native Lucene Query instance like the one created above and it will manage all of the searching/paging/sorting/results/etc… for you so you don’t have to do some of the above manual work. However there is currently no method allowing a native Lucene Filter instances to be passed in like the one created above. Once that’s in place then some of the lucene APIs above wont be needed and this can be a bit nicer. Then it’s probably worthwhile adding another Nuget project like Examine.Extensions which can contain methods and functionality for this stuff, or maybe the community can do something like that just like Callum has done for Examine Facets. What do you think?

Searching with IPublishedContentQuery in Umbraco

Thu, 23 Mar 2023 15:10:02 Z

I recently realized that I don’t think Umbraco’s APIs on IPublishedContentQuery are documented so hopefully this post may inspire some docs to be written or at least guide some folks on some functionality they may not know about.

A long while back even in Umbraco v7 UmbracoHelper was split into different components and UmbracoHelper just wrapped these. One of these components was called ITypedPublishedContentQuery and in v8 is now called IPublishedContentQuery, and this component is responsible for executing queries for content and media on the front-end in razor templates. In v8 a lot of methods were removed or obsoleted from UmbracoHelper so that it wasn’t one gigantic object and tries to steer developers to use these sub components directly instead. For example if you try to access UmbracoHelper.ContentQuery you’ll see that has been deprecated saying:

Inject and use an instance of IPublishedContentQuery in the constructor for using it in classes or get it from Current.PublishedContentQuery in views

and the UmbracoHelper.Search methods from v7 have been removed and now only exist on IPublishedContentQuery.

There are API docs for IPublishedContentQuery which are a bit helpful, at least will tell you what all available methods and parameters are. The main one’s I wanted to point out are the Search methods.

Strongly typed search responses

When you use Examine directly to search you will get an Examine ISearchResults object back which is more or less raw data. It’s possible to work with that data but most people want to work with some strongly typed data and at the very least in Umbraco with IPublishedContent. That is pretty much what IPublishedContentQuery.Search methods are solving. Each of these methods will return an IEnumerable<PublishedSearchResult> and each PublishedSearchResult contains an IPublishedContent instance along with a Score value. A quick example in razor:

@inherits Umbraco.Web.Mvc.UmbracoViewPage
@using Current = Umbraco.Web.Composing.Current;
@{
    var search = Current.PublishedContentQuery.Search(Request.QueryString["query"]);
}

<div>
    <h3>Search Results</h3>
    <ul>
        @foreach (var result in search)
        {
            <li>
                Id: @result.Content.Id
                <br/>
                Name: @result.Content.Name
                <br />
                Score: @result.Score
            </li>
        }
    </ul>
</div>

The ordering of this search is by Score so the highest score is first. This makes searching very easy while the underlying mechanism is still Examine. The IPublishedContentQuery.Search methods make working with the results a bit nicer.

Paging results

You may have noticed that there’s a few overloads and optional parameters to these search methods too. 2 of the overloads support paging parameters and these take care of all of the quirks with Lucene paging for you. I wrote a previous post about paging with Examine and you need to make sure you do that correctly else you’ll end up iterating over possibly tons of search results which can have performance problems. To expand on the above example with paging is super easy:

@inherits Umbraco.Web.Mvc.UmbracoViewPage
@using Current = Umbraco.Web.Composing.Current;
@{
    var pageSize = 10;
    var pageIndex = int.Parse(Request.QueryString["page"]);
    var search = Current.PublishedContentQuery.Search(
        Request.QueryString["query"],
        pageIndex * pageSize,   // skip
        pageSize,               // take
        out var totalRecords);
}

<div>
    <h3>Search Results</h3>
    <ul>
        @foreach (var result in search)
        {
            <li>
                Id: @result.Content.Id
                <br/>
                Name: @result.Content.Name
                <br />
                Score: @result.Score
            </li>
        }
    </ul>
</div>

Simple search with cultures

Another optional parameter you might have noticed is the culture parameter. The docs state this about the culture parameter:

When the culture is not specified or is *, all cultures are searched. To search for only invariant documents and fields use null. When searching on a specific culture, all culture specific fields are searched for the provided culture and all invariant fields for all documents. While enumerating results, the ambient culture is changed to be the searched culture.

What this is saying is that if you aren’t using culture variants in Umbraco then don’t worry about it. But if you are, you will also generally not have to worry about it either! What?! By default the simple Search method will use the “ambient” (aka ‘Current’) culture to search and return data. So if you are currently browsing your “fr-FR” culture site this method will automatically only search for your data in your French culture but will also search on any invariant (non-culture) data. And as a bonus, the IPublishedContent returned also uses this ambient culture so any values you retrieve from the content item without specifying the culture will just be the ambient/default culture.

So why is there a “culture” parameter? It’s just there in case you want to search on a specific culture instead of relying on the ambient/current one.

Search with IQueryExecutor

IQueryExecutor is the resulting object created when creating a query with the Examine fluent API. This means you can build up any complex Examine query you want, even with raw Lucene, and then pass this query to one of the IPublishedContentQuery.Search overloads and you’ll get all the goodness of the above queries. There’s also paging overloads with IQueryExecutor too. To further expand on the above example:

@inherits Umbraco.Web.Mvc.UmbracoViewPage
@using Current = Umbraco.Web.Composing.Current;
@{
    // Get the external index with error checking
    if (ExamineManager.Instance.TryGetIndex(
        Constants.UmbracoIndexes.ExternalIndexName, out var index))
    {
        throw new InvalidOperationException(
            $"No index found with name {Constants.UmbracoIndexes.ExternalIndexName}");
    }

    // build an Examine query
    var query = index.GetSearcher().CreateQuery()
        .GroupedOr(new [] { "pageTitle", "pageContent"},
            Request.QueryString["query"].MultipleCharacterWildcard());


    var pageSize = 10;
    var pageIndex = int.Parse(Request.QueryString["page"]);
    var search = Current.PublishedContentQuery.Search(
        query,                  // pass the examine query in!
        pageIndex * pageSize,   // skip
        pageSize,               // take
        out var totalRecords);
}

<div>
    <h3>Search Results</h3>
    <ul>
        @foreach (var result in search)
        {
            <li>
                Id: @result.Content.Id
                <br/>
                Name: @result.Content.Name
                <br />
                Score: @result.Score
            </li>
        }
    </ul>
</div>

The base interface of the fluent parts of Examine’s queries are IQueryExecutor so you can just pass in your query to the method and it will work.

Recap

The IPublishedContentQuery.Search overloads are listed in the API docs, they are:

Search(String term, String culture, String indexName)
Search(String term, Int32 skip, Int32 take, out Int64 totalRecords, String culture, String indexName)
Search(IQueryExecutor query)
Search(IQueryExecutor query, Int32 skip, Int32 take, out Int64 totalRecords)

Should you always use this instead of using Examine directly? As always it just depends on what you are doing. If you need a ton of flexibility with your search results than maybe you want to use Examine’s search results directly but if you want simple and quick access to IPublishedContent results, then these methods will work great.

Does this all work with ExamineX ? Absolutely!! One of the best parts of ExamineX is that it’s completely seamless. ExamineX is just an index implementation of Examine itself so all Examine APIs and therefore all Umbraco APIs that use Examine will ‘just work’.

Filtering fields dynamically with Examine

Thu, 23 Mar 2023 15:09:58 Z

The index fields created by Umbraco in Examine by default can lead to quite a substantial amount of fields. This is primarily due in part by how Umbraco handles variant/culture data because it will create a different field per culture but there are other factors as well. Umbraco will create a “__Raw_” field for each rich text field and if you use the grid, it will create different fields for each grid row type. There are good reasons for all of these fields and this allows you by default to have the most flexibility when querying and retrieving your data from the Examine indexes. But in some cases these default fields can be problematic. Examine by default uses Lucene as it’s indexing engine and Lucene itself doesn’t have any hard limits on field count (as far as I know), however if you swap the indexing engine in Examine to something else like Azure Search with ExamineX then you may find your indexes are exceeding Azure Search’s limits.

Azure Search field count limits

Azure Search has varying limits for field counts based on the tier service level you have (strangely the Free tier allows more fields than the Basic tier). The absolute maximum however is 1000 fields and although that might seem like quite a lot when you take into account all of the fields created by Umbraco you might realize it’s not that difficult to exceed this limit. As an example, lets say you have an Umbraco site using language variants and you have 20 languages in use. Then let’s say you have 15 document types each with 5 fields (all with unique aliases) and each field is variant and you have content for each of these document types and languages created. This immediately means you are exceeding the field count limits: 20 x 15 x 10 = 1500 fields! And that’s not including the “__Raw_” fields or the extra grid fields or the required system fields like “id” and “nodeName”. I’m unsure why Azure Search even has this restriction in place

Why is Umbraco creating a field per culture?

When v8 was being developed a choice had to be made about how to handle multi-lingual data in Examine/Lucene. There’s a couple factors to consider with making this decision which mostly boils down to how Lucene’s analyzers work. The choice is either: language per field or language per index. Some folks might think, can’t we ‘just’ have a language per document? Unfortunately the answer is no because that would require you to apply a specific language analyzer for that document and then scoring would no longer work between documents. Elastic Search has a good write up about this. So either language per field or different indexes per language. Each has pros/cons but Umbraco went with language per field since it’s quite easy to setup, supports different analyzers per language and doesn’t require a ton of indexes which also incurs a lot more overhead and configuration.

Do I need all of these fields?

That really depends on what you are searching on but the answer is most likely ‘no’. You probably aren’t going to be searching on over 1000s fields, but who knows every site’s requirements are different. Umbraco Examine has something called an IValueSetValidator which you can configure to include/exclude certain fields or document types. This is synonymous with part of the old XML configuration in Examine. This is one of those things where configuration can make sense for Examine and @callumwhyte has done exactly that with his package “Umbraco Examine Config”. But the IValueSetValidator isn’t all that flexible and works based on exact naming which will work great for filtering content types but perhaps not field names. (Side note – I’m unsure if the Umbraco Examine Config package will work alongside ExamineX, need to test that out).

Since Umbraco creates fields with the same prefixed names for all languages it’s relatively easy to filter the fields based on a matching prefix for the fields you want to keep.

Here’s some code!

The following code is relatively straight forward with inline comments: A custom class “IndexFieldFilter” that does the filtering and can be applied different for any index by name, a Component to apply the filtering, a Composer to register services. This code will also ensure that all Umbraco required fields are retained so anything that Umbraco is reliant upon will still work.

/// <summary>
/// Register services
/// </summary>
public class MyComposer : ComponentComposer<MyComponent>
{
    public override void Compose(Composition composition)
    {
        base.Compose(composition);
        composition.RegisterUnique<IndexFieldFilter>();
    }
}

public class MyComponent : IComponent
{
    private readonly IndexFieldFilter _indexFieldFilter;

    public MyComponent(IndexFieldFilter indexFieldFilter)
    {
        _indexFieldFilter = indexFieldFilter;
    }

    public void Initialize()
    {
        // Apply an index field filter to an index
        _indexFieldFilter.ApplyFilter(
            // Filter the external index 
            Umbraco.Core.Constants.UmbracoIndexes.ExternalIndexName, 
            // Ensure fields with this prefix are retained
            new[] { "description", "title" },
            // optional: only keep data for these content types, else keep all
            new[] { "home" });
    }

    public void Terminate() => _indexFieldFilter.Dispose();
}

/// <summary>
/// Used to filter out fields from an index
/// </summary>
public class IndexFieldFilter : IDisposable
{
    private readonly IExamineManager _examineManager;
    private readonly IUmbracoTreeSearcherFields _umbracoTreeSearcherFields;
    private ConcurrentDictionary<string, (string[] internalFields, string[] fieldPrefixes, string[] contentTypes)> _fieldNames
        = new ConcurrentDictionary<string, (string[], string[], string[])>();
    private bool disposedValue;

    /// <summary>
    /// Constructor
    /// </summary>
    /// <param name="examineManager"></param>
    /// <param name="umbracoTreeSearcherFields"></param>
    public IndexFieldFilter(
        IExamineManager examineManager,
        IUmbracoTreeSearcherFields umbracoTreeSearcherFields)
    {
        _examineManager = examineManager;
        _umbracoTreeSearcherFields = umbracoTreeSearcherFields;
    }

    /// <summary>
    /// Apply a filter to the specified index
    /// </summary>
    /// <param name="indexName"></param>
    /// <param name="includefieldNamePrefixes">
    /// Retain all fields prefixed with these names
    /// </param>
    public void ApplyFilter(
        string indexName,
        string[] includefieldNamePrefixes,
        string[] includeContentTypes = null)
    {
        if (_examineManager.TryGetIndex(indexName, out var e)
            && e is BaseIndexProvider index)
        {
            // gather all internal index names used by Umbraco 
            // to ensure they are retained
            var internalFields = new[]
                {
                LuceneIndex.CategoryFieldName,
                LuceneIndex.ItemIdFieldName,
                LuceneIndex.ItemTypeFieldName,
                UmbracoExamineIndex.IconFieldName,
                UmbracoExamineIndex.IndexPathFieldName,
                UmbracoExamineIndex.NodeKeyFieldName,
                UmbracoExamineIndex.PublishedFieldName,
                UmbracoExamineIndex.UmbracoFileFieldName,
                "nodeName"
            }
                .Union(_umbracoTreeSearcherFields.GetBackOfficeFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeDocumentFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeMediaFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeMembersFields())
                .ToArray();

            _fieldNames.TryAdd(indexName, (internalFields, includefieldNamePrefixes, includeContentTypes ?? Array.Empty<string>()));

            // Bind to the event to filter the fields
            index.TransformingIndexValues += TransformingIndexValues;
        }
        else
        {
            throw new InvalidOperationException(
                $"No index with name {indexName} found that is of type {typeof(BaseIndexProvider)}");
        }
    }

    private void TransformingIndexValues(object sender, IndexingItemEventArgs e)
    {
        if (_fieldNames.TryGetValue(e.Index.Name, out var fields))
        {
            // check if we should ignore this doc by content type
            if (fields.contentTypes.Length > 0 && !fields.contentTypes.Contains(e.ValueSet.ItemType))
            {
                e.Cancel = true;
            }
            else
            {
                // filter the fields
                e.ValueSet.Values.RemoveAll(x =>
                {
                    if (fields.internalFields.Contains(x.Key)) return false;
                    if (fields.fieldPrefixes.Any(f => x.Key.StartsWith(f))) return false;
                    return true;
                });
            }
        }
    }

    protected virtual void Dispose(bool disposing)
    {
        if (!disposedValue)
        {
            if (disposing)
            {
                // Unbind from the event for any bound indexes
                foreach (var keys in _fieldNames.Keys)
                {
                    if (_examineManager.TryGetIndex(keys, out var e)
                        && e is BaseIndexProvider index)
                    {
                        index.TransformingIndexValues -= TransformingIndexValues;
                    }
                }
            }
            disposedValue = true;
        }
    }

    public void Dispose()
    {
        Dispose(disposing: true);
        GC.SuppressFinalize(this);
    }
}

That should give you the tools you need to dynamically filter your index based on fields and content type’s if you need to get your field counts down. This would also be handy even if you aren’t using ExamineX and Azure Search since keeping an index size down and storing less data means less IO operations and storage size.

Examine and Azure Blob Storage

Thu, 23 Mar 2023 15:09:45 Z

Quite some time ago - probably close to 2 years - I created an alpha version of an extension library to Examine to allow storing Lucene indexes in Blob Storage called Examine.AzureDirectory. This idea isn’t new at all and in fact there’s been a library to do this for many years called AzureDirectory but it previously had issues and it wasn’t clear on exactly what it’s limitations are. The Examine.AzureDirectory implementation was built using a lot of the original code of AzureDirectory but has a bunch of fixes (which I contributed back to the project) and different ways of working with the data. Also since Examine 0.1.90 still worked with lucene 2.x, this also made this compatible with the older Lucene version.

… And 2 years later, I’ve actually released a real version

Why is this needed?

There’s a couple reasons – firstly Azure web apps storage run on a network share and Lucene absolutely does not like it’s files hosted on a network share, this will bring all sorts of strange performance issues among other things. The way AzureDirectory works is to store the ‘master’ index in Blob Storage and then sync the required Lucene files to the local ‘fast drive’. In Azure web apps there’s 2x drives: ‘slow drive’ (the network share) and the ‘fast drive’ which is the local server’s temp files on local storage with limited space. By syncing the Lucene files to the local fast drive it means that Lucene is no longer operating over a network share. When writes occur, it writes back to the local fast drive and then pushes those changes back to the master index in Blob Storage. This isn’t the only way to overcome this limitation of Lucene, in fact Examine has shipped a work around for many years which uses something called SyncDirectory which does more or less the same thing but instead of storing the master index in Blob Storage, the master index is just stored on the ‘slow drive’. Someone has actually taken this code and made a separate standalone project with this logic called SyncDirectory which is pretty cool!

Load balancing/Scaling

There’s a couple of ways to work around the network share storage in Azure web apps (as above), but in my opinion the main reason why this is important is for load balancing and being able to scale out. Since Lucene doesn’t work well over a network share, it means that Lucene files must exist local to the process it’s running in. That means that when you are load balancing or scaling out, each server that is handling requests will have it’s own local Lucene index. So what happens when you scale out further and another new worker goes online? This really depending on the hosting application… for example in Umbraco, this would mean that the new worker will create it’s own local indexes by rebuilding the indexes from the source data (i.e. database). This isn’t an ideal scenario especially in Umbraco v7 where requests won’t be served until the index is built and ready. A better scenario is that the new worker comes online and then syncs an existing index from master storage that is shared between all workers …. yes! like Blob Storage.

Read/Write vs Read only

Lucene can’t be written to concurrently by multiple processes. There are some workarounds here a there to try to achieve this by synchronizing processes with named mutex/semaphore locks and even AzureSearch tries to handle some of this by utilizing Blob Storage leases but it’s not a seamless experience. This is one of the reasons why Umbraco requires a ‘master’ web app for writing and a separate web app for scaling which guarantees that only one process writes to the indexes. This is the setup that Examine.AzureDirectory supports too and on the front-end/replica/slave web app that scales you will configure the provider to be readonly which guarantees it will never try to write back to the (probably locked) Blob Storage.

With this in place, when a new front-end worker goes online it doesn’t need to rebuild it’s own local indexes, it will just check if indexes exist and to do that will make sure the master index is there and then continue booting. At this stage there’s actually almost no performance overhead. Nothing actually happens with the local indexes until the index is referenced by this worker and when that happens Examine will lazily just sync the Lucene files that it needs locally.

How do I get it?

First thing to point out is that this first release is only for Examine 0.1.90 which is for Umbraco v7. Support for Examine 1.x and Umbraco 8.x will come out very soon with some slightly different install instructions.

The release notes of this are here, the install docs are here, and the Nuget package for this can be found here.

PM> Install-Package Examine.AzureDirectory -Version 0.1.90

To activate it, you need to add these settings to your web.config

<add key="examine:AzureStorageConnString" value="YOUR-STORAGE-CONNECTION-STRING" />
<add key="examine:AzureStorageContainer" value="YOUR-CONTAINER-NAME" />

Then for your master server/web app you’ll want to add a directoryFactory attribute to each of your indexers in ExamineSettings.config, for example:

<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
      supportUnpublished="true"
      supportProtected="true"
      directoryFactory="Examine.AzureDirectory.AzureDirectoryFactory, Examine.AzureDirectory"
      analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

For your front-end/replicate/slave server you’ll want a different readonly value for the directoryFactory like:

<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
      supportUnpublished="true"
      supportProtected="true"
      directoryFactory="Examine.AzureDirectory.ReadOnlyAzureDirectoryFactory, Examine.AzureDirectory"
      analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

Does it work?

Great question :) With the testing that I’ve done it works and I’ve had this running on this site for all of last year without issue but I haven’t rigorously tested this at scale with high traffic sites, etc… I’ve decided to release a real version of this because having this as an alpha/proof of concept means that nobody will test or use it. So now hopefully a few of you will give this a whirl and let everyone know how it goes. Any bugs can be submitted to the Examine repo.

Examine hits RC1

Thu, 23 Mar 2023 15:08:14 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

I’m happy to announce that Examine and UmbracoExamine have today hit RC1!

The Codeplex site also has more extensive documentation about how to get UmbracoExamine up and running within your Umbraco website.

Go, download your copy today.

Examine’s fluent search API

Thu, 23 Mar 2023 15:08:14 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

As I mentioned in my last blog post we’ve done a lot of work to refactor Examine (and Umbraco Examine) to use a fluent search API rather than a string based search API.

The primary reason for this was to do with how we were handling the string searching and opening up the Lucene.Net search API. In the initial preview version we would take the text which you entered as a search term and then produce a Lucene.Net search against all the fields in your index. This is ok, but it’s not great. The problem came when we wanted to implement a dynamic search query. There were several different search parameters, which were to check against different fields in the index.
It was sort of possible to achieve this, but you needed to understand the internals of Examine and you also needed to understand the Lucene query language, and also that you couldn’t use the AND/ OR/ NOT operators, you had to use +, – or blank.

This is fine if you’re into search API’s, but really, how many people are actually like that? Ok, I must admit that I’m rather smitten with Lucene but I’m not exactly a good example of a normal person..

So I set about addressing this problem, we needed to get a much simpler way in which your average Joe could come and without knowing the underlying technology write complex and useful search queries.
For this we’ve build a set of interfaces which you require:

ISearchCriteria
IQuery
IBooleanOperation

ISearchCriteria

The ISearchCriteria interface is the real workhorse of the API, it’s the first interface you start with, and it’s the last interface you deal with. In fact, ISearchCriteria implements IQuery, meaning that all the query operations start here.

In addition to query operations there are several additional properties for such as the maximum number of results and the type of data being searched.

Because ISearchCriteria is tightly coupled with the BaseSearchProvider implementation it is actually created via a factory pattern, like so:

ISearchCriteria searchCriteria = ExamineManager.Instance.SearchProviderCollection["MySearcher"].CreateSearchCriteria(100, IndexType.Content);

What we’re doing here is requesting that our BaseSearchProvider creates an instance of an ISearchCriteria. It takes two parameters:

int maxResults
Examine.IndexType indexType

This data can/ should be then used by the search method to return what’s required.

IQuery

The IQuery interface is really the heart of the fluent API, it’s what you use to construct the search for your site. Since Examine is designed to be technology agnostic the methods which are exposed via IQuery are fairly generic. A lot of the concepts are borrowed from Lucene.Net, but they are fairly generic and should be viable for any searcher.

The IQuery API exposes the following methods:

IBooleanOperation Id(int id);
IBooleanOperation NodeName(string nodeName);
IBooleanOperation NodeName(IExamineValue nodeName);
IBooleanOperation NodeTypeAlias(string nodeTypeAlias);
IBooleanOperation NodeTypeAlias(IExamineValue nodeTypeAlias);
IBooleanOperation ParentId(int id);
IBooleanOperation Field(string fieldName, string fieldValue);
IBooleanOperation Field(string fieldName, IExamineValue fieldValue);
IBooleanOperation MultipleFields(IEnumerable<string> fieldNames, string fieldValue);
IBooleanOperation MultipleFields(IEnumerable<string> fieldNames, IExamineValue fieldValue);
IBooleanOperation Range(string fieldName, DateTime start, DateTime end);
IBooleanOperation Range(string fieldName, DateTime start, DateTime end, bool includeLower, bool includeUpper);
IBooleanOperation Range(string fieldName, int start, int end);
IBooleanOperation Range(string fieldName, int start, int end, bool includeLower, bool includeUpper);
IBooleanOperation Range(string fieldName, string start, string end);
IBooleanOperation Range(string fieldName, string start, string end, bool includeLower, bool includeUpper);

As you can see all the methods within the IQuery interface return an IBooleanOperator, this is how the fluent API works!

Hopefully it’s fairly obvious what each of the methods are, but the one you’re most likely to use is Field. Field allows you to specify any field in your index, and then provide a word to lookup within that field.

IExamineValue

You’ve probably noticed the IExamineValue parameter which is passable to a lot of the different methods, methods which take a string, but what is IExamineValue?
Well obviously it’s some-what provider dependant, so I’ll talk about it as part of Umbraco Examine, as that’s what I think most initial uptakers will want.

Because Lucene supports several different term modifiers for text we decided it would be great to have those exposed in the API for people to leverage. For this we’ve got a series of string extension methods which reside in the namespace

UmbracoExamine.SearchCriteria

So once you add a using statement for that you’ll have the following extension methods:

public static IExamineValue SingleCharacterWildcard(this string s)
public static IExamineValue MultipleCharacterWildcard(this string s)
public static IExamineValue Fuzzy(this string s)
public static IExamineValue Fuzzy(this string s, double fuzzieness)
public static IExamineValue Boost(this string s, double boost)
public static IExamineValue Proximity(this string s, double proximity)
public static IExamineValue Excape(this string s)
public static string Then(this IExamineValue vv, string s)

All of these (with the exception of Then) return an IExamineValue (which UmbracoExamine internally handles), and it tells Lucene.Net how to handle the term modifier you required.

I wont repeat what is said within the Lucene documentation, I suggest you read that to get an idea of what to use and when.
The only exceptions are Escape and Then.

Escape

If you’re wanting to search on multiple works together then Lucene requires them to be ‘escaped’, otherwise it’ll (generally) treat the space character as a break in the query. So if you wanted to search for Umbraco Rocks and didn’t escape it you’d match on both Umbraco and Rocks, where as when it’s escaped you’ll then match on the two words in sequence.

Then

The Then method just allows you to combine multiple strings or multiple IExamineValues, so you can boost your fuzzy query with a proximity of 0.1 :P.

IBooleanOpeation

IBooleanOperation allows your to join multiple IQuery methods together using:

IQuery And()
IQuery Or()
IQuery Not()

These are then translated into the underlying searcher so it can determine how to deal with your chaining. At the time of writing we don’t support nested conditionals (grouped OR’s operating like an And).

There’s another method on IBooleanOperation which doesn’t fall into the above, but it’s very critical to the overall idea:

ISearchCriteria Compile()

The Compile method will then return an ISearchCriteria which you then pass into your searcher. It’s expected that this is the last method which is called and it’s meant to prepare all search queries for execution.
The reason we’re going with this rather than passing the IQuery into the Searcher is that it means we don’t have to have the max results/ etc into every IQuery instance, it’s not something that is relevant in that scope, so it’d just introduce code smell, and no one wants that.

Bringing it all together

So now you know the basics, how do you go about producing a query?

Well the first thing you need to do is get an instance of your ISearchCriteria:

var sc = ExamineManager.Instance.CreateSearchCriteria();

Now lets do a search for a few things across a few different fields:

var query = sc.NodeName("umbraco").And().Field("bodyText", "is awesome".Escape()).Or().Field("bodyText", "rock".Fuzzy());

Now we’ve got a query across a few different fields, lastly we need to pass it to our searcher:

var results = ExamineManager.Instance.Search(query.Compile());

It’s just that simple!

Hopefully the fluent API is clean enough that people can build nice and complex queries and are able to search their websites with not problem. If you’ve got any feedback please leave it here, as we’re working to get an RC out soon.

Examine RC2 posted

Thu, 23 Mar 2023 15:08:14 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

I’ve just released Examine RC2 into the while, you can download it from our CodePlex site.

RC2 fixes a bug in RC1 which wasn’t indexing user fields, only attribute fields.

There’s a few breaking changes with RC2:

IQuery.MultipleFields has been removed. Use IQuery.GroupedAnd, IQuery.GroupedOr, IQuery.GroupedNot or IQuery.GroupedFlexible to define how multiple fields are added
ISearchCriteria.RawQuery added which allows you to pass a raw query string to the underlying provider
ISearcher.Search returns a new interface ISearchResults (which inherits IEnumerable<SearchResult>)
New interface ISearchResults which exposes a Skip to support paging and TotalItemCount

Will be working on more documentation to explain some of the newly added and obscure features shortly :P.

Examine, but not as you knew it

Thu, 23 Mar 2023 15:08:14 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

Almost 12 months ago Shannon blogged about Umbraco Examine a Lucene.NET indexer which works nicely with Umbraco 4.x. Since then we’ve done quite a bit of work on Examine, and as people will may be aware we’ve integrated Examine into the Umbraco core and it will be shipped out of the box with Umbraco 4.1.

Something Shannon and I had discussed a few times was that we wanted to decouple Examine from Umbraco so it could be used for indexing on sites other than Umbraco.
You’ll also notice that I keep referring to it as Examine, not Umbraco Examine which most people are more familiar with.
This is because over the last week we have achieved what we’d wanted to do, we’ve decoupled Examine from Umbraco!

So what’s Examine?

Examine is a provider based, config driven search and indexer framework. Examine provides all the methods required for indexing and searching any data source you want to use.

Examine is now agnostic of the indexer/ searcher API, as well as the data source. That’s right Examine has no references within itself to Umbraco, nor does it have any references to Lucene.NET.
We have still maintained a usage of XML internally for passing the data-to-index around, as it’s the easiest construct which we could think to work with and pass around.

You could implement the Examine framework in any solution, to index any data you want, it could be from a SQL server, or it could be from web-scraped content.

Where does that leave Umbraco Examine?

Umbraco Examine still exists, in fact it’s the primary (and currently only) implementer of Examine. Over the last week though we’ve done a lot of refactoring of Umbraco Examine to work with some changes we’ve done to the underlying Examine API.

Changes? What changes?

Last week anyone who follows me on Twitter will have seen a lot of tweets around Umbraco Examine which was about a new search API and the breaking changes we were implementing.

While looking to refactor the underlying API of a large Umbraco site we have running I found that Examine was actually not properly designed if you wanted to search for data in specific fields, or build complex search queries.

This was a real bugger, I had many different parameters I needed to optionally search on, and only in certain fields, but since Umbraco Examine works with just a raw string this wasn’t possible.

So I set about creating a new fluent search API. This has actually turned out quite well, in fact so well that we new have this as the recommended search method, not raw text (which is still available).

The fluent API is part of the Examine API so it’s also available for any implementation, not just Umbraco! Since we’ve used Lucene.NET as the initial support model the API is designed similarly to what you’d expect from Lucene.NET, but we hope that it’s generic enough to look and feel right for any indexer/ searcher.

Here’s how the fluent API looks:

searchCriteria
.Id(1080)
.Or()
.Field("headerText", "umb".Fuzzy())
.And()
.NodeTypeAlias("cws".MultipleCharacterWildcard())
.Not()
.NodeName("home");

All you have to do is pass that into your searcher. That easy, and that beautiful. I’ll do a blog post where we’ll look more deeply into the fluent API separately.

Additionally we’ve done some other changes, because of what the framework new is we’ve renamed our assemblies and namespaces:

Examine.dll

This was formally UmbracoExamine.Core.dll
Root namespace Examine
Contains all the classes and methods to create your own indexer and searcher

UmbracoExamine.dll

This was formally UmbracoExamine.Providers.dll
Root namespace UmbracoExamine.dll
Contains all the classes and methods of an Umbraco & Lucene.NET

Apologies to any existing implementations of Umbraco Examine, this will result in breaking changes but since we’ve not hit RC yet too bad :P.

There are also some changes to the config, <IndexUserFields /> has become <IndexStandardFields />, and obviously the config registrations are different with the assembly and namspace changes.

The last change is that we’ve moved to the Ms-PL license for Examine, whos source is available on codeplex.

Currently we’re working to tidy up the API and the documentation so that we can get the RC release out shortly, so watch this space.

Examine’s Fluent Search API – Elevator Pitch

Thu, 23 Mar 2023 15:08:14 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

I realised that with my blog post about Examine it was fairly in-depth and a lot of people were probably bored before they got to the good bits about how easy searching can be.
So I decided that a smaller, more concise post was in order.

What?

The Fluent Search API is a chainable (like jQuery) API for building complex searches for a data source, in this case Umbraco. It doesn’t require you to know any “search language”, it just works via standard .NET style method calls, with intellisense to help guide you along the way.

How?

This is achieved by combining IQuery methods (search methods) with IBooleanOperation methods (And, Or, Not) to produce something cool. For example:

var query = sc
	.NodeName("umbraco")
	.And()
	.Field("bodyText", "is awesome".Escape())
	.Or()
	.Field("bodyText", "rock".Fuzzy());

Examineness can be implemented to do special things to search text, like making it a wild card query, or escaping several terms to have them used as a search sentence.

Hopefully this more direct post will engage your attention better and make you want more Examine sexiness.

How to build a search query in Examine

Thu, 23 Mar 2023 15:08:13 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

Now that Examine is able to be used by a wider audience than Umbraco understanding how search works is possibly a bit more important ;).

So today while answering a question on the Umbraco forum I thought that what I was going on about is something that more people might want to hear about. And really, I do like the sound of my own (virtual) voice…

Understanding Lucene.Net from Examine

For this I’m going to be looking at the Lucene.Net implementation of Examine, and this is agnostic of whether you’re using UmbracoExamine or Examine.LuceneEngine.

To get started you should familiarize yourself a bit with the Lucene Query Parser Syntax, as that’s what we’re using internally to get the data back from Lucene.Net.

Also, we’re going to be working with the Fluent API for Examine, so fell free to read up on that here and here. With Examine we’ve made it easy to see what the Lucene query you’re building up is as you’re working with the Fluent API, in fact if you to a .ToString() call on the ISearchCriteria instance you’ll be able to see what you’re search query looks like (we’ve also got some other information in the result of that method call too).

So you can see what’s been generated, let’s build a query and dissect it:

var criteria = searcher.CreateSearchCriteria(IndexTypes.Content);
criteria = criteria.NodeName("Hello").And().NodeTypeAlias("world").Compile();

Console.WriteLine(criteria.ToString()); //+(+nodeName:Hello +nodeTypeAlias:world) +__IndexType:content

So this is what we've got, a total of three conditions… But wait, there’s only two conditions that I entered right?

Not quite, Examine has some smarts built into it around what you’re searching on and it’ll add that restriction on your behalf. This is so you don’t get results back from a different index type in your query. Note: If you don’t specify the IndexType then this condition wont be added for your.

Also, to ensure that all your entered queries don’t get killed by the IndexType (a problem in earlier Examine builds) we combine everything you enter into a GroupedAnd statement.

Let’s have a look at the parts we did request, and how they are comprised:

+nodeName:Hello +nodeTypeAlias:world

Ok, so what we've got here is our two conditions which we've added using Examine. Each is an AND (or in Lucene terminology SHOULD) and this is denoted by the +. Next we have a field name, in this case either nodeName or nodeTypeAlias. Youl’ll notice back in our Fluent API query we actually generated all of that using the build in methods, rather than having to use more magic strings. Next there is a : to indicate where the field name ends. Lastly we have the term which we’re going to search against, either Hello or world.

So essentially it’s built up of BOOLEAN_OPERATION&FieldName:Term (the & is so it’s slightly readable).

Conclusion

This was a brief look at how the Fluent API for Examine will turn your typed query into a Lucene query that is then searching.

With this knowledge you should be better able to design complex queries, use mixed conditionals and just plain go crazy.

Paging with Examine

Thu, 23 Mar 2023 15:08:13 Z

Paging with Lucene and Examine requires some specific API usage. It's very easy to get wrong by using Linq's Skip/Take methods and when doing this you'll inadvertently end up loading in all search results from Lucene and then filtering in memory when what you really want to do is have Lucene only create the minimal search result objects that you are interested in.

There are 2 important parts to this:

The Skip method on the ISearchResults object
The Search overload on the BaseSearchProvider where you can specify maxResults

ISearchResults.Skip

This is very different from the Linq Skip method so you need to be sure you are using the Skip method on the ISearchResults object. This tells Lucene to skip over a specific number of results without allocating the result objects. If you use Linq’s Skip method on the underlying IEnumerable<SearchResult> of ISearchResults, this will allocate all of the result objects and then filter them in memory which is what you don’t want to do.

Search with maxResults

Lucene isn’t perfect for paging because it doesn’t natively support the Linq equivalent to “Skip/Take”. It understands Skip (as above) but doesn’t understand Take, instead it only knows how to limit the max results so that it doesn’t allocate every result, most of which you would probably not need when paging.

With the combination of ISearchResult.Skip and maxResults, we can tell Lucene to:

Skip over a certain number of results without allocating them and tell Lucene
only allocate a certain number of results after skipping

Show me the code

//for example purposes, we want to show page #4 (which is pageIndex of 3)
var pageIndex = 3;   
//for this example, the page size is 10 items
var pageSize = 10;
var searchResult = searchProvider.Search(criteria, 
   //don't return more results than we need for the paging
   //this is the 'trick' - we need to load enough search results to fill
   //all pages from 1 to the current page of 4
   maxResults: pageSize*(pageIndex + 1));
//then we use the Skip method to tell Lucene to not allocate search results
//for the first 3 pages
var pagedResults = searchResult.Skip(pageIndex*pageSize);
var totalResults = searchResult.TotalItemCount;

So that is the correct way to do paging with Examine and Lucene which ensures max performance and minimal object allocations.

Using Examine to index & search with ANY data source

Thu, 23 Mar 2023 15:08:13 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

During CodeGarden 2010 a few people were asking how to use Examine to index and search on data from any data source such as custom database tables, etc… Previously, the only way to do this was to override the Umbraco Examine indexing provider, remove the Umbraco functionality embedded in there, and then do a lot of coding yourself. …But now there’s some great news! As of now you can use all of the Examine goodness with it’s embedded Lucene.Net with any data source and you can do it VERY easily.

Some things you need to know about the new version:

I haven’t made a release version of this yet as it still needs some more testing, though we are putting this into a production site next week.
If you want to try this, currently you’ll need to get the latest source from Examine @ CodePlex
If you are using a previous version of Examine, there’s a few breaking changes as some of the class structures have been moved, however you config file should still work as is… HOWEVER, you should update your config file to reflect the new one with the new class names
There is now 3 DLLs, not just 2:
- Examine.DLL
  - Still pretty much the same… contains the abstraction layer
- Examine.LuceneEngine.DLL
  - The new DLL to use to work with data that is not Umbraco specific
- UmbracoExamine.DLL
  - The DLL that the Umbraco providers are in

Ok, now on to the good stuff. First, I’ve added a demo project to this post which you can download HERE. This project is a simple console app that contains a sample XML data file that has 5 records in it. Here’s what the app does:

This re-indexes all data
Searches the index for node id 1
Ensures one record is found in the index
Updates the dateUpdated time stamp for the data record
Re-indexes the record with node id 1’

So assuming that you have some custom data like a custom database table, xml file, or whatever, there’s really only 3 things that you need to do to get Examine indexing your custom data:

Create your own ISimpleDataService
- There is only 1 method to implement: IEnumerable<SimpleDataSet> GetAllData(string indexType)
- This is the method that Examine will call to re-index your data
- A SimpleDataSet is a simple object containing a Dictionary<string, string> and a IndexedNode object (which consists of a Node Id and a Node Type)
- For example, if you had a database row, your SimpleDataSet object for the row would be the dictionary of the rows values, it’s node id and type … easy.
Use the ToExamineXml() extension method to re-index individual nodes/records
- Examine relies on data being in the same XML structure as Umbraco (which we might change in version 2 sometime in the future… like next year) so we need to transform simple data into the XML structure. We’ve made this quite easy for you; all you have to do is get the data from your custom data source into a Dictionary<string, string> object and use this extension method to pass the xml structure in to Examine’s ReIndexNode method.
- For example: ExamineManager.Instance.ReIndexNode(dataSet.ToExamineXml(dataSet["Id"], "CustomData"), "CustomData"); where dataSet is a Dictionary<string, string> .
Update your Examine config to use the new SimpleDataIndexer index provider and the new LuceneSearcher search provider

If you’re not using Umbraco at all, then you’ll only need to have the 2 Examine DLLs which don’t reference the Umbraco DLLs whatsoever so everything is decoupled.

I’d recommend downloading the demo app and running it as it will show you everything you need to know on how to get Examine running with custom data. However, i know that people just like to see code in blog posts, so here’s the config for the demo app:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>

  <configSections>
    <section name="Examine" type="Examine.Config.ExamineSettings, Examine"/>
    <section name="ExamineLuceneIndexSets" 
             type="Examine.LuceneEngine.Config.IndexSets, Examine.LuceneEngine"/>
  </configSections>

  <Examine>
    <ExamineIndexProviders>
      <providers>

        <!-- 
        Define the indexer for our custom data.
        Since we're only indexing one type of data, there's 
        only 1 indexType specified: 'CustomData', however
        if you have more than one type of index (i.e. Media, Content) 
        then you just need to list them as a comma seperated list without spaces.
        
        The dataService is how Examine queries whatever data source you have, 
        in this case it's a custom data service defined in this project.
        A custom data service only has to implement one method... very easy.        
        -->
        <add name="CustomIndexer"
             type="Examine.LuceneEngine.Providers.SimpleDataIndexer, 
                Examine.LuceneEngine"
             dataService="ExamineDemo.CustomDataService, ExamineDemo"
             indexTypes="CustomData"
             runAsync="false"/>

      </providers>
    </ExamineIndexProviders>
    <ExamineSearchProviders defaultProvider="CustomSearcher">
      <providers>
     
        <!-- 
        A search provider that can query a lucene index, no other 
        work is required here 
        -->
        <add name="CustomSearcher"
             type="Examine.LuceneEngine.Providers.LuceneSearcher, 
                Examine.LuceneEngine" />

      </providers>
    </ExamineSearchProviders>
  </Examine>

  <ExamineLuceneIndexSets>

    <!-- Create an index set to hold the data for our index -->
    <IndexSet SetName="CustomIndexSet" 
              IndexPath="App_Data\CustomIndexSet">
      <IndexUserFields>        
        <add Name="name" />
        <add Name="description" />
        <add Name="dateUpdated" />
      </IndexUserFields>
    </IndexSet>
    
  </ExamineLuceneIndexSets>

</configuration>

Examine slide deck for CodeGarden 2010

Thu, 23 Mar 2023 15:08:13 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

A few people had asked during CodeGarden 2010 if I would post up the slide deck for my Examine presentation, so here it is. There’s not a heap of information in there since i think people would have soaked up most of the info during the examples and coding demos but it’s posted here regardless and hopefully it helps a few people.

I’ve included a PDF version (link at the bottom) and also the image version below (if you’re too lazy to download it :)

Download slide deck here

Examine v1.0 RTM

Thu, 23 Mar 2023 15:08:09 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

We finally released Examine version 1.0 a week or so ago. You can find the latest download package from the CodePlex downloads page for Examine: http://examine.codeplex.com/releases/view/50781

Here’s what you’ll need to know

There are some breaking changes from the version that is shipped with Umbraco 4.5 and also from the Examine RC3 release. The downloads tab on CodePlex contains the Release Notes for download which contains all of the information on upgrading & breaking changes
READ THE RELEASE NOTES BEFORE UPGRADING
There’s a ton of bugs fixed in this release from the version shipped with Umbraco 4.5
Lots of new features have been added:
- Indexing ANY type of data easily using the LuceneEngine index/search providers
- PDF Indexing for Umbraco
- XSLT extensions for Umbraco
- Data Type declarations for indexed fields
- Date & Number range searching
New documentation has been added to CodePlex

Using v1.0 RTM on Umbraco 4.5

The upgrade process from the Examine version shipped with 4.5 to v1.0 RTM should be pretty seamless (unless you are using some specific API calls as noted in the release notes). However, once you drop in the new DLLs you’ll probably notice that the internal search no longer works. This is due to a bug in the Umbraco 4.5. codebase and an non-optimal implementation of Examine which has to do with case sensitivity for application aliases (i.e. Content vs content ). The work-around is simple though: all we need to do is change the Analyzer used for the internal searcher in the Examine configuration file to use the StandardAnalyzer instead of the WhitespaceAnalyzer. This is because the WhitespaceAnalyzer is case sensitive whereas the StandardAnalyzer is not. This issue is fixed in Umbraco Juno (4.6) and will continue to use the WhitespaceAnalyzer so that Examine doesn’t tokenize strings that contain punctuation. For more info on Analyzers, have a look at Aaron’s post.

Next Versions

There probably won’t be too many more changes coming for Examine v1.0 apart from any bug fixing that needs to be done and maybe some tweaks to the Fluent API. We will start working on v2.0 at some point this year or early next year which will take Examine to the next level. It will be less focused on configuration, have a smaller foot print and be much more configurable through code (such as how ASP.Net MVC works).

Examine output indexing

Thu, 23 Mar 2023 15:08:09 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

Last week Pete Gregory (@pgregorynz) and I were discussing different implementations of Examine. Particularly when you need to use Examine events to collate information from different nodes to put into the index for the page being rendered. An example of this is an FAQ engine where you might have an Umbraco content structure such as:

Site Container
- Public
  - FAQs
    - FAQ Item 1
    - FAQ Item 2
    - FAQ Item 3

In this example, the page that is rendered to the end user is FAQs but the data from all 4 nodes (FAQs, FAQ Item 1 –> 4) needs to be added to the index for the FAQs page. To do this you can use Examine events, either using the GatheringNodeData of the BaseIndexProvider, or by using the DocumentWriting event of the UmbracoContentIndexer (I’ll write another post covering the difference between these two events and why they both exist). Though writing Examine event handlers to put the data from FAQ Item 1 –> 4 into the FAQs index isn’t very difficult, it would still be really cool if all of this could be done automatically.

Pete mentioned it would be cool if we could just index the output html of a page (sort of like Google) and suddenly the ideas started to flow. This concept is actually quite easy to do so within the next month or so we’ll probably release a beta of Examine Output Indexing. Here’s the way it’ll all get put together:

An HttpModule will be created to do 2 things:
- Check if the current request is an Umbraco page request
  - If it is, we can easily get the current node being rendered since it’s already been added to the HttpContext items by Umbraco
  - Use the standard Examine handlers to enter the node’s data into the indexes based on the configuration you’ve specified in your Examine configuration files
- Get the HTML output of the page before it is rendered to the end user, parse the html to get the relevant data and put it into the index for the current Umbraco page
We figured that it would also be cool to have an Examine node property that developers could defined called something like: examineNoIndex which we could check for when we determine that it’s an Umbraco page and if this property is set to true, we’ll not index this page.
- This could give developers more control over what specific pages shouldn’t be indexed based directly from the CMS properties instead of writing custom events

With the above, a developer will simply need to put the HttpModule in their web.config, define an Examine index based on a new provider we create and that’s it. There will be no need to manually collate node data such as the above FAQ example. However, please note that this will work for straight forward searching so if you have complex searching & indexing requirements, I would still recommend using events since you have far more control over what information is indexed.

Any feedback is much appreciated since we haven’t started developing this quite yet.

Examine RC3 Released

Thu, 23 Mar 2023 15:08:09 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

Hopefully this will be a quick RC! I’m really hoping to release v1.0 RTM by early next week (latest). If you are able to help out with some testing it would be amazing!!

Here's what's new:

PDF Indexing
Easily implement custom data indexing outside of Umbraco
More XSLT Extensions for Umbraco
Some framework refactoring so a new DLL: Examine.LuceneEngine.dll which contains all of the Lucene.Net implementation
- Because of this refactoring, if you've built your own providers, you may need to update our code to work, otherwise it is backwards compatible for most people.
More unit tests
More documentation

Get it while it’s hot! And don’t forget to read the release notes.

DOWNLOAD FROM CODEPLEX HERE

Searching Multi-Node Tree Picker data (or any collection) with Examine

Thu, 23 Mar 2023 15:08:09 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

With the release of uComponents recently a lot of people are starting to work with a new data type called the MultiNodeTreePicker, and with this I’ve seen a few questions around searching the data it generates using Examine.

The problem is there is a catch, if you’re using the CSV storage (which you must if you’re working with Examine) you’ll hit a problem, the Examine index will have something like this:

1011,1231,1232,1225

But how do you search on that? Searching for ‘1231’ will not return anything, because it’s prefixed with ‘,’ and postfixed with ‘,’. So this brings a problem, how do you search?

Bring on Events

As Shannon spoke about at CodeGarden 10 Examine has a number of different events you can hook into to do different things (slides and code) and this is what we’re going to need to work with.

I’ve touched on events before but this time we’re going to look at a different event, we’re going to look at the GatheringNodeData event.

GatheringNodeData event

So this event in Examine is fired while Examine is scraping the data out of an XML element which it has received. This XML could be from Umbraco (in the scenario we’re looking at here) or it could be from your own data source, and the event is raised once Examine as turned the XML into a Key/ Value representation of it.

The event that raises has custom event arguments, which has a property called Fields. This Fields property is a dictionary which contains the full Key/ Value representation of the data which will end up in Examine!

Now this dictionary is able to be manipulated, so you can add/ remove data as you see if (but that’s a topic for another blog), it also means you can change the data!

Changing the data for our needs

As I mentioned at the start of this we end up with comma-separated string from the datatype which isn’t useful for searching, so we can use an event handler to change what we’ve got. First we need to tie an event handler

public class ExamineEvents : ApplicationBase 
{
	public ExamineEvents() 
	{
		var indexer = ExamineManager.Instance.IndexProviderCollection["MyIndexer"];
		indexer.GatheringNodeData += new EventHandler(GatheringNodeDataHandler);
	}

	void GatheringNodeDataHandler(object sender, IndexingNodeDataEventArgs e)
	{
		//do stuff here
	}
}

So this is just a simple wire-up, using the ApplicationBase class in Umbraco so that it’ll be created on application start-up. Next we need to implement the event handler:

void GatheringNodeDataHandler(object sender, IndexingNodeDataEventArgs e)
{
	//grab the current data from the Fields collection
	var mntp = e.Fields["TreePicker"];
	//let's get rid of those commas!
	mntp = mntp.Replace(",", " ");
	//now put it back into the Fields so we can pretend nothing happened!
	e.Fields["TreePicker"] = mntp;
}

And you’re done! Now the data will be written into the index with spaces rather than commas meaning that you can search on each ID without the need for wildcards or any other “hacks” to get it to work.

Note: This will work in the majority of cases, the only reason it’ll fail is if you’re using an analyzer that strips out numbers before indexing. For more information about Lucene analyzers take a look at this article: http://www.aaron-powell.com/lucene-analyzer

Text casing and Examine

Thu, 23 Mar 2023 15:08:09 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

A few times I’ve seen questions posted on the Umbraco forums which ask how to deal with case insensitivity text with Examine, and it’s also something that we’ve had to handle a few times within our own company.

Here’s a scenario:

You have a site search
You use examine
You want to show the results looking exactly the same as it was before it went into Examine

If you’re running a standard install you’ll notice that the content always ends up lowercased!

This is a bit of a problem, page titles will be lowercase, body content will be lowercase, etc. Part of this will be due to a mistake in Examine, part of it is due to the design of Lucene.

In this article I’ll have a look at what you need to do to make it work as you’d expect.

First, some background

Before we dive directly into what to do to fix it you really should understand what is happening. If you don’t care feel free to skip over this bit though :P.

Searching is a tricky thing, and when searching the statement Examine == examine = false; To get around this searching is best done in a case insensitive manner. To make this work Examine did a forced lowercase of the content before it was pushed into Lucene.Net. This was to ensure that everything was exactly the same when it was searched against.
In hindsight this is not really a great idea, it really should be the responsibility of the Lucene Analyzer to handle this for you.

Many of the common Lucene.Net analyzers actually do automatic lowercasing of content, these analysers are:

StandardAnalyzer
StopAnalyzer
SimpleAnalyzer

So if you’re using the standard Examine config you’ll find yourself using the StandardAnalyzer and still have your content lowercased.

This means that there’s no need to Lucene to concern itself about case sensitivity when searching, everything is parsed by the analyzer (field terms and queries) and you’ll get more matches.

So how do I get around this?

Now that we’ve seen why all your content is generally lower case, how can we work with it in the original format and display it back to the UI?

Well we need some way in which we can have the field data stored without the analyzer screwing around with it.

Note: This doesn’t need to be done if you’re using an analyzer which doesn’t have a LowerCaseTokenizer or LowercaseFilter. If you’re using a different analyzer, like KeywordAnalyzer then this post wont cover what you’re after (since the KeywordAnalyzer isn’t lowercasing, you’re actually using an out-dated version of Examine, I recommend you grab the latest release :)). More information on Analyzers can be found at http://www.aaron-powell.com/lucene-analyzer

Luckily we’ve got some hooks into Examine to allow us to do what we need here, it’s in the form of an event on the Examine.LuceneEngine.Providers.LuceneIndexer, called DocumentWriting. Note that this event is on the LuceneIndexer, not the BaseIndexProvider. This event is Lucene.Net specific and not logical on the base class which is agnostic of any other framework.

What we can do with this event is interact directly with Lucene.Net while Examine is working with it.
You’ll need to have a bit of an understanding of how to work with a Lucene.Net Document (and for that I’d recommend having a read of this article from me: http://www.aaron-powell.com/documents-in-lucene-net), cuz what you’re able to do is play with Lucene.Net… Feel the power!

So we can attach the event handler the same way as you would do any other event in Umbraco, using an Action Handler:

public class UmbracoEvents : ApplicationBase
{
	public UmbracoEvents()
        {
            var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["DefaultIndexer"];

            indexer.DocumentWriting +=new System.EventHandler(indexer_DocumentWriting);
        }
}

To do this we’ve got to cast the indexer so we’ve got the Lucene version to work with, then we’re attaching to our event handler. Let’s have a look at the event handler

void indexer_DocumentWriting(object sender, DocumentWritingEventArgs e)
{
	//grab out lucene document from the event arguments
	var doc = e.Document;

	//the e.Fields dictionary is all the fields which are about to be inserted into Lucene.Net
	//we'll grab out the "bodyContent" one, if there is one to be indexed
	if(e.Fields.ContainsKey("bodyContent")) 
	{
		string content = e.Fields["bodyContent"];
		//Give the field a name which you'll be able to easily remember
		//also, we're telling Lucene to just put this data in, nothing more
		doc.Add(new Field("__bodyContent", content, Field.Store.YES, Field.Index.NOT_ANALYZED));
	}
}

And that’s how you can push data in. I’d recommend that you do a conditional check to ensure that the property you’re looking for does exist in the Fields property of the event args, unless you’re 100% sure that it appears on all the objects which you’re indexing.

Lastly we need to display that on the UI, well it’s easy, rather accessing the bodyContent property of the SearchResults, use the __bodyContent and you’ll get your unanalyzed version.

Conclusion

Here we’ve looked at how we can use the Examine events to interact with the Lucene.Net Document. We’ve decided that we want to push in unanalyzed text, but you could use this idea to really tweak your Lucene.Net document. But really playing with the Document is not recommended unless you *really* know what you’re doing ;).

Searching Umbraco using Razor and Examine

Thu, 23 Mar 2023 15:08:09 Z

This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.

Since Razor is really just c# it’s super simple to run a search in Umbraco using Razor and Examine. In MVC the actual searching should be left up to the controller to give the search results to your view, but in Umbraco 4.6 + , Razor is used as macros which actually ‘do stuff’. Here’s how incredibly simple it is to do a search:

@using Examine;

@* Get the search term from query string *@
@{var searchTerm = Request.QueryString["search"];}

<ul class="search-results">
@foreach (var result in 
    ExamineManager.Instance.Search(searchTerm, true)) {    
    
    <li>
        <span>@result.Score</span>
        <a href="@umbraco.library.NiceUrl(result.Id)">
            @result.Fields["nodeName"]
        </a>        
    </li>    
    
}
</ul>

That’s it! Pretty darn easy.

And for all you sceptics who think there’s too much configuration involved to setup Examine, configuring Examine requires 3 lines of code. Yes its true, 3 lines, that’s it. Here’s the bare minimum setup:

1. Create an indexer under the ExamineIndexProviders section:

<add name="MyIndexer"
      type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"/>

2. Create a searcher under the ExamineSearchProviders section:

<add name="MySearcher"
      type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"/>

3. Create an index set under the ExamineLuceneIndexSets config section:

<IndexSet SetName="MyIndexSet" 
    IndexPath="~/App_Data/TEMP/MyIndex" />

This will index all of your data in Umbraco and allow you to search against all of it. If you want to search on specific subsets, you can use the FluentAPI to search and of course if you want to modify your index, there’s much more you can do with the config if you like.

With Examine the sky is the limit, you can have an incredibly simple index and search mechanism up to an incredibly complex index with event handlers, etc… and a very complex search with fuzzy logic, proximity searches, etc… And no matter what flavour you choose it is guaranteed to be VERY fast and doesn’t matter how much data you’re searching against.

I also STRONGLY advise you to use the latest release on CodePlex: http://examine.codeplex.com/releases/view/50781 . There will also be version 1.1 coming out very soon.

Enjoy!

Examine 1.5.1 released

Thu, 23 Mar 2023 15:08:08 Z

I’ve created a new release of Examine today, version 1.5.1. There’s nothing really new in this release, just a bunch of bug fixes. The other cool thing is that I’ve finally got Examine on Nuget now. The v1.5.1 release page is here on CodePlex with upgrade instructions… which is really just replacing the DLLs.

Its important to note that if you have installed Umbraco 6.0.1+ or 4.11.5+ then you already have Examine 1.5.0 installed (which isn’t an official release on the CodePlex page) which has 8 of these 10 bugs fixed already.

Bugs fixed

Here’s the full list of bugs fixed in this release:

UmbracoExamine

You may already know this but we’ve moved the UmbracoExamine libraries in to the core of Umbraco so that the Umbraco core team can better support the implementation. That means that only the basic Examine libraries will continue to exist @ examine.codeplex.com. The release of 1.5.1 only relates to the base Examine libraries, not the UmbracoExamine libraries, but that’s ok you can still upgrade these base libraries without issue.

Nuget

There’s 2 Examine projects up on Nuget, the basic Examine package and the Azure package if you wish to use Azure directory for your indexes.

Standard package:

PM> Install-Package Examine

Azure package:

PM> Install-Package Examine.Azure

Happy searching!

New Examine updates and features for Umbraco

Thu, 23 Mar 2023 15:08:08 Z

It’s been a long while since Examine got some much needed attention and I’m pleased to say it is now happening. If you didn’t know already, we’ve moved the Umbraco Examine source in to the core of Umbraco. The underlying Examine (Examine.dll) core will remain on CodePlex but all the Umbraco bits and pieces which is found in UmbracoExamine.dll are in the Umbraco core from version 6.1+. This is great news because now we can all better support the implementation of Examine for Umbraco. More good news is that even versions prior to Umbraco 6.1 will have some bugs fixed (http://issues.umbraco.org/issue/U4-1768) ! Niels Kuhnel has also jumped aboard the Examine train and is helping out a ton by adding his amazing ‘facet’ features which will probably make it into an Umbraco release around version 6.2 (maybe 6.1, but still need to do some review, etc… to make sure its 100% backwards compatible).

One other bit of cool news is that we’re adding an official Examine Management dashboard to Umbraco 6.1. In its present state it supports optimizing indexes, rebuilding indexes and searching them. I’ve created a quick video showing its features :)

Examine management dashboard for Umbraco