@Shazwazza

Shannon Deminick's blog all about web development

Spatial Search with Examine and Lucene

September 21, 2020 14:27

I was recently asked how to do Spatial search with Examine, which sparked my interest in how that should be done, so here's how it goes…

Examine’s default implementation is Lucene so by default whatever you can do in Lucene you can achieve in Examine by exposing the underlying Lucene bits. If you want to jump straight to code, I’ve created a couple of unit tests in the Examine project.

Source code as documentation

Lucene.Net and Lucene (Java) are more or less the same. There are a few API and naming convention differences, but at the end of the day Lucene.Net is just a .NET port of Lucene, so pretty much any documentation you'll find for Lucene will work with Lucene.Net with a bit of tweaking. The same goes for code snippets in the source code: both Lucene and Lucene.Net have tons of examples of how to do things. In fact, for Spatial search there's a specific test example for that.

So we ‘just’ need to take that example and go with it.

Firstly we’ll need the Lucene.Net.Contrib package:

Install-Package Lucene.Net.Contrib -Version 3.0.3

Indexing

The indexing part doesn't really need to do anything out of the ordinary from what you would normally do. You just need to get either latitude/longitude or x/y (numerical) values into your index. This can be done directly using a ValueSet when you index, with your field types set as numeric, or it can be done with the DocumentWriting event, which gives you direct access to the underlying Lucene document.
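For example, indexing an x/y (lat/lng) pair directly with a ValueSet might look something like this. This is a hedged sketch: "MyIndex", the field names and the id/category values are all made up, and the two fields would need to be registered as numeric types in the index's field definitions:

```csharp
// Sketch: push latitude/longitude into the index as numeric values.
// Assumes an IExamineManager instance 'examineMgr' is available.
if (examineMgr.TryGetIndex("MyIndex", out var index))
{
    index.IndexItem(new ValueSet(
        "1",            // unique item id
        "content",      // category
        new Dictionary<string, object>
        {
            ["latitude"] = -33.8688,    // Sydney
            ["longitude"] = 151.2093
        }));
}
```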

Strategies

For this example I'm just going to stick with simple Geo Spatial searching using simple x/y coordinates. There are different "strategies" and you can configure these to handle different types of spatial search when it's not just as simple as an x/y distance calculation. I was shown an example of a Spatial search that used the "PointVectorStrategy", but after looking into that it seems like it is a semi-deprecated strategy, and even one of its methods says: "//TODO this is basically old code that hasn't been verified well and should probably be removed". I then found an SO article stating that "RecursivePrefixTreeStrategy" should be used instead anyway, and as it turns out that's exactly what the Java example uses too.

If you need more advanced Spatial searching then I'd suggest researching some of the strategies available, reading the docs and looking at the source examples. There are unit tests for pretty much everything in Lucene and Lucene.Net.

Get the underlying Lucene Searcher instance

If you need to do some interesting Lucene things with Examine, you need to gain access to the underlying Lucene bits. Normally you'd only need access to the IndexWriter, which you can get from LuceneIndex.GetIndexWriter(), and the Lucene Searcher, which you can get from LuceneSearcher.GetLuceneSearcher().

// Get an index from the IExamineManager
if (!examineMgr.TryGetIndex("MyIndex", out var index))
    throw new InvalidOperationException("No index found with name MyIndex");
            
// We are expecting this to be a LuceneIndex
if (!(index is LuceneIndex luceneIndex))
    throw new InvalidOperationException("Index MyIndex is not a LuceneIndex");

// If you wanted a LuceneWriter, here's how:
//var luceneWriter = luceneIndex.GetIndexWriter();

// Need to cast in order to expose the Lucene bits
var searcher = (LuceneSearcher)luceneIndex.GetSearcher();

// Get the underlying Lucene Searcher instance
var luceneSearcher = searcher.GetLuceneSearcher();

Do the search

Important! Latitude/Longitude != X/Y

The Lucene Geo Spatial APIs take X/Y coordinates, not latitude/longitude. A common mistake is to just use them in place, but that's incorrect: they are actually opposite, so be sure you swap them. Latitude = Y, Longitude = X. Here's a simple function to swap them:

private void GetXYFromCoords(double lat, double lng, out double x, out double y)
{
    // change to x/y coords, longitude = x, latitude = y
    x = lng;
    y = lat;
}

Now that we have the underlying Lucene Searcher instance we can search however we want:

// Create the Geo Spatial lucene objects
SpatialContext ctx = SpatialContext.GEO;
int maxLevels = 11; //results in sub-meter precision for geohash
SpatialPrefixTree grid = new GeohashPrefixTree(ctx, maxLevels);
RecursivePrefixTreeStrategy strategy = new RecursivePrefixTreeStrategy(grid, GeoLocationFieldName);

// lat/lng of Sydney Australia
var latitudeSydney = -33.8688;
var longitudeSydney = 151.2093;
            
// search within 100 KM
var searchRadiusInKm = 100;

// convert to X/Y
GetXYFromCoords(latitudeSydney, longitudeSydney, out var x, out var y);

// Make a circle around the search point
var args = new SpatialArgs(
    SpatialOperation.Intersects,
    ctx.MakeCircle(x, y, DistanceUtils.Dist2Degrees(searchRadiusInKm, DistanceUtils.EARTH_MEAN_RADIUS_KM)));

// Create the Lucene Filter
var filter = strategy.MakeFilter(args);

// Create the Lucene Query
var query = strategy.MakeQuery(args);

// sort on ID
Sort idSort = new Sort(new SortField(LuceneIndex.ItemIdFieldName, SortField.INT));
TopDocs docs = luceneSearcher.Search(query, filter, MaxResultDocs, idSort);

// iterate raw lucene results
foreach(var doc in docs.ScoreDocs)
{
    // TODO: Do something with result
}

Filter vs Query?

The above code creates both a Filter and a Query that are used to get the results, but the SpatialExample just uses a "MatchAllDocsQuery" instead of what is done above. Both return the same results, so what is "strategy.MakeQuery" doing? It's creating a ConstantScoreQuery, which means that the resulting document "Score" will be the same for all results. That's really all it does, so it's optional; when searching on only locations with no other data, Score doesn't make a ton of sense anyway. It is possible, however, to mix Spatial search filters with real queries.
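For comparison, the SpatialExample variant looks like this, reusing the filter, result limit and sort from above:

```csharp
// Same results as above: a MatchAllDocsQuery combined with the spatial
// filter, since the spatial query only contributes a constant score anyway.
TopDocs docs = luceneSearcher.Search(
    new MatchAllDocsQuery(),
    strategy.MakeFilter(args),
    MaxResultDocs,
    idSort);
```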

Next steps

You'll see above that the ordering is by Id, but in a lot of cases you'll probably want to sort by distance. There are examples of this in the Lucene SpatialExample linked above, and there's a reference to that in this SO article too; the only problem is those examples are for later Lucene versions than the current Lucene.Net 3.x. But where there's a will there's a way, and I'm sure with some Googling, code researching and testing you'll be able to figure it out :)

The Examine docs pages need a little love and should probably include this info. The docs pages are just built in Jekyll and located in the /docs folder of the Examine repository. I would love any help with Examine’s docs if you’ve got a bit of time :)

As far as Examine goes though, there's actually a custom search method called "LuceneQuery" on "LuceneSearchQueryBase", which is the object created when creating normal Examine queries with CreateQuery(). Using this method you can pass in a native Lucene Query instance like the one created above, and it will manage all of the searching/paging/sorting/results/etc… for you, so you don't have to do some of the above manual work. However there is currently no method allowing a native Lucene Filter instance to be passed in like the one created above. Once that's in place, some of the Lucene APIs above won't be needed and this can be a bit nicer. Then it's probably worthwhile adding another Nuget project like Examine.Extensions which can contain methods and functionality for this stuff, or maybe the community can do something like that just like Callum has done for Examine Facets. What do you think?
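To illustrate, here's a rough sketch of that API. This is hedged: the exact LuceneQuery signature and casts may differ between Examine versions, and 'query' and 'luceneIndex' are the variables from the spatial example above:

```csharp
// Sketch only: pass the native spatial Lucene Query built earlier into
// Examine's fluent pipeline and let Examine handle searching/paging/results.
var searchQuery = (LuceneSearchQueryBase)luceneIndex.GetSearcher().CreateQuery();
var results = searchQuery.LuceneQuery(query).Execute();

foreach (var result in results)
{
    // result.Id, result.Score and result.Values are populated by Examine
}
```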

Searching with IPublishedContentQuery in Umbraco

July 31, 2020 04:13

I recently realized that I don't think Umbraco's IPublishedContentQuery APIs are documented, so hopefully this post may inspire some docs to be written, or at least guide some folks on functionality they may not know about.

A long while back, even in Umbraco v7, UmbracoHelper was split into different components which UmbracoHelper just wrapped. One of these components was called ITypedPublishedContentQuery, now called IPublishedContentQuery in v8, and this component is responsible for executing queries for content and media on the front-end in razor templates. In v8 a lot of methods were removed or obsoleted from UmbracoHelper so that it isn't one gigantic object, and it tries to steer developers to use these sub-components directly instead. For example, if you try to access UmbracoHelper.ContentQuery you'll see that it has been deprecated, saying:

Inject and use an instance of IPublishedContentQuery in the constructor for using it in classes or get it from Current.PublishedContentQuery in views

and the UmbracoHelper.Search methods from v7 have been removed and now only exist on IPublishedContentQuery.

There are API docs for IPublishedContentQuery which are a bit helpful; at least they will tell you what all the available methods and parameters are. The main ones I wanted to point out are the Search methods.

Strongly typed search responses

When you use Examine directly to search, you get an Examine ISearchResults object back, which is more or less raw data. It's possible to work with that data, but most people want to work with strongly typed data, at the very least in Umbraco with IPublishedContent. That is pretty much what the IPublishedContentQuery.Search methods solve. Each of these methods returns an IEnumerable<PublishedSearchResult>, and each PublishedSearchResult contains an IPublishedContent instance along with a Score value. A quick example in razor:

@inherits Umbraco.Web.Mvc.UmbracoViewPage
@using Current = Umbraco.Web.Composing.Current;
@{
    var search = Current.PublishedContentQuery.Search(Request.QueryString["query"]);
}

<div>
    <h3>Search Results</h3>
    <ul>
        @foreach (var result in search)
        {
            <li>
                Id: @result.Content.Id
                <br/>
                Name: @result.Content.Name
                <br />
                Score: @result.Score
            </li>
        }
    </ul>
</div>

The ordering of this search is by Score, so the highest score comes first. This makes searching very easy while the underlying mechanism is still Examine; the IPublishedContentQuery.Search methods just make working with the results a bit nicer.

Paging results

You may have noticed that there are a few overloads and optional parameters for these search methods too. Two of the overloads support paging parameters, and these take care of all of the quirks with Lucene paging for you. I wrote a previous post about paging with Examine; you need to make sure you do that correctly, or else you'll end up iterating over possibly tons of search results, which can cause performance problems. Expanding on the above example with paging is super easy:

@inherits Umbraco.Web.Mvc.UmbracoViewPage
@using Current = Umbraco.Web.Composing.Current;
@{
    var pageSize = 10;
    // default to the first page if no valid page param is provided
    var pageIndex = int.TryParse(Request.QueryString["page"], out var p) ? p : 0;
    var search = Current.PublishedContentQuery.Search(
        Request.QueryString["query"],
        pageIndex * pageSize,   // skip
        pageSize,               // take
        out var totalRecords);
}

<div>
    <h3>Search Results</h3>
    <ul>
        @foreach (var result in search)
        {
            <li>
                Id: @result.Content.Id
                <br/>
                Name: @result.Content.Name
                <br />
                Score: @result.Score
            </li>
        }
    </ul>
</div>

Simple search with cultures

Another optional parameter you might have noticed is the culture parameter. The docs state this about the culture parameter:

When the culture is not specified or is *, all cultures are searched. To search for only invariant documents and fields use null. When searching on a specific culture, all culture specific fields are searched for the provided culture and all invariant fields for all documents. While enumerating results, the ambient culture is changed to be the searched culture.

What this is saying is that if you aren't using culture variants in Umbraco, don't worry about it. But if you are, you generally won't have to worry about it either! What?! By default the simple Search method will use the "ambient" (aka 'current') culture to search and return data. So if you are currently browsing your "fr-FR" culture site, this method will automatically search only your French culture data, but it will also search any invariant (non-culture) data. And as a bonus, the IPublishedContent returned also uses this ambient culture, so any values you retrieve from the content item without specifying a culture will just be for the ambient/default culture.

So why is there a “culture” parameter? It’s just there in case you want to search on a specific culture instead of relying on the ambient/current one.
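For example, a hedged sketch of that ("fr-FR" is just an illustrative culture code):

```csharp
@{
    // Search the French culture explicitly instead of relying on the
    // ambient/current culture of the request
    var frenchResults = Current.PublishedContentQuery.Search(
        Request.QueryString["query"],
        culture: "fr-FR");
}
```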

Search with IQueryExecutor

IQueryExecutor is the resulting object created when creating a query with the Examine fluent API. This means you can build up any complex Examine query you want, even with raw Lucene, and then pass this query to one of the IPublishedContentQuery.Search overloads and you’ll get all the goodness of the above queries. There’s also paging overloads with IQueryExecutor too. To further expand on the above example:

@inherits Umbraco.Web.Mvc.UmbracoViewPage
@using Current = Umbraco.Web.Composing.Current;
@{
    // Get the external index with error checking
    // Get the external index with error checking
    if (!ExamineManager.Instance.TryGetIndex(
        Constants.UmbracoIndexes.ExternalIndexName, out var index))
    {
        throw new InvalidOperationException(
            $"No index found with name {Constants.UmbracoIndexes.ExternalIndexName}");
    }

    // build an Examine query
    var query = index.GetSearcher().CreateQuery()
        .GroupedOr(new [] { "pageTitle", "pageContent"},
            Request.QueryString["query"].MultipleCharacterWildcard());


    var pageSize = 10;
    // default to the first page if no valid page param is provided
    var pageIndex = int.TryParse(Request.QueryString["page"], out var p) ? p : 0;
    var search = Current.PublishedContentQuery.Search(
        query,                  // pass the examine query in!
        pageIndex * pageSize,   // skip
        pageSize,               // take
        out var totalRecords);
}

<div>
    <h3>Search Results</h3>
    <ul>
        @foreach (var result in search)
        {
            <li>
                Id: @result.Content.Id
                <br/>
                Name: @result.Content.Name
                <br />
                Score: @result.Score
            </li>
        }
    </ul>
</div>

The base interface of the fluent parts of Examine's queries is IQueryExecutor, so you can just pass your query to the method and it will work.

Recap

The IPublishedContentQuery.Search overloads are listed in the API docs, they are:

  • Search(String term, String culture, String indexName)
  • Search(String term, Int32 skip, Int32 take, out Int64 totalRecords, String culture, String indexName)
  • Search(IQueryExecutor query)
  • Search(IQueryExecutor query, Int32 skip, Int32 take, out Int64 totalRecords)

Should you always use this instead of using Examine directly? As always, it just depends on what you are doing. If you need a ton of flexibility with your search results then maybe you want to use Examine's search results directly, but if you want simple and quick access to IPublishedContent results, these methods work great.

Does this all work with ExamineX? Absolutely! One of the best parts of ExamineX is that it's completely seamless. ExamineX is just an index implementation for Examine itself, so all Examine APIs, and therefore all Umbraco APIs that use Examine, will 'just work'.

Filtering fields dynamically with Examine

July 6, 2020 04:05

The index fields created by Umbraco in Examine by default can lead to quite a substantial number of fields. This is primarily due to how Umbraco handles variant/culture data, because it creates a different field per culture, but there are other factors as well. Umbraco creates a "__Raw_" field for each rich text field, and if you use the grid it creates different fields for each grid row type. There are good reasons for all of these fields, and by default this gives you the most flexibility when querying and retrieving your data from the Examine indexes. But in some cases these default fields can be problematic. Examine by default uses Lucene as its indexing engine, and Lucene itself doesn't have any hard limits on field count (as far as I know); however, if you swap the indexing engine in Examine to something else, like Azure Search with ExamineX, you may find your indexes exceed Azure Search's limits.

Azure Search field count limits

Azure Search has varying limits for field counts based on the service tier you have (strangely, the Free tier allows more fields than the Basic tier). The absolute maximum however is 1000 fields, and although that might seem like quite a lot, when you take into account all of the fields created by Umbraco you might realize it's not that difficult to exceed this limit. As an example, let's say you have an Umbraco site using language variants and you have 20 languages in use. Then let's say you have 15 document types, each with 5 fields (all with unique aliases), each field is variant, and you have content created for each of these document types and languages. This immediately means you are exceeding the field count limit: 20 x 15 x 5 = 1500 fields! And that's not including the "__Raw_" fields, the extra grid fields, or the required system fields like "id" and "nodeName". I'm unsure why Azure Search even has this restriction in place.

Why is Umbraco creating a field per culture?

When v8 was being developed, a choice had to be made about how to handle multi-lingual data in Examine/Lucene. There are a couple of factors to consider in making this decision, which mostly boil down to how Lucene's analyzers work. The choice is either language per field or language per index. Some folks might think: can't we 'just' have a language per document? Unfortunately the answer is no, because that would require applying a specific language analyzer to that document, and then scoring would no longer work between documents. Elastic Search has a good write-up about this. So it's either language per field or a different index per language. Each has pros/cons, but Umbraco went with language per field since it's quite easy to set up, supports different analyzers per language, and doesn't require a ton of indexes, which would also incur a lot more overhead and configuration.

Do I need all of these fields?

That really depends on what you are searching on, but the answer is most likely 'no'. You probably aren't going to be searching on over 1000 fields, but who knows, every site's requirements are different. Umbraco Examine has something called an IValueSetValidator which you can configure to include/exclude certain fields or document types. This is synonymous with part of the old XML configuration in Examine. This is one of those things where configuration can make sense for Examine, and @callumwhyte has done exactly that with his package "Umbraco Examine Config". But the IValueSetValidator isn't all that flexible and works based on exact naming, which works great for filtering content types but perhaps not field names. (Side note: I'm unsure if the Umbraco Examine Config package will work alongside ExamineX; need to test that out.)

Since Umbraco creates fields with the same prefixed names for all languages, it's relatively easy to filter the fields based on a matching prefix for the fields you want to keep.

Here’s some code!

The following code is relatively straightforward, with inline comments: a custom class "IndexFieldFilter" that does the filtering and can be applied to any index by name, a Component to apply the filtering, and a Composer to register services. This code also ensures that all Umbraco required fields are retained, so anything that Umbraco relies upon will still work.

/// <summary>
/// Register services
/// </summary>
public class MyComposer : ComponentComposer<MyComponent>
{
    public override void Compose(Composition composition)
    {
        base.Compose(composition);
        composition.RegisterUnique<IndexFieldFilter>();
    }
}

public class MyComponent : IComponent
{
    private readonly IndexFieldFilter _indexFieldFilter;

    public MyComponent(IndexFieldFilter indexFieldFilter)
    {
        _indexFieldFilter = indexFieldFilter;
    }

    public void Initialize()
    {
        // Apply an index field filter to an index
        _indexFieldFilter.ApplyFilter(
            // Filter the external index 
            Umbraco.Core.Constants.UmbracoIndexes.ExternalIndexName, 
            // Ensure fields with this prefix are retained
            new[] { "description", "title" },
            // optional: only keep data for these content types, else keep all
            new[] { "home" });
    }

    public void Terminate() => _indexFieldFilter.Dispose();
}

/// <summary>
/// Used to filter out fields from an index
/// </summary>
public class IndexFieldFilter : IDisposable
{
    private readonly IExamineManager _examineManager;
    private readonly IUmbracoTreeSearcherFields _umbracoTreeSearcherFields;
    private ConcurrentDictionary<string, (string[] internalFields, string[] fieldPrefixes, string[] contentTypes)> _fieldNames
        = new ConcurrentDictionary<string, (string[], string[], string[])>();
    private bool disposedValue;

    /// <summary>
    /// Constructor
    /// </summary>
    /// <param name="examineManager"></param>
    /// <param name="umbracoTreeSearcherFields"></param>
    public IndexFieldFilter(
        IExamineManager examineManager,
        IUmbracoTreeSearcherFields umbracoTreeSearcherFields)
    {
        _examineManager = examineManager;
        _umbracoTreeSearcherFields = umbracoTreeSearcherFields;
    }

    /// <summary>
    /// Apply a filter to the specified index
    /// </summary>
    /// <param name="indexName"></param>
    /// <param name="includefieldNamePrefixes">
    /// Retain all fields prefixed with these names
    /// </param>
    public void ApplyFilter(
        string indexName,
        string[] includefieldNamePrefixes,
        string[] includeContentTypes = null)
    {
        if (_examineManager.TryGetIndex(indexName, out var e)
            && e is BaseIndexProvider index)
        {
            // gather all internal index field names used by Umbraco
            // to ensure they are retained
            var internalFields = new[]
                {
                    LuceneIndex.CategoryFieldName,
                    LuceneIndex.ItemIdFieldName,
                    LuceneIndex.ItemTypeFieldName,
                    UmbracoExamineIndex.IconFieldName,
                    UmbracoExamineIndex.IndexPathFieldName,
                    UmbracoExamineIndex.NodeKeyFieldName,
                    UmbracoExamineIndex.PublishedFieldName,
                    UmbracoExamineIndex.UmbracoFileFieldName,
                    "nodeName"
                }
                .Union(_umbracoTreeSearcherFields.GetBackOfficeFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeDocumentFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeMediaFields())
                .Union(_umbracoTreeSearcherFields.GetBackOfficeMembersFields())
                .ToArray();

            _fieldNames.TryAdd(indexName, (internalFields, includefieldNamePrefixes, includeContentTypes ?? Array.Empty<string>()));

            // Bind to the event to filter the fields
            index.TransformingIndexValues += TransformingIndexValues;
        }
        else
        {
            throw new InvalidOperationException(
                $"No index with name {indexName} found that is of type {typeof(BaseIndexProvider)}");
        }
    }

    private void TransformingIndexValues(object sender, IndexingItemEventArgs e)
    {
        if (_fieldNames.TryGetValue(e.Index.Name, out var fields))
        {
            // check if we should ignore this doc by content type
            if (fields.contentTypes.Length > 0 && !fields.contentTypes.Contains(e.ValueSet.ItemType))
            {
                e.Cancel = true;
            }
            else
            {
                // filter the fields
                e.ValueSet.Values.RemoveAll(x =>
                {
                    if (fields.internalFields.Contains(x.Key)) return false;
                    if (fields.fieldPrefixes.Any(f => x.Key.StartsWith(f))) return false;
                    return true;
                });
            }
        }
    }

    protected virtual void Dispose(bool disposing)
    {
        if (!disposedValue)
        {
            if (disposing)
            {
                // Unbind from the event for any bound indexes
                foreach (var key in _fieldNames.Keys)
                {
                    if (_examineManager.TryGetIndex(key, out var e)
                        && e is BaseIndexProvider index)
                    {
                        index.TransformingIndexValues -= TransformingIndexValues;
                    }
                }
            }
            disposedValue = true;
        }
    }

    public void Dispose()
    {
        Dispose(disposing: true);
        GC.SuppressFinalize(this);
    }
}

That should give you the tools you need to dynamically filter your index based on fields and content types if you need to get your field counts down. This is also handy even if you aren't using ExamineX and Azure Search, since keeping the index size down and storing less data means fewer IO operations and less storage.

Examine and Azure Blob Storage

February 11, 2020 04:52

Quite some time ago - probably close to 2 years - I created an alpha version of an extension library for Examine to allow storing Lucene indexes in Blob Storage, called Examine.AzureDirectory. This idea isn't new at all; in fact there's been a library to do this for many years called AzureDirectory, but it previously had issues and it wasn't clear exactly what its limitations were. The Examine.AzureDirectory implementation was built using a lot of the original AzureDirectory code but has a bunch of fixes (which I contributed back to the project) and different ways of working with the data. Also, since Examine 0.1.90 still worked with Lucene 2.x, this made it compatible with the older Lucene version.

… And 2 years later, I’ve actually released a real version 🎉

Why is this needed?

There are a couple of reasons. Firstly, Azure web app storage runs on a network share, and Lucene absolutely does not like its files hosted on a network share: this brings all sorts of strange performance issues, among other things. The way AzureDirectory works is to store the 'master' index in Blob Storage and then sync the required Lucene files to the local 'fast drive'. In Azure web apps there are two drives: the 'slow drive' (the network share) and the 'fast drive', which is the local server's temp storage with limited space. By syncing the Lucene files to the local fast drive, Lucene is no longer operating over a network share. When writes occur, it writes back to the local fast drive and then pushes those changes back to the master index in Blob Storage. This isn't the only way to overcome this limitation of Lucene; in fact Examine has shipped a workaround for many years which does more or less the same thing, but instead of storing the master index in Blob Storage, the master index is just stored on the 'slow drive'. Someone has actually taken this code and made a separate standalone project with this logic called SyncDirectory, which is pretty cool!

Load balancing/Scaling

There are a couple of ways to work around the network share storage in Azure web apps (as above), but in my opinion the main reason this is important is for load balancing and being able to scale out. Since Lucene doesn't work well over a network share, Lucene files must exist local to the process they're running in. That means that when you are load balancing or scaling out, each server that handles requests will have its own local Lucene index. So what happens when you scale out further and a new worker goes online? That really depends on the hosting application… for example in Umbraco, the new worker will create its own local indexes by rebuilding them from the source data (i.e. the database). This isn't an ideal scenario, especially in Umbraco v7 where requests won't be served until the index is built and ready. A better scenario is that the new worker comes online and then syncs an existing index from master storage that is shared between all workers… yes! Like Blob Storage.

Read/Write vs Read only

Lucene can't be written to concurrently by multiple processes. There are some workarounds here and there that try to achieve this by synchronizing processes with named mutex/semaphore locks, and even AzureDirectory tries to handle some of this by utilizing Blob Storage leases, but it's not a seamless experience. This is one of the reasons why Umbraco requires a 'master' web app for writing and a separate web app for scaling, which guarantees that only one process writes to the indexes. This is the setup that Examine.AzureDirectory supports too: on the front-end/replica/slave web app that scales, you configure the provider to be read-only, which guarantees it will never try to write back to the (probably locked) Blob Storage.

With this in place, when a new front-end worker goes online it doesn't need to rebuild its own local indexes; it will just check that indexes exist, and to do that it makes sure the master index is there, then continues booting. At this stage there's almost no performance overhead. Nothing actually happens with the local indexes until the index is referenced by the worker, and when that happens Examine will lazily sync just the Lucene files it needs locally.

How do I get it?

The first thing to point out is that this release is only for Examine 0.1.90, which is for Umbraco v7. Support for Examine 1.x and Umbraco 8.x will come out very soon, with slightly different install instructions.

The release notes of this are here, the install docs are here, and the Nuget package for this can be found here.

PM> Install-Package Examine.AzureDirectory -Version 0.1.90

To activate it, you need to add these settings to your web.config

<add key="examine:AzureStorageConnString" value="YOUR-STORAGE-CONNECTION-STRING" />
<add key="examine:AzureStorageContainer" value="YOUR-CONTAINER-NAME" />

Then for your master server/web app you’ll want to add a directoryFactory attribute to each of your indexers in ExamineSettings.config, for example:

<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
      supportUnpublished="true"
      supportProtected="true"
      directoryFactory="Examine.AzureDirectory.AzureDirectoryFactory, Examine.AzureDirectory"
      analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

For your front-end/replica/slave server you'll want the read-only version of the directoryFactory instead, like:

<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
      supportUnpublished="true"
      supportProtected="true"
      directoryFactory="Examine.AzureDirectory.ReadOnlyAzureDirectoryFactory, Examine.AzureDirectory"
      analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

Does it work?

Great question :) With the testing I've done it works, and I've had it running on this site for all of last year without issue, but I haven't rigorously tested it at scale with high-traffic sites, etc… I decided to release a real version because having it as an alpha/proof of concept means that nobody will test or use it. So now hopefully a few of you will give this a whirl and let everyone know how it goes. Any bugs can be submitted to the Examine repo.

 

 

Examine 1.5.1 released

April 5, 2013 19:59

I’ve created a new release of Examine today, version 1.5.1. There’s nothing really new in this release, just a bunch of bug fixes. The other cool thing is that I’ve finally got Examine on Nuget now. The v1.5.1 release page is here on CodePlex with upgrade instructions… which is really just replacing the DLLs.

It’s important to note that if you have installed Umbraco 6.0.1+ or 4.11.5+ then you already have Examine 1.5.0 installed (which isn’t an official release on the CodePlex page), which has 8 of these 10 bugs fixed already.

Bugs fixed

Here’s the full list of bugs fixed in this release:

UmbracoExamine

You may already know this, but we’ve moved the UmbracoExamine libraries into the core of Umbraco so that the Umbraco core team can better support the implementation. That means that only the basic Examine libraries will continue to exist @ examine.codeplex.com. The 1.5.1 release only relates to the base Examine libraries, not the UmbracoExamine libraries, but that’s OK: you can still upgrade these base libraries without issue.

Nuget

There are 2 Examine packages up on NuGet: the basic Examine package and the Azure package, if you wish to use an Azure directory for your indexes.

Standard package:

PM> Install-Package Examine

Azure package:

PM> Install-Package Examine.Azure

 

Happy searching!

New Examine updates and features for Umbraco

March 6, 2013 00:42

It’s been a long while since Examine got some much needed attention and I’m pleased to say it is now happening. If you didn’t know already, we’ve moved the Umbraco Examine source into the core of Umbraco. The underlying Examine core (Examine.dll) will remain on CodePlex, but all the Umbraco bits and pieces found in UmbracoExamine.dll are in the Umbraco core from version 6.1+. This is great news because now we can all better support the implementation of Examine for Umbraco. More good news: even versions prior to Umbraco 6.1 will have some bugs fixed (http://issues.umbraco.org/issue/U4-1768)! Niels Kuhnel has also jumped aboard the Examine train and is helping out a ton by adding his amazing ‘facet’ features, which will probably make it into an Umbraco release around version 6.2 (maybe 6.1, but we still need to do some review, etc… to make sure it’s 100% backwards compatible).

One other bit of cool news is that we’re adding an official Examine Management dashboard to Umbraco 6.1. In its present state it supports optimizing indexes, rebuilding indexes and searching them. I’ve created a quick video showing its features :)

Examine management dashboard for Umbraco

Ultra fast media performance in Umbraco

April 25, 2011 02:24

There are a few different ways to query Umbraco for media: using the new Media(int) API, using the umbraco.library.GetMedia(int, false) API, or querying for media with Examine. I suppose there are quite a few people out there that don’t use Examine yet and therefore don’t know that all of the media information is actually stored there too! The problem with the first 2 methods listed above is that they make database queries; the 2nd method is slightly better because it has built-in caching, but the Examine method is by far the fastest and most efficient.

The following table shows you the different caveats that each option has:

                    new Media(int)   library.GetMedia(int,false)   Examine
Makes DB calls      yes              yes                           no
Caches result       no               yes                           no
Real time data      yes              yes                           no

You might note that Examine doesn’t cache the result whereas the GetMedia call does, but don’t let this fool you, because the Examine searcher that returns the result will be nearly as fast as ‘in cache’ data but won’t require the additional memory that the GetMedia cache does. The other thing to note is that Examine doesn’t have real-time data. This means that if an administrator creates/saves a new media item it won’t show up in the Examine index instantaneously; instead it may take up to a minute to be ingested into the index. Lastly, it’s obvious that the new Media(int) API isn’t a very good way of accessing Umbraco media because it makes a few database calls per media item and also doesn’t cache the result.

Examine would be the ideal way to access your media if it were real time, so instead we’ll combine the efforts of the Examine and library.GetMedia(int,false) APIs: first we’ll check if Examine has the data, and if not, revert to the GetMedia API. The method below does this for us and returns a new object called MediaValues which simply contains a Name and a Values property:

First here’s the usage of the new API below:

var media = MediaHelper.GetUmbracoMedia(1234);
var mediaFile = media["umbracoFile"];

That’s a pretty easy way to access media. Now, here’s the code to make it work:

public static MediaValues GetUmbracoMedia(int id)
{
    //first check in Examine as this is WAY faster
    var criteria = ExamineManager.Instance
        .SearchProviderCollection["InternalSearcher"]
        .CreateSearchCriteria("media");
    var filter = criteria.Id(id);
    var results = ExamineManager.Instance
        .SearchProviderCollection["InternalSearcher"]
        .Search(filter.Compile());
    if (results.Any())
    {
        return new MediaValues(results.First());
    }

    //not found in the index, fall back to the (cached) GetMedia API
    var media = umbraco.library.GetMedia(id, false);
    if (media != null && media.Current != null)
    {
        media.MoveNext();
        return new MediaValues(media.Current);
    }

    return null;
}

The MediaValues class definition:

public class MediaValues
{
    public MediaValues(XPathNavigator xpath)
    {
        if (xpath == null) throw new ArgumentNullException("xpath");
        Name = xpath.GetAttribute("nodeName", "");
        Values = new Dictionary<string, string>();
        var result = xpath.SelectChildren(XPathNodeType.Element);
        while (result.MoveNext())
        {
            if (result.Current != null && !result.Current.HasAttributes)
            {
                Values.Add(result.Current.Name, result.Current.Value);
            }
        }
    }

    public MediaValues(SearchResult result)
    {
        if (result == null) throw new ArgumentNullException("result");
        Name = result.Fields["nodeName"];
        Values = result.Fields;
    }

    public string Name { get; private set; }

    public IDictionary<string, string> Values { get; private set; }
}

That’s it! Now you have the benefits of Examine’s ultra fast data access and real-time data in case it hasn’t made it into Examine’s index yet.

Searching Umbraco using Razor and Examine

March 15, 2011 21:51
This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.
Since Razor is really just C# it’s super simple to run a search in Umbraco using Razor and Examine. In MVC the actual searching should be left up to the controller, which gives the search results to your view, but in Umbraco 4.6+, Razor is used as macros which actually ‘do stuff’. Here’s how incredibly simple it is to do a search:
@using Examine;
@* Get the search term from query string *@
@{ var searchTerm = Request.QueryString["search"]; }
<ul class="search-results">
    @foreach (var result in ExamineManager.Instance.Search(searchTerm, true))
    {
        <li>
            <span>@result.Score</span>
            <a href="@umbraco.library.NiceUrl(result.Id)">
                @result.Fields["nodeName"]
            </a>
        </li>
    }
</ul>

That’s it! Pretty darn easy.

And for all you sceptics who think there’s too much configuration involved to set up Examine: configuring Examine requires 3 lines of config. Yes, it’s true, 3 lines, that’s it. Here’s the bare minimum setup:

1. Create an indexer under the ExamineIndexProviders section:

<add name="MyIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"/>

2. Create a searcher under the ExamineSearchProviders section:

<add name="MySearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"/>

3. Create an index set under the ExamineLuceneIndexSets config section:

<IndexSet SetName="MyIndexSet" IndexPath="~/App_Data/TEMP/MyIndex" />

This will index all of your data in Umbraco and allow you to search against all of it. If you want to search on specific subsets, you can use the FluentAPI to search and of course if you want to modify your index, there’s much more you can do with the config if you like.
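To see how those three lines hang together, here’s a sketch of a minimal configuration, assuming the standard Umbraco config file layout (providers in ExamineSettings.config, index sets in ExamineIndex.config). The “My…” names and the index path are placeholders; by convention the indexer name is matched to its index set by the shared “My” prefix:

```xml
<!-- ExamineSettings.config (sketch; provider names are placeholders) -->
<Examine>
  <ExamineIndexProviders>
    <providers>
      <add name="MyIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"/>
    </providers>
  </ExamineIndexProviders>
  <ExamineSearchProviders defaultProvider="MySearcher">
    <providers>
      <add name="MySearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"/>
    </providers>
  </ExamineSearchProviders>
</Examine>
```

```xml
<!-- ExamineIndex.config (sketch) -->
<ExamineLuceneIndexSets>
  <IndexSet SetName="MyIndexSet" IndexPath="~/App_Data/TEMP/MyIndex"/>
</ExamineLuceneIndexSets>
```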

With Examine the sky is the limit: you can have an incredibly simple index and search mechanism, or an incredibly complex index with event handlers, etc… and a very complex search with fuzzy logic, proximity searches, etc… And no matter what flavour you choose it is guaranteed to be VERY fast, no matter how much data you’re searching against.

I also STRONGLY advise you to use the latest release on CodePlex: http://examine.codeplex.com/releases/view/50781 . There will also be version 1.1 coming out very soon.

Enjoy!

Examine output indexing

November 2, 2010 07:39
This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.
Last week Pete Gregory (@pgregorynz) and I were discussing different implementations of Examine. Particularly when you need to use Examine events to collate information from different nodes to put into the index for the page being rendered. An example of this is an FAQ engine where you might have an Umbraco content structure such as:
  • Site Container
    • Public
      • FAQs
        • FAQ Item 1
        • FAQ Item 2
        • FAQ Item 3

In this example, the page that is rendered to the end user is FAQs, but the data from all 4 nodes (FAQs plus FAQ Items 1–3) needs to be added to the index for the FAQs page. To do this you can use Examine events: either the GatheringNodeData event of the BaseIndexProvider, or the DocumentWriting event of the UmbracoContentIndexer (I’ll write another post covering the difference between these two events and why they both exist). Though writing Examine event handlers to put the data from the FAQ items into the FAQs index isn’t very difficult, it would still be really cool if all of this could be done automatically.
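As a rough sketch of the event approach: the GatheringNodeData event lets you add extra fields to a document before it’s indexed. In the snippet below, "FaqIndexer" and CollateChildFaqText are placeholders (the helper that walks the child FAQ nodes and concatenates their content is something you would write yourself), not real Examine members:

```csharp
// Sketch only: "FaqIndexer" and CollateChildFaqText are hypothetical placeholders.
var indexer = ExamineManager.Instance.IndexProviderCollection["FaqIndexer"];
indexer.GatheringNodeData += (sender, e) =>
{
    // Only augment the FAQs container page
    if (e.Fields.ContainsKey("nodeTypeAlias") && e.Fields["nodeTypeAlias"] == "FAQs")
    {
        // Pull the text of the child FAQ items into one extra searchable field
        e.Fields["faqItemsContent"] = CollateChildFaqText(e.NodeId);
    }
};
```

A search against the faqItemsContent field on the FAQs document would then match text that actually lives on the child nodes.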

Pete mentioned it would be cool if we could just index the output html of a page (sort of like Google) and suddenly the ideas started to flow. This concept is actually quite easy to do so within the next month or so we’ll probably release a beta of Examine Output Indexing. Here’s the way it’ll all get put together:

  • An HttpModule will be created to do 2 things:
    • Check if the current request is an Umbraco page request
      • If it is, we can easily get the current node being rendered since it’s already been added to the HttpContext items by Umbraco
      • Use the standard Examine handlers to enter the node’s data into the indexes based on the configuration you’ve specified in your Examine configuration files
    • Get the HTML output of the page before it is rendered to the end user, parse the html to get the relevant data and put it into the index for the current Umbraco page
  • We figured that it would also be cool to have an Examine node property that developers could define, called something like examineNoIndex, which we could check for when we determine that it’s an Umbraco page; if this property is set to true, we won’t index that page.
    • This could give developers more control over which specific pages shouldn’t be indexed directly from the CMS properties instead of writing custom events

With the above, a developer will simply need to add the HttpModule to their web.config, define an Examine index based on a new provider we create, and that’s it. There will be no need to manually collate node data as in the FAQ example above. However, please note that this will work for straightforward searching, so if you have complex searching & indexing requirements I would still recommend using events, since you have far more control over what information is indexed.

Any feedback is much appreciated since we haven’t started developing this quite yet.

Examine v1.0 RTM

October 22, 2010 21:46
This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.
We finally released Examine version 1.0 a week or so ago. You can find the latest download package from the CodePlex downloads page for Examine: http://examine.codeplex.com/releases/view/50781 

Here’s what you’ll need to know

  • There are some breaking changes from the version that is shipped with Umbraco 4.5 and also from the Examine RC3 release. The downloads tab on CodePlex contains the Release Notes for download which contains all of the information on upgrading & breaking changes
  • READ THE RELEASE NOTES BEFORE UPGRADING
  • There’s a ton of bugs fixed in this release from the version shipped with Umbraco 4.5
  • Lots of new features have been added:
    • Indexing ANY type of data easily using the LuceneEngine index/search providers
    • PDF Indexing for Umbraco
    • XSLT extensions for Umbraco
    • Data Type declarations for indexed fields
    • Date & Number range searching
  • New documentation has been added to CodePlex

Using v1.0 RTM on Umbraco 4.5

The upgrade process from the Examine version shipped with 4.5 to v1.0 RTM should be pretty seamless (unless you are using some specific API calls as noted in the release notes). However, once you drop in the new DLLs you’ll probably notice that the internal search no longer works. This is due to a bug in the Umbraco 4.5 codebase and a non-optimal implementation of Examine, which has to do with case sensitivity for application aliases (i.e. Content vs content). The work-around is simple though: all we need to do is change the Analyzer used for the internal searcher in the Examine configuration file to use the StandardAnalyzer instead of the WhitespaceAnalyzer. This is because the WhitespaceAnalyzer is case sensitive whereas the StandardAnalyzer is not. This issue is fixed in Umbraco Juno (4.6), which will continue to use the WhitespaceAnalyzer so that Examine doesn’t tokenize strings that contain punctuation. For more info on Analyzers, have a look at Aaron’s post.
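As a sketch, the work-around is just swapping the analyzer attribute on the internal searcher entry in ExamineSettings.config (the provider name may differ in your install; everything else stays as-is):

```xml
<!-- Before: WhitespaceAnalyzer is case sensitive, so "Content" won't match "content" -->
<add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
     analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

<!-- After: StandardAnalyzer lowercases terms, sidestepping the alias casing bug -->
<add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
     analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>
```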

Next Versions

There probably won’t be too many more changes coming for Examine v1.0 apart from any bug fixing that needs to be done and maybe some tweaks to the Fluent API. We will start working on v2.0 at some point this year or early next year, which will take Examine to the next level. It will be less focused on configuration, have a smaller footprint and be much more configurable through code (similar to how ASP.NET MVC works).