@Shazwazza

Shannon Deminick's blog all about web development

Filtering fields dynamically with Examine

July 6, 2020 04:05

The index fields created by Umbraco in Examine by default can add up to quite a substantial number of fields. This is primarily due to how Umbraco handles variant/culture data, because it creates a different field per culture, but there are other factors as well: Umbraco creates a “__Raw_” field for each rich text field, and if you use the grid, it creates different fields for each grid row type. There are good reasons for all of these fields, and by default they give you the most flexibility when querying and retrieving your data from the Examine indexes. But in some cases these default fields can be problematic. Examine by default uses Lucene as its indexing engine and Lucene itself doesn’t have any hard limits on field count (as far as I know); however, if you swap the indexing engine in Examine to something else like Azure Search with ExamineX, then you may find your indexes exceed Azure Search’s limits.

Azure Search field count limits

Azure Search has varying limits for field counts based on the service tier you have (strangely, the Free tier allows more fields than the Basic tier). The absolute maximum, however, is 1000 fields, and although that might seem like quite a lot, once you take into account all of the fields created by Umbraco you may realize it’s not that difficult to exceed this limit. As an example, let’s say you have an Umbraco site using language variants with 20 languages in use. Then let’s say you have 15 document types, each with 5 fields (all with unique aliases), every field is variant, and you have content created for each of these document types in each language. This immediately exceeds the field count limit: 20 languages x 15 document types x 5 fields = 1500 fields! And that’s not including the “__Raw_” fields, the extra grid fields, or the required system fields like “id” and “nodeName”. I’m unsure why Azure Search even has this restriction in place.

Why is Umbraco creating a field per culture?

When v8 was being developed, a choice had to be made about how to handle multi-lingual data in Examine/Lucene. There are a couple of factors to consider in making this decision, which mostly boil down to how Lucene’s analyzers work. The choice is either: language per field or language per index. Some folks might think: can’t we ‘just’ have a language per document? Unfortunately the answer is no, because that would require applying a specific language analyzer for that document, and scoring would then no longer work between documents. Elasticsearch has a good write-up about this. So it’s either language per field or a different index per language. Each has pros and cons, but Umbraco went with language per field since it’s quite easy to set up, supports different analyzers per language and doesn’t require a ton of indexes, which would also incur a lot more overhead and configuration.
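
To make this concrete, each variant property becomes one field per culture in the index, with the culture appended to the field name. As an illustrative sketch (the aliases and values here are made up; the exact suffix format follows the property alias plus the culture code):

title_en-us: "Welcome"
title_da-dk: "Velkommen"
title_sv-se: "Välkommen"

With 20 languages, that single “title” property alone becomes 20 fields.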

Do I need all of these fields?

That really depends on what you are searching on, but the answer is most likely ‘no’. You probably aren’t going to be searching on over 1000 fields, but who knows, every site’s requirements are different. Umbraco Examine has something called an IValueSetValidator which you can configure to include/exclude certain fields or document types. This is synonymous with part of the old XML configuration in Examine. This is one of those things where configuration can make sense for Examine, and @callumwhyte has done exactly that with his package “Umbraco Examine Config”. But the IValueSetValidator isn’t all that flexible and works based on exact naming, which works great for filtering content types but perhaps not field names. (Side note: I’m unsure if the Umbraco Examine Config package will work alongside ExamineX; I need to test that out.)
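
For reference, here’s a minimal sketch of a custom IValueSetValidator (assuming the Examine 1.x interface, where Validate returns a ValueSetValidationResult; the class name here is hypothetical) that only lets whitelisted content types into an index:

using System.Linq;
using Examine;

public class ContentTypeFilterValidator : IValueSetValidator
{
    private readonly string[] _includeItemTypes;

    public ContentTypeFilterValidator(params string[] includeItemTypes)
        => _includeItemTypes = includeItemTypes;

    // Keep the value set only if its item type (content type alias) is in the allowed list
    public ValueSetValidationResult Validate(ValueSet valueSet)
        => _includeItemTypes.Contains(valueSet.ItemType)
            ? ValueSetValidationResult.Valid
            : ValueSetValidationResult.Failed;
}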

Since Umbraco creates fields with the same name prefix for all languages, it’s relatively easy to filter fields based on a matching prefix for the ones you want to keep.

Here’s some code!

The following code is relatively straightforward with inline comments: a custom class “IndexFieldFilter” that does the filtering and can be applied to any index by name, a Component to apply the filtering, and a Composer to register services. This code also ensures that all Umbraco required fields are retained, so anything that Umbraco relies upon will still work.

/// <summary>
/// Register services
/// </summary>
public class MyComposer : ComponentComposer<MyComponent>
{
    public override void Compose(Composition composition)
    {
        base.Compose(composition);
        composition.RegisterUnique<IndexFieldFilter>();
    }
}

public class MyComponent : IComponent
{
    private readonly IndexFieldFilter _indexFieldFilter;

    public MyComponent(IndexFieldFilter indexFieldFilter)
    {
        _indexFieldFilter = indexFieldFilter;
    }

    public void Initialize()
    {
        // Apply an index field filter to an index
        _indexFieldFilter.ApplyFilter(
            // Filter the external index 
            Umbraco.Core.Constants.UmbracoIndexes.ExternalIndexName, 
            // Ensure fields with these prefixes are retained
            new[] { "description", "title" },
            // optional: only keep data for these content types, else keep all
            new[] { "home" });
    }

    public void Terminate() => _indexFieldFilter.Dispose();
}

/// <summary>
/// Used to filter out fields from an index
/// </summary>
public class IndexFieldFilter : IDisposable
{
    private readonly IExamineManager _examineManager;
    private readonly IUmbracoTreeSearcherFields _umbracoTreeSearcherFields;
    private ConcurrentDictionary<string, (string[] internalFields, string[] fieldPrefixes, string[] contentTypes)> _fieldNames
        = new ConcurrentDictionary<string, (string[], string[], string[])>();
    private bool disposedValue;

    /// <summary>
    /// Constructor
    /// </summary>
    /// <param name="examineManager"></param>
    /// <param name="umbracoTreeSearcherFields"></param>
    public IndexFieldFilter(
        IExamineManager examineManager,
        IUmbracoTreeSearcherFields umbracoTreeSearcherFields)
    {
        _examineManager = examineManager;
        _umbracoTreeSearcherFields = umbracoTreeSearcherFields;
    }

    /// <summary>
    /// Apply a filter to the specified index
    /// </summary>
    /// <param name="indexName"></param>
    /// <param name="includefieldNamePrefixes">
    /// Retain all fields prefixed with these names
    /// </param>
    public void ApplyFilter(
        string indexName,
        string[] includeFieldNamePrefixes,
        string[] includeContentTypes = null)
    {
        if (_examineManager.TryGetIndex(indexName, out var e)
            && e is BaseIndexProvider index)
        {
            // gather all internal field names used by Umbraco
            // to ensure they are retained
            var internalFields = new[]
            {
                LuceneIndex.CategoryFieldName,
                LuceneIndex.ItemIdFieldName,
                LuceneIndex.ItemTypeFieldName,
                UmbracoExamineIndex.IconFieldName,
                UmbracoExamineIndex.IndexPathFieldName,
                UmbracoExamineIndex.NodeKeyFieldName,
                UmbracoExamineIndex.PublishedFieldName,
                UmbracoExamineIndex.UmbracoFileFieldName,
                "nodeName"
            }
            .Union(_umbracoTreeSearcherFields.GetBackOfficeFields())
            .Union(_umbracoTreeSearcherFields.GetBackOfficeDocumentFields())
            .Union(_umbracoTreeSearcherFields.GetBackOfficeMediaFields())
            .Union(_umbracoTreeSearcherFields.GetBackOfficeMembersFields())
            .ToArray();

            _fieldNames.TryAdd(indexName, (internalFields, includeFieldNamePrefixes, includeContentTypes ?? Array.Empty<string>()));

            // Bind to the event to filter the fields
            index.TransformingIndexValues += TransformingIndexValues;
        }
        else
        {
            throw new InvalidOperationException(
                $"No index with name {indexName} found that is of type {typeof(BaseIndexProvider)}");
        }
    }

    private void TransformingIndexValues(object sender, IndexingItemEventArgs e)
    {
        if (_fieldNames.TryGetValue(e.Index.Name, out var fields))
        {
            // check if we should ignore this doc by content type
            if (fields.contentTypes.Length > 0 && !fields.contentTypes.Contains(e.ValueSet.ItemType))
            {
                e.Cancel = true;
            }
            else
            {
                // filter the fields
                e.ValueSet.Values.RemoveAll(x =>
                {
                    if (fields.internalFields.Contains(x.Key)) return false;
                    if (fields.fieldPrefixes.Any(f => x.Key.StartsWith(f))) return false;
                    return true;
                });
            }
        }
    }

    protected virtual void Dispose(bool disposing)
    {
        if (!disposedValue)
        {
            if (disposing)
            {
                // Unbind from the event for any bound indexes
                foreach (var indexName in _fieldNames.Keys)
                {
                    if (_examineManager.TryGetIndex(indexName, out var e)
                        && e is BaseIndexProvider index)
                    {
                        index.TransformingIndexValues -= TransformingIndexValues;
                    }
                }
            }
            disposedValue = true;
        }
    }

    public void Dispose()
    {
        Dispose(disposing: true);
        GC.SuppressFinalize(this);
    }
}

That should give you the tools you need to dynamically filter your index based on fields and content types if you need to get your field counts down. This is also handy even if you aren’t using ExamineX and Azure Search, since keeping the index size down and storing less data means fewer IO operations and a smaller storage footprint.

Examine and Azure Blob Storage

February 11, 2020 04:52

Quite some time ago - probably close to 2 years - I created an alpha version of an extension library for Examine to allow storing Lucene indexes in Blob Storage, called Examine.AzureDirectory. This idea isn’t new at all; in fact there’s been a library to do this for many years called AzureDirectory, but it previously had issues and it wasn’t clear exactly what its limitations were. The Examine.AzureDirectory implementation was built using a lot of the original AzureDirectory code but has a bunch of fixes (which I contributed back to the project) and different ways of working with the data. Also, since Examine 0.1.90 still works with Lucene 2.x, this makes it compatible with the older Lucene version.

… And 2 years later, I’ve actually released a real version 🎉

Why is this needed?

There are a couple of reasons. Firstly, Azure web apps’ storage runs on a network share and Lucene absolutely does not like its files hosted on a network share; this brings all sorts of strange performance issues among other things. The way AzureDirectory works is to store the ‘master’ index in Blob Storage and then sync the required Lucene files to the local ‘fast drive’. In Azure web apps there are two drives: the ‘slow drive’ (the network share) and the ‘fast drive’, which is the local server’s temp storage with limited space. By syncing the Lucene files to the local fast drive, Lucene is no longer operating over a network share. When writes occur, they go to the local fast drive first and the changes are then pushed back to the master index in Blob Storage. This isn’t the only way to overcome this limitation of Lucene; in fact Examine has shipped a workaround for many years that does more or less the same thing, but instead of storing the master index in Blob Storage, the master index is just stored on the ‘slow drive’. Someone has actually taken this code and made a separate standalone project with this logic called SyncDirectory, which is pretty cool!

Load balancing/Scaling

There are a couple of ways to work around the network share storage in Azure web apps (as above), but in my opinion the main reason why this is important is load balancing and being able to scale out. Since Lucene doesn’t work well over a network share, Lucene files must exist local to the process they’re running in. That means that when you are load balancing or scaling out, each server handling requests has its own local Lucene index. So what happens when you scale out further and a new worker comes online? That really depends on the hosting application… for example in Umbraco, the new worker will create its own local indexes by rebuilding them from the source data (i.e. the database). This isn’t an ideal scenario, especially in Umbraco v7 where requests won’t be served until the index is built and ready. A better scenario is that the new worker comes online and then syncs an existing index from master storage that is shared between all workers… yes! like Blob Storage.

Read/Write vs Read only

Lucene can’t be written to concurrently by multiple processes. There are some workarounds here and there to try to achieve this by synchronizing processes with named mutex/semaphore locks, and even AzureSearch tries to handle some of this by utilizing Blob Storage leases, but it’s not a seamless experience. This is one of the reasons why Umbraco requires a ‘master’ web app for writing and a separate web app for scaling, which guarantees that only one process writes to the indexes. This is the setup that Examine.AzureDirectory supports too: on the front-end/replica/slave web app that scales, you configure the provider to be read-only, which guarantees it will never try to write back to the (probably locked) Blob Storage.

With this in place, when a new front-end worker comes online it doesn’t need to rebuild its own local indexes; it just checks that indexes exist, which only requires verifying that the master index is there, and then continues booting. At this stage there’s almost no performance overhead. Nothing actually happens with the local indexes until the index is referenced by this worker, and when that happens Examine lazily syncs just the Lucene files it needs locally.

How do I get it?

First thing to point out is that this first release is only for Examine 0.1.90 which is for Umbraco v7. Support for Examine 1.x and Umbraco 8.x will come out very soon with some slightly different install instructions.

The release notes of this are here, the install docs are here, and the Nuget package for this can be found here.

PM> Install-Package Examine.AzureDirectory -Version 0.1.90

To activate it, you need to add these settings to your web.config

<add key="examine:AzureStorageConnString" value="YOUR-STORAGE-CONNECTION-STRING" />
<add key="examine:AzureStorageContainer" value="YOUR-CONTAINER-NAME" />

Then for your master server/web app you’ll want to add a directoryFactory attribute to each of your indexers in ExamineSettings.config, for example:

<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
      supportUnpublished="true"
      supportProtected="true"
      directoryFactory="Examine.AzureDirectory.AzureDirectoryFactory, Examine.AzureDirectory"
      analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

For your front-end/replica/slave server you’ll want a different, read-only value for the directoryFactory, like:

<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
      supportUnpublished="true"
      supportProtected="true"
      directoryFactory="Examine.AzureDirectory.ReadOnlyAzureDirectoryFactory, Examine.AzureDirectory"
      analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

Does it work?

Great question :) With the testing that I’ve done it works, and I’ve had it running on this site for all of last year without issue, but I haven’t rigorously tested it at scale with high traffic sites, etc. I’ve decided to release a real version of this because having it as an alpha/proof of concept means that nobody will test or use it. So now hopefully a few of you will give this a whirl and let everyone know how it goes. Any bugs can be submitted to the Examine repo.


Examine 1.5.1 released

April 5, 2013 19:59

I’ve created a new release of Examine today, version 1.5.1. There’s nothing really new in this release, just a bunch of bug fixes. The other cool thing is that I’ve finally got Examine on Nuget now. The v1.5.1 release page is here on CodePlex with upgrade instructions… which really just amount to replacing the DLLs.

It’s important to note that if you have installed Umbraco 6.0.1+ or 4.11.5+ then you already have Examine 1.5.0 installed (which isn’t an official release on the CodePlex page), which has 8 of these 10 bugs fixed already.

Bugs fixed

Here’s the full list of bugs fixed in this release:

UmbracoExamine

You may already know this, but we’ve moved the UmbracoExamine libraries into the core of Umbraco so that the Umbraco core team can better support the implementation. That means that only the base Examine libraries will continue to exist at examine.codeplex.com. The 1.5.1 release only relates to the base Examine libraries, not the UmbracoExamine libraries, but that’s OK: you can still upgrade these base libraries without issue.

Nuget

There are 2 Examine projects up on Nuget: the basic Examine package, and the Azure package if you wish to use an Azure directory for your indexes.

Standard package:

PM> Install-Package Examine

Azure package:

PM> Install-Package Examine.Azure


Happy searching!

New Examine updates and features for Umbraco

March 6, 2013 00:42

It’s been a long while since Examine got some much needed attention and I’m pleased to say it is now happening. If you didn’t know already, we’ve moved the Umbraco Examine source into the core of Umbraco. The underlying Examine core (Examine.dll) will remain on CodePlex, but all the Umbraco bits and pieces found in UmbracoExamine.dll are in the Umbraco core from version 6.1+. This is great news because now we can all better support the implementation of Examine for Umbraco. More good news is that even versions prior to Umbraco 6.1 will have some bugs fixed (http://issues.umbraco.org/issue/U4-1768)! Niels Kuhnel has also jumped aboard the Examine train and is helping out a ton by adding his amazing ‘facet’ features, which will probably make it into an Umbraco release around version 6.2 (maybe 6.1, but we still need to do some review, etc. to make sure it’s 100% backwards compatible).

One other bit of cool news is that we’re adding an official Examine Management dashboard to Umbraco 6.1. In its present state it supports optimizing indexes, rebuilding indexes and searching them. I’ve created a quick video showing its features :)

Examine management dashboard for Umbraco

Ultra fast media performance in Umbraco

April 25, 2011 02:24

There are a few different ways to query Umbraco for media: using the new Media(int) API, using the umbraco.library.GetMedia(int, false) API, or querying for media with Examine. I suppose there are quite a few people out there who don’t use Examine yet and therefore don’t know that all of the media information is actually stored there too! The problem with the first 2 methods listed above is that they make database queries. The 2nd method is slightly better because it has built-in caching, but the Examine method is by far the fastest and most efficient.

The following table shows you the different caveats that each option has:

                 new Media(int)   library.GetMedia(int,false)   Examine
Makes DB calls   yes              yes                           no
Caches result    no               yes                           no
Real time data   yes              yes                           no

You might note that Examine doesn’t cache the result whereas the GetMedia call does, but don’t let this fool you: the Examine searcher returns results nearly as fast as ‘in cache’ data but without the additional memory that the GetMedia cache requires. The other thing to note is that Examine doesn’t have real-time data. This means that if an administrator creates/saves a new media item it won’t show up in the Examine index instantaneously; instead it may take up to a minute to be ingested into the index. Lastly, it’s obvious that the new Media(int) API isn’t a very good way of accessing Umbraco media because it makes a few database calls per media item and also doesn’t cache the result.

Examine would be the ideal way to access your media if it were real time, so instead we’ll combine the efforts of the Examine and library.GetMedia(int,false) APIs: first check if Examine has the data and, if not, revert to the GetMedia API. The method below does this for us and returns a new object called MediaValues which simply contains a Name and a Values property.

First here’s the usage of the new API below:

var media = MediaHelper.GetUmbracoMedia(1234);
var mediaFile = media.Values["umbracoFile"];

That’s a pretty easy way to access media. Now, here’s the code to make it work:

public static MediaValues GetUmbracoMedia(int id)
{
    //first check in Examine as this is WAY faster
    var criteria = ExamineManager.Instance
        .SearchProviderCollection["InternalSearcher"]
        .CreateSearchCriteria("media");
    var filter = criteria.Id(id);
    var results = ExamineManager.Instance
        .SearchProviderCollection["InternalSearcher"]
        .Search(filter.Compile());
    if (results.Any())
    {
        return new MediaValues(results.First());
    }

    var media = umbraco.library.GetMedia(id, false);
    if (media != null && media.Current != null)
    {
        media.MoveNext();
        return new MediaValues(media.Current);
    }

    return null;
}


The MediaValues class definition:

public class MediaValues
{
    public MediaValues(XPathNavigator xpath)
    {
        if (xpath == null) throw new ArgumentNullException("xpath");
        Name = xpath.GetAttribute("nodeName", "");
        Values = new Dictionary<string, string>();
        var result = xpath.SelectChildren(XPathNodeType.Element);
        while (result.MoveNext())
        {
            if (result.Current != null && !result.Current.HasAttributes)
            {
                Values.Add(result.Current.Name, result.Current.Value);
            }
        }
    }

    public MediaValues(SearchResult result)
    {
        if (result == null) throw new ArgumentNullException("result");
        Name = result.Fields["nodeName"];
        Values = result.Fields;
    }

    public string Name { get; private set; }

    public IDictionary<string, string> Values { get; private set; }
}

That’s it! Now you have the benefits of Examine’s ultra fast data access, plus real-time data for anything that hasn’t made it into Examine’s index yet.

Searching Umbraco using Razor and Examine

March 15, 2011 21:51
This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.
Since Razor is really just C#, it’s super simple to run a search in Umbraco using Razor and Examine. In MVC the actual searching should be left up to the controller, which gives the search results to your view, but in Umbraco 4.6+, Razor is used as macros which actually ‘do stuff’. Here’s how incredibly simple it is to do a search:
@using Examine;
@* Get the search term from query string *@
@{ var searchTerm = Request.QueryString["search"]; }
<ul class="search-results">
    @foreach (var result in ExamineManager.Instance.Search(searchTerm, true))
    {
        <li>
            <span>@result.Score</span>
            <a href="@umbraco.library.NiceUrl(result.Id)">
                @result.Fields["nodeName"]
            </a>
        </li>
    }
</ul>

That’s it! Pretty darn easy.

And for all you sceptics who think there’s too much configuration involved to set up Examine: configuring Examine requires 3 lines. Yes, it’s true, 3 lines, that’s it. Here’s the bare minimum setup:

1. Create an indexer under the ExamineIndexProviders section:

<add name="MyIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"/>

2. Create a searcher under the ExamineSearchProviders section:

<add name="MySearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"/>

3. Create an index set under the ExamineLuceneIndexSets config section:

<IndexSet SetName="MyIndexSet" IndexPath="~/App_Data/TEMP/MyIndex" />

This will index all of your data in Umbraco and allow you to search against all of it. If you want to search on specific subsets, you can use the fluent API, and of course if you want to modify your index there’s much more you can do with the config if you like.
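
For example, a fluent query against the searcher defined above might look like the following (a sketch; the document type and field values here are made up):

var searcher = ExamineManager.Instance.SearchProviderCollection["MySearcher"];
var criteria = searcher.CreateSearchCriteria();
//only match "newsItem" documents whose node name contains the term
var filter = criteria.NodeTypeAlias("newsItem").And().Field("nodeName", "hello").Compile();
var results = searcher.Search(filter);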

With Examine the sky is the limit. You can have an incredibly simple index and search mechanism, all the way up to an incredibly complex index with event handlers, etc. and a very complex search with fuzzy logic, proximity searches, etc. And no matter what flavour you choose, it is guaranteed to be VERY fast, no matter how much data you’re searching against.

I also STRONGLY advise you to use the latest release on CodePlex: http://examine.codeplex.com/releases/view/50781. Version 1.1 will also be coming out very soon.

Enjoy!

Examine output indexing

November 2, 2010 07:39
This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.
Last week Pete Gregory (@pgregorynz) and I were discussing different implementations of Examine, particularly when you need to use Examine events to collate information from different nodes to put into the index for the page being rendered. An example of this is an FAQ engine where you might have an Umbraco content structure such as:
  • Site Container
    • Public
      • FAQs
        • FAQ Item 1
        • FAQ Item 2
        • FAQ Item 3

In this example, the page that is rendered to the end user is FAQs, but the data from all 4 nodes (FAQs, FAQ Item 1 –> 3) needs to be added to the index for the FAQs page. To do this you can use Examine events, either using the GatheringNodeData event of the BaseIndexProvider, or the DocumentWriting event of the UmbracoContentIndexer (I’ll write another post covering the difference between these two events and why they both exist). Though writing Examine event handlers to put the data from FAQ Items 1 –> 3 into the FAQs index isn’t very difficult (see the sketch below), it would still be really cool if all of this could be done automatically.
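
As a rough sketch, the manual approach with GatheringNodeData might look like the following (names are illustrative; it assumes System.Linq and System.Xml.Linq are imported, and that e.Node contains the nested child documents as per Umbraco’s XML schema where child docs are elements carrying an “isDoc” attribute):

void GatheringNodeDataHandler(object sender, IndexingNodeDataEventArgs e)
{
	//only collate for the FAQs container pages
	if (e.Fields.ContainsKey("nodeTypeAlias") && e.Fields["nodeTypeAlias"] == "FAQs")
	{
		//join the text of all child FAQ items into one searchable field on the parent
		var childText = string.Join(" ",
			e.Node.Elements()
				.Where(x => x.Attribute("isDoc") != null)
				.Select(x => x.Value)
				.ToArray());
		e.Fields["faqContents"] = childText;
	}
}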

Pete mentioned it would be cool if we could just index the output HTML of a page (sort of like Google) and suddenly the ideas started to flow. This concept is actually quite easy to do, so within the next month or so we’ll probably release a beta of Examine Output Indexing. Here’s the way it’ll all get put together:

  • An HttpModule will be created to do 2 things:
    • Check if the current request is an Umbraco page request
      • If it is, we can easily get the current node being rendered since it’s already been added to the HttpContext items by Umbraco
      • Use the standard Examine handlers to enter the node’s data into the indexes based on the configuration you’ve specified in your Examine configuration files
    • Get the HTML output of the page before it is rendered to the end user, parse the html to get the relevant data and put it into the index for the current Umbraco page
  • We figured that it would also be cool to have an Examine node property that developers could define, called something like examineNoIndex, which we could check for when we determine that it’s an Umbraco page; if this property is set to true, we won’t index the page.
    • This could give developers more control over what specific pages shouldn’t be indexed based directly from the CMS properties instead of writing custom events

With the above, a developer will simply need to put the HttpModule in their web.config, define an Examine index based on a new provider we create, and that’s it. There will be no need to manually collate node data as in the FAQ example above. However, please note that this will work for straightforward searching, so if you have complex searching & indexing requirements I would still recommend using events, since you have far more control over what information is indexed.

Any feedback is much appreciated since we haven’t started developing this quite yet.

Examine v1.0 RTM

October 22, 2010 21:46
This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.
We finally released Examine version 1.0 a week or so ago. You can find the latest download package from the CodePlex downloads page for Examine: http://examine.codeplex.com/releases/view/50781 

Here’s what you’ll need to know

  • There are some breaking changes from the version that shipped with Umbraco 4.5 and also from the Examine RC3 release. The downloads tab on CodePlex contains the Release Notes download, which contains all of the information on upgrading & breaking changes
  • READ THE RELEASE NOTES BEFORE UPGRADING
  • There’s a ton of bugs fixed in this release from the version shipped with Umbraco 4.5
  • Lots of new features have been added:
    • Indexing ANY type of data easily using the LuceneEngine index/search providers
    • PDF Indexing for Umbraco
    • XSLT extensions for Umbraco
    • Data Type declarations for indexed fields
    • Date & Number range searching
  • New documentation has been added to CodePlex

Using v1.0 RTM on Umbraco 4.5

The upgrade process from the Examine version shipped with 4.5 to v1.0 RTM should be pretty seamless (unless you are using some specific API calls as noted in the release notes). However, once you drop in the new DLLs you’ll probably notice that the internal search no longer works. This is due to a bug in the Umbraco 4.5 codebase and a non-optimal implementation of Examine, which has to do with case sensitivity for application aliases (i.e. Content vs content). The workaround is simple though: all we need to do is change the analyzer used for the internal searcher in the Examine configuration file to the StandardAnalyzer instead of the WhitespaceAnalyzer. This is because the WhitespaceAnalyzer is case sensitive whereas the StandardAnalyzer is not. This issue is fixed in Umbraco Juno (4.6), which will continue to use the WhitespaceAnalyzer so that Examine doesn’t tokenize strings that contain punctuation. For more info on analyzers, have a look at Aaron’s post.
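
As a sketch, the configuration change looks something like this (the provider name and type here mirror a typical install and may differ in yours):

<add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
      analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>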

Next Versions

There probably won’t be too many more changes coming for Examine v1.0 apart from any bug fixing that needs to be done and maybe some tweaks to the fluent API. We will start working on v2.0 at some point this year or early next year, which will take Examine to the next level. It will be less focused on configuration, have a smaller footprint and be much more configurable through code (similar to how ASP.NET MVC works).

Searching Multi-Node Tree Picker data (or any collection) with Examine

September 23, 2010 04:10
This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.
With the release of uComponents recently a lot of people are starting to work with a new data type called the MultiNodeTreePicker, and with this I’ve seen a few questions around searching the data it generates using Examine.

The problem is there’s a catch: if you’re using the CSV storage (which you must be if you’re working with Examine), you’ll hit a problem. The Examine index will have something like this:

1011,1231,1232,1225

But how do you search on that? Searching for ‘1231’ will not return anything, because in the indexed value it’s prefixed and suffixed with ‘,’. So this brings a problem: how do you search?

Bring on Events

As Shannon spoke about at CodeGarden 10, Examine has a number of different events you can hook into to do different things (slides and code), and this is what we’re going to need to work with.

I’ve touched on events before, but this time we’re going to look at a different event: the GatheringNodeData event.

GatheringNodeData event

So this event in Examine is fired while Examine is scraping the data out of an XML element it has received. This XML could be from Umbraco (in the scenario we’re looking at here) or it could be from your own data source, and the event is raised once Examine has turned the XML into a key/value representation of it.

The event has custom event arguments with a property called Fields. This Fields property is a dictionary which contains the full key/value representation of the data which will end up in Examine!

Now this dictionary can be manipulated, so you can add/remove data as you see fit (but that’s a topic for another blog), and it also means you can change the data!

Changing the data for our needs

As I mentioned at the start, we end up with a comma-separated string from the datatype which isn’t useful for searching, so we can use an event handler to change what we’ve got. First we need to wire up an event handler:

public class ExamineEvents : ApplicationBase 
{
	public ExamineEvents() 
	{
		var indexer = ExamineManager.Instance.IndexProviderCollection["MyIndexer"];
		indexer.GatheringNodeData += GatheringNodeDataHandler;
	}

	void GatheringNodeDataHandler(object sender, IndexingNodeDataEventArgs e)
	{
		//do stuff here
	}
}

So this is just a simple wire-up, using the ApplicationBase class in Umbraco so that it’ll be created on application start-up. Next we need to implement the event handler:

void GatheringNodeDataHandler(object sender, IndexingNodeDataEventArgs e)
{
	//grab the current data from the Fields collection
	var mntp = e.Fields["TreePicker"];
	//let's get rid of those commas!
	mntp = mntp.Replace(",", " ");
	//now put it back into the Fields so we can pretend nothing happened!
	e.Fields["TreePicker"] = mntp;
}

And you’re done! Now the data will be written into the index with spaces rather than commas, meaning you can search on each ID without the need for wildcards or any other “hacks” to get it to work.
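
So a search like the following will now match an individual picked ID (a sketch using the same fluent API shown in earlier posts; the searcher name here is made up):

var searcher = ExamineManager.Instance.SearchProviderCollection["MySearcher"];
var criteria = searcher.CreateSearchCriteria();
//each picked ID is now its own term, so an exact field query matches it
var filter = criteria.Field("TreePicker", "1231").Compile();
var results = searcher.Search(filter);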

Note: This will work in the majority of cases, the only reason it’ll fail is if you’re using an analyzer that strips out numbers before indexing. For more information about Lucene analyzers take a look at this article: http://www.aaron-powell.com/lucene-analyzer

Text casing and Examine

August 24, 2010 08:41
This post was imported from FARMCode.org which has been discontinued. These posts now exist here as an archive. They may contain broken links and images.
A few times I’ve seen questions posted on the Umbraco forums asking how to deal with case-insensitive text in Examine, and it’s also something we’ve had to handle a few times within our own company.

Here’s a scenario:

  • You have a site search
  • You use Examine
  • You want to show the results looking exactly the same as it was before it went into Examine

If you’re running a standard install you’ll notice that the content always ends up lowercased!

This is a bit of a problem: page titles will be lowercase, body content will be lowercase, etc. Part of this is due to a mistake in Examine, part of it is due to the design of Lucene.

In this article I’ll have a look at what you need to do to make it work as you’d expect.

First, some background

Before we dive directly into what to do to fix it you really should understand what is happening. If you don’t care feel free to skip over this bit though :P.

Searching is a tricky thing, and when searching, Examine == examine evaluates to false. To get around this, searching is best done in a case-insensitive manner. To make this work, Examine did a forced lowercase of the content before it was pushed into Lucene.Net. This was to ensure that everything was exactly the same when it was searched against.
In hindsight this is not really a great idea; it really should be the responsibility of the Lucene analyzer to handle this for you.

Many of the common Lucene.Net analyzers actually do automatic lowercasing of content; these analyzers are:

  • StandardAnalyzer
  • StopAnalyzer
  • SimpleAnalyzer

So if you’re using the standard Examine config you’ll find yourself using the StandardAnalyzer and still have your content lowercased.

This means that there’s no need for Lucene to concern itself with case sensitivity when searching; everything is parsed by the analyzer (field terms and queries) and you’ll get more matches.

So how do I get around this?

Now that we’ve seen why all your content is generally lower case, how can we work with it in the original format and display it back to the UI?

Well we need some way in which we can have the field data stored without the analyzer screwing around with it.

Note: This doesn’t need to be done if you’re using an analyzer which doesn’t have a LowerCaseTokenizer or LowercaseFilter. If you’re using a different analyzer, like the KeywordAnalyzer, then this post won’t cover what you’re after (and since the KeywordAnalyzer isn’t lowercasing, if yours is, you’re actually using an out-dated version of Examine and I recommend you grab the latest release :)). More information on analyzers can be found at http://www.aaron-powell.com/lucene-analyzer

Luckily we’ve got some hooks into Examine to allow us to do what we need here, it’s in the form of an event on the Examine.LuceneEngine.Providers.LuceneIndexer, called DocumentWriting. Note that this event is on the LuceneIndexer, not the BaseIndexProvider. This event is Lucene.Net specific and not logical on the base class which is agnostic of any other framework.

What we can do with this event is interact directly with Lucene.Net while Examine is working with it.
You’ll need to have a bit of an understanding of how to work with a Lucene.Net Document (and for that I’d recommend having a read of this article of mine: http://www.aaron-powell.com/documents-in-lucene-net), because what you’re able to do is play with Lucene.Net… Feel the power!

So we can attach the event handler the same way as we would any other event in Umbraco, using an ApplicationBase class:

public class UmbracoEvents : ApplicationBase
{
	public UmbracoEvents()
	{
		var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["DefaultIndexer"];

		indexer.DocumentWriting += indexer_DocumentWriting;
	}
}

To do this we’ve got to cast the indexer so we’ve got the Lucene version to work with, then we attach our event handler. Let’s have a look at the event handler:

void indexer_DocumentWriting(object sender, DocumentWritingEventArgs e)
{
	//grab out lucene document from the event arguments
	var doc = e.Document;

	//the e.Fields dictionary is all the fields which are about to be inserted into Lucene.Net
	//we'll grab out the "bodyContent" one, if there is one to be indexed
	if(e.Fields.ContainsKey("bodyContent")) 
	{
		string content = e.Fields["bodyContent"];
		//Give the field a name which you'll be able to easily remember
		//also, we're telling Lucene to just put this data in, nothing more
		doc.Add(new Field("__bodyContent", content, Field.Store.YES, Field.Index.NOT_ANALYZED));
	}
}

And that’s how you can push data in. I’d recommend that you do a conditional check to ensure that the property you’re looking for does exist in the Fields property of the event args, unless you’re 100% sure that it appears on all the objects which you’re indexing.

Lastly we need to display that on the UI. Well, it’s easy: rather than accessing the bodyContent property of the SearchResult, use __bodyContent and you’ll get your unanalyzed version.
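
In other words (using the SearchResult from your existing search code):

var rawBodyContent = result.Fields["__bodyContent"];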

Conclusion

Here we’ve looked at how we can use the Examine events to interact with the Lucene.Net Document. We’ve decided that we want to push in unanalyzed text, but you could use this idea to really tweak your Lucene.Net document. But really playing with the Document is not recommended unless you *really* know what you’re doing ;).