Those assumptions? You should validate them…

The one thing that is true of every aspect of IT is that it is always changing. And that change means that things you were confident of in the past may no longer hold true.

I was reminded of this while sitting in the pub with some developers recently, talking about querying for items by path in Sitecore. The debate about the best way to do this raged, but a common thread was the oft-repeated claim that the fastest way to find a set of items you need is via a ContentSearch index. That assumption has its roots in the days when most sites were running their queries through Lucene, and in queries with more complex matching rules. But does it hold true here?

Continue reading

Tripping over Liskov Substitution and search

When you’re working with a “provider” model for services in your applications you get used to the assumption that everything follows the Liskov Substitution Principle, and whatever provider you plug in will work in the same way. Unfortunately, for software out in the real world that’s not always entirely true. Recently I came across an example of this which helped point out a bug in some search code in Sitecore…

The scenario

A component I found myself looking at was using the ContentSearch APIs to perform some queries and then render UI based on the results. There wasn’t anything special going on. It was just finding an appropriate index, building up a query, running it and then displaying how many items matched. The relevant bit was vaguely along the lines of:

var index = fetchContextIndex(someContentItem);
var predicate = buildTheSearchCriteria(currentState);

using (IProviderSearchContext context = index.CreateSearchContext())
{
    var query = context
        .GetQueryable<SearchResultItem>()
        .Filter(predicate);

    var fullResultsSet = query.GetResults();
    var totalResults = fullResultsSet.Count();

    // Display the number of matches
}

The confusion

The code started off running against an index managed by Lucene. With the particular set of content on the server, the value of the variable totalResults came back as 97. That seemed a sensible value, as there were roughly that number of items that matched the search criteria. But later the code got migrated to a server that was using Coveo to index the same content. And once that had happened, the value of totalResults always came back as 10, despite there being more matching pages in both the content tree and in the Coveo index.

Cue some head scratching

The solution

After a bit of fun with Google and poking about with the debugger, the subtle issue revealed itself: the code above uses the fullResultsSet.Count() method to fetch the total number of index hits that the search framework found for the query. At first glance that looks fine – the fullResultsSet object implements the IEnumerable interface – so calling Count() seems a perfectly reasonable way to get the size of the results when there’s no pagination involved in the query.

But as some of you no doubt already spotted, that’s not the documented way you’re supposed to get the total number of results for a query. As a number of Google hits point out, the property TotalSearchResults is the thing we should be using here. And that returns the correct value for both Coveo and Lucene.
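
The fix, then, is a one-line change to the original snippet. As before, fetchContextIndex() and buildTheSearchCriteria() are the hypothetical helper names from the earlier code, not real Sitecore APIs:

var index = fetchContextIndex(someContentItem);
var predicate = buildTheSearchCriteria(currentState);

using (IProviderSearchContext context = index.CreateSearchContext())
{
    var query = context
        .GetQueryable<SearchResultItem>()
        .Filter(predicate);

    var fullResultsSet = query.GetResults();

    // TotalSearchResults is set by the search provider to the total
    // number of matches, independent of how many hits came back in
    // this page of results - so it behaves the same for both providers.
    var totalResults = fullResultsSet.TotalSearchResults;

    // Display the number of matches
}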

If the query had included pagination, the issue would have revealed itself straight away, as that would have highlighted the different behaviours of Count() and TotalSearchResults when your query result set is bigger than the results page size. But because the code in question didn’t do that, the bug slipped through…
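
To illustrate, a paginated version of the query might look something like this sketch, using the standard LINQ Skip() and Take() operators over the ContentSearch queryable. (The index and predicate variables are the hypothetical ones from the earlier snippet, and pageNumber / pageSize are whatever your UI supplies.)

using (IProviderSearchContext context = index.CreateSearchContext())
{
    var pagedResults = context
        .GetQueryable<SearchResultItem>()
        .Filter(predicate)
        .Skip(pageNumber * pageSize) // explicit pagination, so the page
        .Take(pageSize)              // size is under our control
        .GetResults();

    // Count() walks the hits in this page - at most pageSize of them -
    // while TotalSearchResults still reports the full match count.
    // With pagination in place, the difference is obvious immediately.
    var hitsInPage = pagedResults.Count();
    var totalHits = pagedResults.TotalSearchResults;
}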

Why does it behave like this?

Well, getting past the initial, slightly petulant “just to confuse us!” response, it’s all down to implementation details…

If you look into the code for SearchResults<TSource> you’ll see that this class exposes both the TotalSearchResults property and an implementation of IEnumerable:

The TotalSearchResults property is a simple auto-property, whose value is set specifically by the provider generating the results:

public int TotalSearchResults
{
	get;
	private set;
}

That value is set by the constructor, and it can be independent of the size of the results page being returned for this query.

But the value of a call to Count() for this collection will be based on the enumerator that the class exposes. The implementation of IEnumerable returns an enumeration taken from the inner Hits collection:

IEnumerator<SearchHit<TSource>> IEnumerable<SearchHit<TSource>>.GetEnumerator()
{
	return this.Hits.GetEnumerator();
}
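
A tiny self-contained mock makes the divergence concrete. This is an illustration only – the class below just mirrors the shape of the real SearchResults<TSource>, it is not Sitecore code:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// A toy stand-in for SearchResults<TSource>, showing why Count() and
// TotalSearchResults can disagree when the provider pages its results.
class FakeSearchResults : IEnumerable<string>
{
    private readonly List<string> hits;      // the page of hits actually returned
    public int TotalSearchResults { get; }   // set by the "provider" at construction

    public FakeSearchResults(List<string> hits, int totalSearchResults)
    {
        this.hits = hits;
        TotalSearchResults = totalSearchResults;
    }

    // Count() goes via this enumerator, so it only ever sees the page of hits
    public IEnumerator<string> GetEnumerator() => hits.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

class Demo
{
    static void Main()
    {
        // A provider that returned a page of 10 hits but matched 97 in total:
        var page = Enumerable.Range(1, 10).Select(i => $"hit {i}").ToList();
        var results = new FakeSearchResults(page, 97);

        Console.WriteLine(results.Count());            // 10 - walks the enumerator
        Console.WriteLine(results.TotalSearchResults); // 97 - the provider's total
    }
}
```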

For Lucene, a query with no pagination will return all the matching index items, up to the maximum defined by the “max result set size” config setting (ContentSearch.SearchMaxResults in your config files). In this case that maximum was more than 97, so the whole result set was returned and the code appeared to work. But Coveo seems to default to a page of 10 results if you fail to specify pagination.

If you think about it, that behaviour makes some sense. Lucene runs in the same process as your site, so it’s not a big issue for it to return all the result data when you don’t explicitly apply a pagination clause to your query. (You still should though!) It’s just shuffling memory about, which is fairly fast to do. Coveo, however, runs out-of-process (and in the worst case might be out in the cloud, if you use the SaaS version), so defaulting to returning details for only the first 10 results when there is no pagination clause could help prevent performance issues from huge result sets being pushed across the network.
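
For reference, that Lucene cap is controlled via a Sitecore settings patch. A sketch of what such a patch might look like – the setting name comes from the text above, but the value here is purely illustrative, and you should check your own Sitecore version’s defaults:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <settings>
      <!-- Caps the number of hits returned for un-paginated queries.
           The value 500 is an example, not a recommendation. -->
      <setting name="ContentSearch.SearchMaxResults">
        <patch:attribute name="value">500</patch:attribute>
      </setting>
    </settings>
  </sitecore>
</configuration>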

So take care people – Barbara Liskov might not approve, but sometimes you need to be wary about swapping out providers. There can be justifications for why behaviour isn’t always exactly the same, and those variations can lead to subtle bugs if you’re not paying attention…

And reading the documentation so you understand the right way to use the objects in question helps too 😉

Enabling automated index rebuilds

Another helpful addition to the “scripted installs” functions I’ve been looking at for the last few weeks is the ability to trigger a full rebuild of a search index. Last week we looked at deferring the indexing of items installed by a package, to try and help speed up the scripted install of packages. So it makes sense to be able to trigger a rebuild as well… Continue reading

That search index rebuild option is too hard to find…

The other day a colleague remarked on how much time was wasted navigating down to the search index rebuild options in the Sitecore 6.6 Control Panel. While there are a variety of other ways of triggering a search rebuild (Sitecore Rocks, for example), it was suggested that having a button on the Ribbon (as per Sitecore 7) would be helpful in this case. So here’s a quick set of instructions for one way it can be added: Continue reading

Sorting for search, when you’re living in the dark ages

I’ve written before about filtering data in Lucene searches if you’re still using Sitecore 6.x. Having been doing more legacy work on this front over the last couple of weeks, I’ve got a couple of new things to add. Previously, the search work I’d been doing had relied on the default “relevance” sort order, or LINQ OrderBy clauses. However recently I’ve needed to enable some more complicated sorting, which has led me to a few new (to me, at least) discoveries. Continue reading

Publishing restrictions and search

I had to deal with a bug report in some Sitecore 6.6 / Advanced Database Crawler search code recently, relating to items with publishing restrictions not disappearing from search results until another publish occurred. It struck me that there’s not much written about how publishing restrictions interact with search, so I figured I should take a bit of time to write down what I’d found while sorting the bug. Continue reading