Dealing with very long-running search builds

A problem I’ve encountered a number of times in my Sitecore career is that when content trees are large and indexing tasks are complex, a full rebuild of a search index can take so long that your web application recycles before the build completes. Once content grows to this size, search can become quite difficult to manage, so I’ve been experimenting with a tool to help.

The background

You’ve probably experienced the same thing: a search index build that runs for days. You have no idea how long it’s going to take to finish, and then something goes wrong and IIS resets itself, losing your progress.
After it happened to me recently I decided it might be helpful if I could run a search index build from outside of the website process, so that IIS could recycle without breaking the rebuild.

So as an experiment I’ve hacked together a command-line tool to try and help myself with this issue – maybe it could help you too?

Here’s how it works:

The index service endpoint

To allow a process outside the website to control indexing there needs to be some sort of web-service endpoint exposed by Sitecore so that some other process can send commands. I hacked together a quick ASPX page that can expose a simple HTTP GET => JSON type API, and added an option to my tool to allow you to deploy it into a Sitecore instance.

You can use the “Deploy” verb:

SearchIndexBuilder.exe deploy -w <your website folder> [-o] [-t <token>]

You specify the root folder of your website with the -w parameter, and the tool will write the endpoint file there. The endpoint only responds to requests that include its security token. By default the tool will generate a random GUID for the token, but if you’d prefer to specify something yourself you can use -t to pass it in. Finally, the tool won’t overwrite an existing endpoint file if it finds one, but you can change that behaviour with the -o flag.

The tool’s output includes the token that was configured – so if you let it generate one for you it will tell you what it is.
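
As a rough illustration of the token idea (this is not the tool's actual code, which isn't shown here), a random GUID can act as a shared secret that is checked on every request. The function names below are hypothetical, and the constant-time comparison is an assumption about good practice rather than something the endpoint is known to do.

```python
import hmac
import uuid


def new_token():
    """Generate a random GUID string to use as a shared secret."""
    return str(uuid.uuid4())


def token_matches(expected, supplied):
    """Compare two tokens in constant time; a missing token never matches."""
    if not supplied:
        return False
    return hmac.compare_digest(expected.encode("utf-8"), supplied.encode("utf-8"))
```

Whatever generates the token, the same value has to be given to both the endpoint and anything that calls it.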

When you’re done with the endpoint you can also use the tool to remove it again with the “Remove” verb. Again you pass the -w parameter to specify the website root folder to remove the endpoint from.

The endpoint should work with most versions of Sitecore from v7.0 and up. (It won’t work with older ones as it requires the ContentSearch APIs.) I’ve tried it against v7.1, v7.2 and v9.0 so far.

Note that these two verbs need to be run on the same server as the Sitecore instance (because they write directly to disk), but the other verbs can be run remotely as they only use HTTP(S) access.

Configuring a rebuild

Because I wanted the rebuild process to be able to survive things like an IIS reset or the server getting rebooted, it made sense to keep the key config for the process in a file. So the tool has the “Setup” verb which helps you generate a config file.

SearchIndexBuilder.exe setup -u <url of the endpoint> -d <database> -t <token> [-q <query for items>] [-c <config file name>] [-o]

You give the tool the web URL for the endpoint with the -u parameter. As with the endpoint deployment above, you don’t need to worry about the name of the endpoint file, just its path. Use HTTPS if you can, though, to protect your security token. You pass the token into the config file with the -t parameter. Make sure you specify the same one that was used when you set up the endpoint above.

When the tool is generating the list of items to process it needs to pick one of your content databases to look at. The -d parameter lets you specify the name of the one you want to use.

By default the code uses a fairly high-performance database query against the Items table to get the list of content items that need re-indexing. It gets everything from the database you specified. However, you can pass a Sitecore Query instead. If you use the -q option to pass a query then it will be executed by a SelectItems() call against your chosen database. So remember that this could have a much more significant performance impact if you write a query that will hit lots of items.

All the data worked out above gets written into a config file. By default it will be saved to .\config.json, but if you want to write to a different name or location then you can use -c to pass that in. And as above, if you need to overwrite an existing file then the -o flag will enable that.

Since this is just a JSON file, you can edit it with your favourite text editor. By default it configures itself to try and rebuild all of the indexes that are configured in your server. But you can edit this set by removing entries from the config file. You can also make tweaks to the list of items to process if you fancy.

Running a build…

Once you have a config file, the “Index” verb can be used to process it:

SearchIndexBuilder.exe index [-c <config file>] [-o <output every X items>] [-r <retries in case of error>] [-p <ms to pause for>] [-t <timeout in seconds>]

If you want to specify a particular config file then use -c as above, otherwise it will use the default file name.

The tool will process all the items configured in your file, and will try to reindex each of them against whatever indexes are defined in your config. There are some extra parameters that can allow you to tweak the rebuild process as it runs.

The tool keeps a rolling average of how long it’s taking to rebuild each index entry, and tries to predict its likely run-time based on these numbers. It will show you an estimate every 10 items by default, but you can adjust that with the -o parameter.
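
To make the estimate idea concrete, here is a minimal sketch of a rolling-average ETA calculation. It is an illustration only, not the tool's implementation: the class name, the window size of 50 timings and the h:mm:ss formatting are all assumptions.

```python
from collections import deque


class EtaEstimator:
    """Keeps a rolling average of per-item durations and predicts time left."""

    def __init__(self, window=50):
        # Only the most recent `window` timings are kept (assumed size).
        self.timings = deque(maxlen=window)

    def record(self, seconds):
        """Store how long the latest item took to index."""
        self.timings.append(seconds)

    def average(self):
        """Mean of the retained timings, or 0.0 before any are recorded."""
        if not self.timings:
            return 0.0
        return sum(self.timings) / len(self.timings)

    def remaining_seconds(self, items_left):
        """Predicted run-time for the items still to be processed."""
        return self.average() * items_left


def format_duration(seconds):
    """Render a number of seconds as h:mm:ss."""
    seconds = int(round(seconds))
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"
```

A caller would record the duration of every item and print format_duration(estimator.remaining_seconds(items_left)) every N items. Because only recent timings are averaged, the estimate follows changes in server load rather than being anchored to the start of the run.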

If some sort of error happens while rebuilding an item, the tool has a retry process. By default it will retry an item 5 times before deciding that the error is permanent rather than transient. The tool will back off after each error, to allow some time for transient errors to recover. It backs off for an increasingly long time after each repeated error, so it should be able to back off long enough to survive an IIS reset without considering it a permanent error. If you want to change the retry count then pass -r to change the default. This might be useful if your site takes a particularly long time to come back up after it recycles. Errors are logged internally, and are written to disk when the tool finishes, so that you can see exactly which items caused issues.
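
The retry behaviour described above could be sketched roughly like this. The post doesn't give the tool's real delays, so the 5 second starting delay and the doubling factor are assumptions, as are the function names. The index_item argument stands in for whatever makes the HTTP request to the endpoint.

```python
import time


def backoff_delays(retries, base_seconds=5.0, factor=2.0):
    """Return the wait before each retry, growing after every failure."""
    return [base_seconds * (factor ** attempt) for attempt in range(retries)]


def index_with_retry(index_item, item_id, retries=5, base_seconds=5.0,
                     sleep=time.sleep):
    """Call index_item(item_id), retrying with increasing waits on failure.

    Returns (succeeded, errors), where errors lists every failure message seen.
    """
    errors = []
    delays = backoff_delays(retries, base_seconds)
    # One initial attempt plus up to `retries` further attempts.
    for attempt in range(retries + 1):
        try:
            index_item(item_id)
            return True, errors
        except Exception as exc:
            errors.append(str(exc))
            if attempt < retries:
                sleep(delays[attempt])
    # Retries exhausted: treat the error as permanent.
    return False, errors
```

With these assumed numbers, five retries wait 5, 10, 20, 40 and 80 seconds, about two and a half minutes in total, which shows why raising the retry count helps a site that is slow to come back up.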

If you want to limit the CPU impact of the rebuild process on your server then the -p parameter lets you specify how many milliseconds to wait between indexing requests. You can also use -t to set the timeout for indexing an item.

Normally the tool will try to run all the indexing operations in one go. But if you need to stop the process for some reason, then hit Ctrl-C. The tool will finish the item it’s working on and then stop. It will save its state to disk when it finishes. The errors get written to disk, along with a revised copy of the config file that reflects the progress made. It keeps a backup of the previous state as well. You can restart the process using this updated config file later.
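
For anyone curious how a stop-and-resume loop of this kind can work, here is a small sketch under stated assumptions. It is not the tool's code: the "remaining" key, the .bak suffix and all of the names are hypothetical. The point is that Ctrl-C only sets a flag, so the item in progress always completes before the state is written out.

```python
import json
import os
import shutil
import signal


class StopFlag:
    """Set when Ctrl-C is pressed so the loop can stop between items."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Replace the default KeyboardInterrupt behaviour with a flag.
        signal.signal(signal.SIGINT, lambda signum, frame: self.request())

    def request(self):
        self.requested = True


def save_state(path, state):
    """Write state as JSON, keeping the previous file as a .bak copy."""
    if os.path.exists(path):
        shutil.copyfile(path, path + ".bak")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)


def run(items, process, stop, state_path):
    """Process items until done or a stop is requested, then save what is left."""
    remaining = list(items)
    while remaining and not stop.requested:
        process(remaining[0])  # the current item always finishes
        remaining.pop(0)
    save_state(state_path, {"remaining": remaining})
    return remaining
```

Restarting is then just a matter of loading the saved file and calling run again with the items it lists.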

You can also generate a new job config from the errors of a previous run if you need to. This, and some assorted other options, are written up in the readme.

Want to try it?

Go ahead – you can grab the latest build from GitHub. If you want to know how it works, the source code is available on GitHub too. It’s not particularly pretty, but you’re free to poke around and suggest improvements if you fancy. There’s some extra information in the readme for that repo.
