Thinking about importing content?

We spend a lot of time worrying about the marketing content, and the general website text and images in Sitecore. A lot gets said about patterns for organising that content. But some projects have information that comes from external systems that needs to be rendered on the website. And plenty of sites choose to integrate that into their main content tree. Over the years I’ve bumped into a few problems because of this – usually because I find myself supporting something where poor decisions were made early in the design process for the integration. So here are some things to think carefully about if you’re planning work that relies on back-end data:

The key thing that’s caused more problems than any other is misunderstanding the volume of content you’re going to be dealing with. You need to think carefully about that before you get started, because it can trip you up. How much will there be at the start, when the integration first runs? And how is this likely to change over time? Will it tend to grow, or will the number stay roughly level?

Pain points…

If you know your content is small and you know it will stay that way, then writing it into a folder (or a small set of folders) is probably ok. If you can keep within the “never have more than 100 children under any item” rule for the Sitecore content tree then things should be ok. But if your content is large, or is likely to grow over time, then you need to give serious consideration to how you’re going to store it from the beginning. While it might seem like overkill to start with, the problems with large folders of content can become significant. Broadly, the key thing to worry about is “do I really need this content inside Sitecore?” If you do, then you need to look at buckets. If you don’t, then you need to plan how your code will access the data. For the moment, let’s assume the data is going into Sitecore.

Either way, you need to think about the speed of your integration process – it’s relevant whether the data is in Sitecore or not, but for different reasons. With data imported into the content tree, you need to consider how long each item operation (inserting, updating or deleting) will take. With a sensibly written integration you want to be processing as few items in each update cycle as you can, so the time per item may not be that critical once it’s up and running. But the initial import, or a situation where many items are changed at once (what would happen if you needed to change the schema, for example?) will lead you to process a lot of items. And that can lead to long execution times. Consider how this code can be optimised. Using things like BulkUpdateContext and related techniques to keep the per-item times as low as possible will pay dividends.
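To make “processing as few items in each update cycle as you can” a bit more concrete, here’s a minimal sketch of one way to work out which records actually need touching. It’s plain Python rather than Sitecore code, and everything in it (the `fingerprint` and `plan_changes` names, the idea of storing a hash per imported record) is my assumption for illustration, not something from a particular library:

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable hash of a source record, used to spot changes cheaply."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_changes(source: dict, known: dict) -> dict:
    """Compare source records (key -> record) with the fingerprints
    stored at the last import (key -> hash) and list the work to do."""
    inserts = [k for k in source if k not in known]
    updates = [k for k in source
               if k in known and fingerprint(source[k]) != known[k]]
    deletes = [k for k in known if k not in source]
    return {"insert": inserts, "update": updates, "delete": deletes}


# Hypothetical data: three records were imported last time.
known = {
    "p1": fingerprint({"name": "Widget", "price": 10}),
    "p2": fingerprint({"name": "Gadget", "price": 20}),
    "p3": fingerprint({"name": "Gizmo", "price": 30}),
}
source = {
    "p1": {"name": "Widget", "price": 10},     # unchanged, so skipped
    "p2": {"name": "Gadget", "price": 25},     # changed
    "p4": {"name": "Doohickey", "price": 5},   # new
}
plan = plan_changes(source, known)
print(plan)
```

Only the keys in the plan need an item operation, so an unchanged record costs one hash comparison rather than a write to the content tree.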

If the data isn’t being imported, then you’re fetching it from its external source when it’s required. Again, the per-operation timing can be really important here, but for different reasons. In this case it’s going to be a limiting factor on how quickly your pages can render. Again, try to build your code so that it’s as fast as reasonably practical, but you probably also need to give consideration to how caching can work with the process to fetch this content data. Maybe whatever service you’re fetching data from can have a cache layer built in? Maybe you need to consider how you might use a cache to reduce the number of times the data service is called from the website’s side?
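As a rough illustration of caching on the website’s side, here’s a small time-based cache wrapped around a fetch function. It’s a sketch under my own assumptions: `TtlCache` and `fetch_product` are hypothetical names, the fetch is faked so the example runs on its own, and a real site would more likely lean on whatever caching its platform already provides.

```python
import time


class TtlCache:
    """Caches fetched values per key for a fixed number of seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # function that calls the data service
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so it can be tested
        self._entries = {}           # key -> (time stored, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # still fresh, no service call
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


# Hypothetical stand-in for the back-end service call.
calls = []

def fetch_product(key):
    calls.append(key)
    return {"id": key, "name": f"Product {key}"}


clock = {"now": 0.0}
cache = TtlCache(fetch_product, ttl_seconds=60, clock=lambda: clock["now"])

first = cache.get("p1")    # miss: calls the service
second = cache.get("p1")   # hit: served from the cache
clock["now"] = 61.0
third = cache.get("p1")    # expired: calls the service again
print(len(calls))
```

The expiry time is the trade-off to think about: a longer one means fewer calls to the data service, but visitors may see stale data for longer.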

Data in Sitecore?

Large volumes of imported content have a lot of potential to cause you some problems if you’re not careful. On top of the challenge above about expanding the content tree when you have too many children under items, you can end up with publishing and indexing difficulties. When they’re not sure what to publish, editors are annoyingly likely to try “Publish Site” – and big collections of imported content can lead to long publish times. There may be an argument for integrating directly with the web database here, to avoid the need for publishing. But that might not be easy if editors need to make relationships to the imported content when editing in the master database.

When you’re dealing with imported content, think carefully about the effect any computed fields in your search index may have. What work is required to compute a value? How often would this value change? Would it be better to compute this data at import time, so it only needs to be done when the source data changes? Much like the original importing, less work at indexing time will lead to fewer problems overall.

Once issues with indexing speed are found, it’s a common pattern to try and solve them by changing the definition of the Sitecore master/web indexes to include only a fragment of the content. While this will indeed make index rebuilds faster, it can cause a variety of other issues – the most obvious of which is that search in Content Editor will only work if you query while an item that is indexed is selected.

There’s a whole separate set of challenges if the content items being integrated are part of the media library. Publishing issues can get worse, as the volume of data that needs to be moved increases a lot with media. And the underlying database will get dramatically bigger – the blob table (where the binary data for the media items is stored) can easily come to dominate the disk. Once you’re dealing with 30GB-plus databases you can find that backups and synchronising content from production back to developers become much harder as well. And that tends to lead to problems where QA tests are not realistic.

A key factor in ensuring the item count doesn’t run away over time is to make sure that the integration process deals with deletions and updates properly as well as inserts. If an existing item is no longer needed, the integration process should remove it. Given the data is computed, not “editor created”, it should probably skip the recycle bin too. That way the content tree doesn’t grow any more than absolutely necessary. Should the removed item become necessary again, the integration process can re-insert it. For updates, think very hard about whether you want item versions or not – you probably don’t need them, because “reverting to a previous state” should happen in the system that owns the content, not in Sitecore. And if you have them, their number is going to grow over time. If you can come up with a business rule for the maximum number (or age) of historic versions, then do make your import process remove any versions that don’t meet these criteria.
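A business rule for versions might look something like the sketch below. It only decides which version numbers should go. The actual removal would be done through whatever item API your import uses. The function name, the tuple shape and the limits are all assumptions I’ve made up for the example:

```python
from datetime import datetime, timedelta


def versions_to_remove(versions, now, max_count, max_age):
    """versions is a list of (version_number, created) tuples.
    Keeps at most max_count versions, none older than max_age,
    and always keeps the latest version whatever its age."""
    newest_first = sorted(versions, key=lambda v: v[0], reverse=True)
    remove = []
    for position, (number, created) in enumerate(newest_first):
        if position == 0:
            continue  # never remove the current version
        if position >= max_count or now - created > max_age:
            remove.append(number)
    return sorted(remove)


# Hypothetical item with five versions.
now = datetime(2024, 1, 31)
versions = [
    (1, datetime(2024, 1, 1)),
    (2, datetime(2024, 1, 10)),
    (3, datetime(2024, 1, 20)),
    (4, datetime(2024, 1, 25)),
    (5, datetime(2024, 1, 30)),
]
stale = versions_to_remove(versions, now, max_count=3,
                           max_age=timedelta(days=10))
print(stale)
```

Here versions 1 and 2 fall outside the “keep three” limit, and version 3 is inside it but older than ten days, so all three would be removed.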

And if you do plan to pull your content into Sitecore, have a think about what tech you’re going to use to do the integration processing. Sitecore’s answer would be the Data Exchange Framework. There’s a learning curve associated with that, but it provides a lot of the core functionality for you. You only need to provide the endpoint for your back-end systems and some configuration work. That may be an advantage for you. However, for simpler scenarios it may be more complex than you need.

If you do decide to roll your own framework, the question of scale comes up again. The simplest answer of “put the integration into a web page that can be called for each update” has its own scaling challenges. A simple web request like this is very easy to get started with, but it’s not designed for long-running operations. As content grows you will find it risks timing out – and just turning up the ASP.NET timeout settings is almost certainly the wrong approach here. To get reliability for larger integrations, you almost certainly have to take the controlling process out of your website. A Windows Service, or a scheduled task on your server may be more appropriate here – and they can call an endpoint on your site which performs individual integration operations that don’t risk timeouts. (Avoid the temptation to write directly to the Sitecore databases!) Code will have to cope with the website not responding, however. Think about how retries and logging will be managed.
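For the “website not responding” case, the controlling process might wrap each call in something like this. It’s a minimal sketch: `call_with_retries` is a hypothetical helper, the flaky operation is faked so the example runs without a site to talk to, and real code would probably catch only timeout and connection errors rather than every exception.

```python
import logging
import time

logger = logging.getLogger("integration")


def call_with_retries(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Runs operation(), retrying with a doubling delay when it fails.
    Re-raises the last error once the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:  # narrow this in real code
            logger.warning("attempt %d of %d failed: %s",
                           attempt, attempts, exc)
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical endpoint call that fails twice and then succeeds.
state = {"calls": 0}

def import_one_item():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("site did not respond")
    return "ok"


delays = []  # record the waits instead of really sleeping
result = call_with_retries(import_one_item, attempts=3,
                           base_delay=1.0, sleep=delays.append)
print(result, delays)
```

Keeping each call down to a single integration operation means a failure only costs a retry of that one item, not a restart of the whole run.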

Data not in Sitecore?

The alternatives to putting data into Sitecore have their own set of pros and cons to consider. The two most common ideas here are some sort of external web service to provide data on demand, or a Data Provider in Sitecore.

Data Providers allow external information to be presented to Sitecore’s internal data APIs, so that it can appear in the content tree without actually being part of the content database. This gets rid of the need for an integration process to pull the data in, but the provider code itself can be complex. The information architecture and performance of this code are still important, as challenges with speed of processing to do things like expand the content tree remain.

It’s also worth considering that in the future, Data Providers are unlikely to be supported after the transition to a SaaS model for Sitecore. They’re tied into configuration and APIs at a low level – things which we probably won’t be able to change in a SaaS deployment.

As mentioned above, building an API to fetch the data directly from the back-end system is a potential approach here. Rather than having the data in the content tree, just read the right bits at the point you need to render them. It’s a different set of performance challenges – as you want to optimise for the fastest return of that single record of data, but it has some potential advantages to look at. Keeping the extra content out of Sitecore’s tree gets you away from most of the content scale challenges discussed above. You still need to worry about the performance of fetching the data, and about how you optimise those fetches via caching. But both server-side and client-side rendering of UI components can take advantage of this approach fairly simply. Where the client-side approach is practical, it is another way to shift load off your web server, and allow it to scale to larger sites. Modern client-side UI frameworks can make binding data from an API very simple.

The other option to consider is pushing your external data directly into a search index. It’s fast to fetch data from Solr or Azure Search, and it has some key advantages if you need to filter or sort data at runtime. You’ll need a mechanism to push changes from your raw data into the search index here, but that may well be faster and easier than pushing it to Sitecore. Search engines have well-developed HTTP-based APIs for importing data that can work well with your own import code. They may also have features for directly monitoring and updating from database tables.
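To give a feel for how little code the push side can need, here’s a sketch that builds a JSON update request for Solr’s update handler. The server address, the `products` collection and the field names are assumptions for the example, and the request is built but deliberately not sent, so the code runs without a Solr server to hand:

```python
import json
import urllib.request


def build_solr_update(base_url, collection, docs):
    """Builds a POST of JSON documents to a Solr collection's
    update handler, asking Solr to commit once they are added."""
    url = f"{base_url.rstrip('/')}/solr/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents taken from the back-end system.
docs = [
    {"id": "p2", "name": "Gadget", "price": 25},
    {"id": "p4", "name": "Doohickey", "price": 5},
]
request = build_solr_update("http://localhost:8983", "products", docs)
print(request.get_method(), request.full_url)

# urllib.request.urlopen(request) would send it to a running server.
```

Committing on every request is the simple option. For larger volumes it may be worth batching documents and committing less often, for example with Solr’s `commitWithin` parameter.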
