Custom Sitemap Files – Part Two

Last week we looked at the items to create in Sitecore to configure a custom sitemap generator. This week we’ll carry on and look at the basic proof-of-concept code that processes that configuration and generates the sitemap and sitemap index files. It’s another epic post…

The sitemap generation process starts when Sitecore triggers the custom event handler class we defined. When that runs we need to get the configuration defined for the module and iterate over it:

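public void Publish(object sender, EventArgs args)
{
    // There is no context database inside the event handler,
    // so ask the configuration factory for "master" explicitly.
    Database master = Sitecore.Configuration.Factory.GetDatabase("master");
    Item configRoot = master.GetItem(Identifiers.ConfigItemID);
    if (configRoot == null)
    {
        return; // the module configuration item has not been created
    }

    string query = string.Format("child::*[@@templateid='{0}' or @@templateid='{1}']", Identifiers.SitemapFileTemplateID, Identifiers.SitemapIndexFileTemplateID);

    // SelectItems() can return null when nothing matches the query.
    Item[] siteConfigs = configRoot.Axes.SelectItems(query) ?? new Item[0];

    foreach (Item siteConfig in siteConfigs)
    {
        if (siteConfig.TemplateID == Identifiers.SitemapIndexFileTemplateID)
        {
            publishSiteIndex(siteConfig);
        }
        else if (siteConfig.TemplateID == Identifiers.SitemapFileTemplateID)
        {
            publishSite(siteConfig);
        }
    }
}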

Here we’re fetching a reference to the Master database using Sitecore’s configuration factory. This is necessary because there is no context database available in the event handler. Then we use that database to load the root configuration item we talked about last week. The ID of the item we need to load is defined in a static class. (It’s a common pattern in Sitecore development to define a class holding the IDs of items and fields that you will need to reference later. You can generate these using tools like Hedgehog TDS and T4 templates, or you can maintain them manually – all of the IDs in this example code are defined that way.) We can then run a query on that item to find all its children which are either Sitemap file definitions or Sitemap Index File definitions, and iterate over those.

If we find an item based on the Sitemap Index File template then we need to process it:

private void publishSiteIndex(Item siteIndexConfig)
{
    SiteIndexConfiguration sc = new SiteIndexConfiguration(siteIndexConfig);

    SitemapIndex si = new SitemapIndex();

    var urlOptions = new Sitecore.Links.UrlOptions()
    {
        AlwaysIncludeServerUrl = true,
        LanguageEmbedding = Sitecore.Links.LanguageEmbedding.AsNeeded,
        ShortenUrls = true
    };

    foreach (Item s in sc.SiteConfigurations)
    {
        SiteConfiguration cfg = publishSite(s);

        Item root = cfg.SitemapSourceDatabase.GetItem(cfg.SitemapRootItem);
        string rootUrl = Sitecore.Links.LinkManager.GetItemUrl(root, urlOptions);
        Uri rootUri = new Uri(rootUrl, UriKind.Absolute);

        Uri fileUri = new Uri(rootUri, "/" + cfg.SitemapFilename);

        si.Add(new SitemapIndexItem() { 
            Location = fileUri.ToString(),
            LastModified = DateTime.Now
        });
    }

    string filename = System.Web.Hosting.HostingEnvironment.MapPath("/" + sc.SitemapIndexFilename);
    saveXmlToFile(filename, si.Serialise());
}

This method gets called passing in the item that represents the Sitemap Index file configuration that an editor has set up. So the first thing to do is parse it, in order to get the configuration into a strongly typed class. In the real world this is best done with a tool like Glass Mapper but to keep things simple and avoid dependencies here I’ve just used a simple model class:

public class SiteIndexConfiguration
{
    public string SitemapIndexFilename { get; private set; }
    public IEnumerable<Item> SiteConfigurations { get; private set; }

    public SiteIndexConfiguration(Item cfg)
    {
        SitemapIndexFilename = cfg.Fields[Identifiers.SitemapIndexFilenameFieldID].Value;

        var siteConfigs = cfg.Axes.SelectItems(string.Format("./*[@@templateid='{0}']", Identifiers.SitemapFileTemplateID));
        if (siteConfigs != null && siteConfigs.Length > 0)
        {
            SiteConfigurations = siteConfigs;
        }
        else
        {
            SiteConfigurations = new List<Item>();
        }
    }
}

It just extracts the relevant fields from the item into properties… This pattern will get reused throughout the rest of the code too.

With that loaded, next we set up a UrlOptions object that we’ll use to generate some links later, and then iterate each of the Sitemap file configurations that have been set up for inclusion in this index file.

For each of these configurations the code runs the publish process for that sitemap file (we’ll come back to that in a second) and gets back the configuration object from that item. We then use that configuration data to generate the information that will be written into the index file. Normally when you’re after a link to an item in Sitecore you just ask LinkManager – but here we need the URL of something that isn’t a Sitecore item. So we use LinkManager to generate the URL of the root item for the site we’re indexing, and then we use the Uri type to merge the relative Uri for the sitemap file with the absolute Uri of the root item. The absolute URL is necessary here as the Sitemap Index file needs to specify the exact location of the child sitemaps.
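As a minimal standalone sketch of that Uri merge (the class and method names here are made up for illustration, and the URL is just the sample host used in this post), the leading "/" on the file name is what makes the result land at the site root rather than underneath the root item’s path:

```csharp
using System;

public static class SitemapUriDemo
{
    // Combines the absolute URL of a site's root item with a sitemap file name.
    // The leading "/" makes the file name root-relative, so the path part of
    // the root item's URL is discarded and only the scheme and host are kept.
    public static string BuildFileUrl(string rootItemUrl, string sitemapFilename)
    {
        Uri rootUri = new Uri(rootItemUrl, UriKind.Absolute);
        Uri fileUri = new Uri(rootUri, "/" + sitemapFilename);
        return fileUri.ToString();
    }
}
```

Calling SitemapUriDemo.BuildFileUrl("http://test/en/sitecore/content/Home", "default-sitemap.xml") gives http://test/default-sitemap.xml.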

With the correct link generated we can then create a SitemapIndexItem object to represent this data ready for us to serialise it into the Sitemap index file.

Finally the code generates the physical file name on the server that the Sitemap Index file is going to be written to, and we serialise the data. The saveXmlToFile() method just writes the XML we generate into a UTF-8 text file using a TextWriter and the standard saving behaviour of the LINQ to XML classes. We call the Serialise() method on the SitemapIndex data to generate an XElement which we can save. Having initially tried to write this code using XmlSerializer, I came to the conclusion that the code required to get the namespaces and null-element handling correct wasn’t worth the extra effort. Manually serialising the data to exactly the required format turned out simpler. The code for these data classes is as follows:
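The saveXmlToFile() helper isn’t listed in this post, so the following is only a sketch of what such a method might look like, based on the description above. The class name and the choice to omit the byte order mark are assumptions:

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class XmlFileWriter
{
    // Hypothetical stand-in for the saveXmlToFile() helper described in the text.
    public static void SaveXmlToFile(string filename, XElement xml)
    {
        // Wrapping the element in a document means an XML declaration is written.
        XDocument doc = new XDocument(new XDeclaration("1.0", "utf-8", null), xml);

        // append: false overwrites any file left by a previous publish.
        // UTF8Encoding(false) writes UTF-8 without a byte order mark (an assumption).
        using (TextWriter writer = new StreamWriter(filename, false, new UTF8Encoding(false)))
        {
            doc.Save(writer);
        }
    }
}
```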

public class SitemapIndex
{
    private List<SitemapIndexItem> _items = new List<SitemapIndexItem>();

    public IEnumerable<SitemapIndexItem> IndexItems { get { return _items; } }

    public static readonly XNamespace Namespace = "http://www.sitemaps.org/schemas/sitemap/0.9";

    public void Add(SitemapIndexItem itm)
    {
        _items.Add(itm);
    }

    public XElement Serialise()
    {
        // Qualifying the element name with the namespace makes LINQ to XML
        // write the default xmlns declaration on the root element itself.
        XElement root = new XElement(Namespace + "sitemapindex");

        foreach (SitemapIndexItem i in _items)
        {
            root.Add(i.Serialise());
        }

        return root;
    }
}

public class SitemapIndexItem
{
    public string Location { get; set; }
    public DateTime LastModified { get; set; }

    public XElement Serialise()
    {
        // Child elements need to be in the same namespace as the root,
        // otherwise they are written with an empty xmlns="" attribute.
        XNamespace ns = SitemapIndex.Namespace;
        XElement root = new XElement(ns + "sitemap");

        root.Add(new XElement(ns + "loc", Location));
        root.Add(new XElement(ns + "lastmod", LastModified.ToString("yyyy-MM-dd")));

        return root;
    }
}

Again, this pattern of having data classes that know how to serialise themselves will be repeated later…
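For anyone less familiar with how LINQ to XML deals with default namespaces, here is a small standalone sketch. The class and method names are invented for illustration and the values are sample data. It builds the same shape of document by qualifying every element name with an XNamespace:

```csharp
using System.Xml.Linq;

public static class NamespaceDemo
{
    public static readonly XNamespace Sitemap = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Builds a one-entry sitemap index. Every element name is qualified with
    // the namespace, so the serialised output carries a single default xmlns
    // declaration on the root and none on the children.
    public static XElement BuildIndex(string location, string lastModified)
    {
        return new XElement(Sitemap + "sitemapindex",
            new XElement(Sitemap + "sitemap",
                new XElement(Sitemap + "loc", location),
                new XElement(Sitemap + "lastmod", lastModified)));
    }
}
```

Calling ToString() on the result of NamespaceDemo.BuildIndex("http://test/default-sitemap.xml", "2014-10-12") produces a sitemapindex element with the xmlns attribute on the root only.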

So the next bit of code we need is the publishSite() method mentioned above – to save the data for an individual sitemap. This is a bit more complex, so we’ll break it into pieces. First of all, we’ll generate the data and store it:

private SiteConfiguration publishSite(Item siteConfig)
{
    SiteConfiguration sc = new SiteConfiguration(siteConfig);

    SitemapUrlSet sitemap = processSite(sc);

    if (sitemap != null)
    {
        string filename = System.Web.Hosting.HostingEnvironment.MapPath("/" + sc.SitemapFilename);
        saveXmlToFile(filename, sitemap.Serialise());
    }

    return sc;
}

This loads the configuration data that was passed in, calls processSite() to generate the sitemap data object, serialises it, and then saves the XML to disk in the same way we did before. The config class does a bit more this time:

public class SiteConfiguration
{
    public string SitemapFilename { get; private set; }
    public string SitemapSourceDatabaseName { get; private set; }
    public Database SitemapSourceDatabase { get; private set; }
    public ID SitemapRootItem { get; private set; }
    public IEnumerable<string> SitemapIncludeLanguages { get; private set; }
    public IEnumerable<ID> SitemapIncludeTemplates { get; private set; }

    private Language getLanguage(string name)
    {
        Language l;

        if (Language.TryParse(name, out l))
        {
            return l;
        }

        return null;
    }

    public SiteConfiguration(Item siteItem)
    {
        SitemapFilename = siteItem.Fields[Identifiers.SitemapFilenameFieldID].Value;

        SitemapSourceDatabaseName = siteItem.Fields[Identifiers.SitemapSourceDatabaseFieldID].Value;

        SitemapSourceDatabase = Sitecore.Configuration.Factory.GetDatabase(SitemapSourceDatabaseName);

        ID rootItem = ID.Null;
        ID.TryParse(siteItem.Fields[Identifiers.SitemapRootItemFieldID].Value, out rootItem);
        SitemapRootItem = rootItem;

        SitemapIncludeLanguages = siteItem.Fields[Identifiers.SitemapIncludeLanguagesFieldID].Value.Split('|')
            .Where(s => !string.IsNullOrWhiteSpace(s))
            .Select(s => ID.Parse(s))
            .Select(i => SitemapSourceDatabase.GetItem(i))
            .Where(l => l != null) // skip language items missing from the source database
            .Select(l => l.Name)
            .ToList(); // materialise once, as this list is checked for every content item

        SitemapIncludeTemplates = siteItem.Fields[Identifiers.SitemapIncludeTemplatesFieldID].Value.Split('|')
            .Where(s => !string.IsNullOrWhiteSpace(s))
            .Select(s => ID.Parse(s))
            .ToList();
    }
}

Again, the code is extracting the fields from the configuration template for the sitemap that we defined last week. There’s a bit more effort here to get the content database we need, parse out the ID values and extract the lists of languages and templates to include. LINQ projections are used to get the data in the appropriate format.
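To see the split-and-project step in isolation, here is a rough sketch that runs without Sitecore. It uses Guid as a stand-in for Sitecore’s ID type, and the class name is hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MultilistParser
{
    // Turns a pipe-delimited raw field value into a list of IDs.
    // Guid stands in for Sitecore's ID type so the sketch has no dependencies.
    public static List<Guid> ParseIds(string rawValue)
    {
        return (rawValue ?? string.Empty)
            .Split('|')
            .Where(s => !string.IsNullOrWhiteSpace(s)) // an empty field yields one empty string
            .Select(s => Guid.Parse(s))                // accepts the braced {…} format
            .ToList();                                 // evaluate once rather than on every enumeration
    }
}
```

The ToList() call at the end matters more than it looks: without it the query is re-run each time the result is enumerated.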

So the processSite() method looks like:

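private SitemapUrlSet processSite(SiteConfiguration sc)
{
    Item rootItem = sc.SitemapSourceDatabase.GetItem(sc.SitemapRootItem);
    if (rootItem == null)
    {
        // Nothing to generate; publishSite() skips saving when it gets null back.
        return null;
    }

    SitemapUrlSet sitemap = new SitemapUrlSet();

    process(sc, sitemap, rootItem);

    return sitemap;
}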

This doesn’t do a great deal – it fetches the root item from the database, creates an empty SitemapUrlSet object to gather the data and calls process() on the root item to start generating data. This code is factored out into a recursive method, as we need to transform the tree of Sitecore content into a list of entries for the sitemap file. That process is broken up into two parts – first we generate the set of URLs for the current item (it can be more than one URL due to language versions). Second we process any children this item has:

private void process(SiteConfiguration sc, SitemapUrlSet urls, Item item)
{
    IEnumerable<SitemapUrl> u = buildSitemapUrl(sc, item);
    if (u != null && u.Any())
    {
        urls.AddRange(u);
    }

    foreach (Item child in item.Children)
    {
        process(sc, urls, child);
    }
}
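If the recursion is easier to follow without the Sitecore types, the same tree-to-list idea can be sketched with a made-up Node class (nothing here is part of the module itself):

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a Sitecore item: a name plus child nodes.
public class Node
{
    public string Name { get; private set; }
    public List<Node> Children { get; private set; }

    public Node(string name)
    {
        Name = name;
        Children = new List<Node>();
    }
}

public static class TreeFlattener
{
    // Depth-first walk: record the current node, then recurse into each child.
    // The shared output list plays the role of the SitemapUrlSet.
    public static void Collect(Node node, List<string> output)
    {
        output.Add(node.Name);

        foreach (Node child in node.Children)
        {
            Collect(child, output);
        }
    }
}
```

Each parent is recorded before its children, and a whole subtree is finished before the walk moves on to the next sibling.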

And then we can generate the list of URLs for an individual item:

private IEnumerable<SitemapUrl> buildSitemapUrl(SiteConfiguration sc, Item item)
{
    if (sc.SitemapIncludeTemplates.Any() && !sc.SitemapIncludeTemplates.Contains(item.TemplateID))
    {
        return null;
    }

    if (item.Fields.Contains(Identifiers.SitemapIncludeFieldID))
    {
        CheckboxField cf = (CheckboxField)item.Fields[Identifiers.SitemapIncludeFieldID];
        if (!cf.Checked)
        {
            return null;
        }
    }

    List<SitemapUrl> urls = new List<SitemapUrl>();
    foreach (Language l in item.Languages)
    {
        if (sc.SitemapIncludeLanguages.Contains(l.Name))
        {
            Item i = sc.SitemapSourceDatabase.GetItem(item.ID, l);
            if (i.Versions.Count > 0)
            {
                SitemapUrl u = processLanguage(sc, i, l);
                if (u != null)
                {
                    urls.Add(u);
                }
            }
        }
    }
    return urls;
}

This method starts by checking if the item being processed is based on one of the templates that have been specified for inclusion into the sitemap file. If templates have been specified, but this item is not based on one of them then we skip this item by returning null. Next we check if this item includes our “should this item be included in the sitemap” flag. If it does we check its value. If it says no, then we skip the item.

Finally we need to process each of the languages our item defines. We test if this language should be included or not. If the language should be included then we re-load the item in that language and check we got a valid version. If so, we process that version to generate the Sitemap file entry data:

private SitemapUrl processLanguage(SiteConfiguration sc, Item item, Language l)
{
    SitemapUrl url = new SitemapUrl();

    var urlOptions = new Sitecore.Links.UrlOptions()
    {
        AlwaysIncludeServerUrl = true,
        LanguageEmbedding = Sitecore.Links.LanguageEmbedding.AsNeeded,
        ShortenUrls = true,
        Language = l
    };

    url.Location = Sitecore.Links.LinkManager.GetItemUrl(item, urlOptions);

    DateField df = (DateField)item.Fields[Identifiers.__UpdatedFieldID];
    url.LastModified = df.DateTime;

    if (item.Fields.Contains(Identifiers.SitemapPriorityFieldID))
    {
        // Parse with the invariant culture so "0.7" is read the same way
        // whatever the server's regional settings are.
        float f;
        if (float.TryParse(item.Fields[Identifiers.SitemapPriorityFieldID].Value, System.Globalization.NumberStyles.Float, System.Globalization.CultureInfo.InvariantCulture, out f))
        {
            url.Priority = f;
        }
    }

    if (item.Fields.Contains(Identifiers.SitemapChangeFrequencyFieldID))
    {
        // Case-insensitive parse, so "daily" and "Daily" both work.
        ChangeFrequency cf;
        if (Enum.TryParse<ChangeFrequency>(item.Fields[Identifiers.SitemapChangeFrequencyFieldID].Value, true, out cf))
        {
            url.ChangeFrequency = cf;
        }
    }

    return url;
}

As we did with the index file above, we use LinkManager to generate the correct URL for the item. And we extract all the relevant bits of data out of the website item into our SitemapUrl data object. We check if some of the fields exist before processing them, in case we’re generating data for an item which doesn’t provide our custom SitemapItemExtensions data template.

So the last thing we need is the code for the SitemapUrlSet and SitemapUrl classes that we need to serialise into our sitemap xml data:

public class SitemapUrlSet
{
    private List<SitemapUrl> _urls = new List<SitemapUrl>();

    public IEnumerable<SitemapUrl> SitemapUrls { get { return _urls; } }

    public void Add(SitemapUrl url)
    {
        _urls.Add(url);
    }

    public void AddRange(IEnumerable<SitemapUrl> urls)
    {
        _urls.AddRange(urls);
    }

    public static readonly XNamespace Namespace = "http://www.sitemaps.org/schemas/sitemap/0.9";

    public XElement Serialise()
    {
        // Qualifying the element name with the namespace makes LINQ to XML
        // write the default xmlns declaration on the root element itself.
        XElement root = new XElement(Namespace + "urlset");
        root.Add(new XAttribute(XNamespace.Xmlns + "image", SitemapImage.Namespace));

        foreach (SitemapUrl u in _urls)
        {
            root.Add(u.Serialise());
        }

        return root;
    }
}

public enum ChangeFrequency 
{
    Always,
    Hourly,
    Daily,
    Weekly,
    Monthly,
    Yearly,
    Never
}

public class SitemapUrl
{
    public string Location { get; set; }
    public DateTime? LastModified { get; set; }
    public ChangeFrequency? ChangeFrequency { get; set; }
    public Single? Priority { get; set; }

    public XElement Serialise()
    {
        // Child elements need to be in the same namespace as the root,
        // otherwise they are written with an empty xmlns="" attribute.
        XNamespace ns = SitemapUrlSet.Namespace;
        XElement root = new XElement(ns + "url");

        root.Add(new XElement(ns + "loc", Location));
        
        if (LastModified.HasValue)
        {
            root.Add(new XElement(ns + "lastmod", LastModified.Value.ToString("yyyy-MM-dd")));
        }

        if (ChangeFrequency.HasValue)
        {
            root.Add(new XElement(ns + "changefreq", ChangeFrequency.Value.ToString().ToLowerInvariant()));
        }

        if (Priority.HasValue)
        {
            // The invariant culture guarantees a "." decimal separator.
            root.Add(new XElement(ns + "priority", Priority.Value.ToString("0.0", System.Globalization.CultureInfo.InvariantCulture)));
        }

        return root;
    }
}

And that’s it for the code. We can now set up a test, and try building a sitemap. If we set up a Sitemap Index:

image

And configure it to contain a single Sitemap file that outputs English and Japanese content stored in the Sample Item template:

image

Then we can set up a bit of test content and hit publish. What we find on disk afterwards is two XML files. First, in the index file we find:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://test/default-sitemap.xml</loc>
    <lastmod>2014-59-12</lastmod>
  </sitemap>
</sitemapindex>

And then in the sitemap itself we find:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://test/en/sitecore/content/Home</loc>
    <lastmod>2014-10-05</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://test/ja-JP/sitecore/content/Home</loc>
    <lastmod>2013-09-11</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://test/en/sitecore/content/Home/Global shared content/Sample</loc>
    <lastmod>2014-29-21</lastmod>
  </url>
  <url>
    <loc>http://test/en/sitecore/content/Home/Sample Page</loc>
    <lastmod>2014-12-03</lastmod>
  </url>
</urlset>

Bingo. There’s our sitemap data! It shows the appropriate data, as defined by our test items.
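One thing worth checking in output like this is the lastmod values: 2014-59-12 and 2014-29-21 are not valid dates, since the middle component is greater than 12. A plausible cause (an assumption on my part, the files themselves don’t say) is a run that used the format string "yyyy-mm-dd", where lower-case "mm" means minutes, instead of "yyyy-MM-dd". A quick sketch of the difference, with an invented class name and a made-up timestamp:

```csharp
using System;
using System.Globalization;

public static class LastModFormat
{
    // "MM" is the two-digit month, which is what a sitemap date needs.
    public static string Correct(DateTime value)
    {
        return value.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
    }

    // "mm" is the two-digit minute, so the middle component can run up to 59.
    public static string Mistyped(DateTime value)
    {
        return value.ToString("yyyy-mm-dd", CultureInfo.InvariantCulture);
    }
}
```

For a hypothetical timestamp of 12 October 2014 at 09:59, Correct() gives 2014-10-12 while Mistyped() gives 2014-59-12.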

Now there’s quite a bit that could be done here to improve this code – but it proves that the approach works. As mentioned, the usual error checking and tools like Glass Mapper could be applied. One other thing that struck me is that if this code were run against a large site it would create lots of small objects in memory before serialising them. So a good optimisation would probably be to write the serialised data directly into the output files without the need for the intermediate objects. That would allow sitemaps for much larger sites to be generated more efficiently.
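As a rough idea of what that streaming approach might look like, here is a sketch using XmlWriter. It isn’t part of the module, the class name is invented, and it only handles loc and lastmod, with each entry passed in as a simple key/value pair of strings:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class StreamingSitemapWriter
{
    private const string SitemapNamespace = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Writes each entry straight to the output as it is enumerated, so no
    // intermediate list of URL objects or XElement tree is held in memory.
    // Key = location, Value = last modified date already formatted as yyyy-MM-dd.
    public static void Write(TextWriter output, IEnumerable<KeyValuePair<string, string>> entries)
    {
        XmlWriterSettings settings = new XmlWriterSettings { Indent = true };

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("urlset", SitemapNamespace);

            foreach (KeyValuePair<string, string> entry in entries)
            {
                writer.WriteStartElement("url", SitemapNamespace);
                writer.WriteElementString("loc", SitemapNamespace, entry.Key);
                writer.WriteElementString("lastmod", SitemapNamespace, entry.Value);
                writer.WriteEndElement(); // url
            }

            writer.WriteEndElement(); // urlset
            writer.WriteEndDocument();
        }
    }
}
```

If the entries come from a lazily evaluated sequence (an iterator that walks the content tree, say), only one entry needs to exist at a time.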

And next week, we’ll take a look at how we can add image data to the sitemap files…
