A thought experiment in text parsing

Every so often, every developer finds themselves having to parse data out of text. There are loads of ways of approaching this task, but a lot of pretty unsatisfactory solutions start from “I’ll just split up the text by whitespace” or “Hey, let’s use regular expressions!”. You all remember what regular expressions lead to, right?

As someone who’s always on the lookout for something interesting and new to experiment with, I came across an alternative approach to parsing text recently. A blog post I read (I forget what it was, or I’d give credit) linked to the Sprache project on GitHub. This is a text parsing library which lets you construct the descriptions of the text to parse using Linq-style expressions.

So I thought I’d invent an idea for an experimental project to test this out…

What was the idea?

Imagine your client phones you up one day and says “Our SEO people have sent me a text file full of changes we should make to the structure of our content. We need you to implement all the changes right away!”. Not a million miles from something that might happen in reality… You could make all the changes by hand, or maybe you could automate them? Well the right answer is almost certainly something involving PowerShell Extensions, but for the purposes of having an excuse to learn about Sprache, how about this:

Let’s assume there are three operations described in the text file, and that the SEO people were nice enough to use a standard pattern for each one:

  • Delete an item
    delete <item>
  • Move an item
    move <item> to <location>
  • Create an item
    create <template> named <name> under <location> [with <fieldName>="<value>"[,etc]]

What we’d like our automation to do is take a list of these commands from some text, and then attempt to process each one in turn. What code do we need to achieve that?

First job: Parsing an item or path to operate on

Taking a TDD-style approach to this, what are the initial tests we might want for parsing a delete command? Well, looking at the example above, we need to parse the “delete” operation and we need to parse out the definition of an item to delete. Sitecore usually identifies items by paths or IDs, so that suggests we need to be able to parse either a path or a GUID here.

As you might expect from native .NET code, Sprache is a strongly typed parser. So if we’re going to want to parse a “GUID or Path” field, we need to be able to parse the text into a type holding this data. A test case might look like this for GUIDs:

[TestMethod]
public void ItemIDParsesValidID()
{
    string guid = "{582ccf36-b6e4-49f0-9c35-2d8e40b5ef3d}";
    var result = CommandTextParser.ItemId.TryParse(guid);

    Assert.IsTrue(result.WasSuccessful, result.Message);
    Assert.AreEqual(Guid.Parse(guid), result.Value.Id);
    Assert.AreEqual(string.Empty, result.Value.Path);
}

Or this for paths:

[TestMethod]
public void ItemPathParsesForValidPath()
{
    string path = "/alpha/bravo";

    var result = CommandTextParser.ItemPath.TryParse(path);

    Assert.IsTrue(result.WasSuccessful, result.Message);
    Assert.AreEqual(path, result.Value.Path);
    Assert.AreEqual(Guid.Empty, result.Value.Id);
}

So what code do we need to make these tests pass? Well, Sprache tends to have its parsing methods defined as statics to allow composing them easily. So we’ll need a static class to hold our parsing code. As well as the “ItemId” and “ItemPath” parsers from the tests above, it’s going to want a combined “path or ID” parser for the commands to use:

public static class CommandTextParser
{
    public static Parser<ItemIdenitfier> PathOrID = ??
}

And we’re going to need that class our test implies to encapsulate our output:

public class ItemIdenitfier
{
    public string Path { get; set; }
    public Guid Id { get; set; }
}

So with the boilerplate out of the way, what do we actually want to do with the parsing? In English, we want to parse “A path, or a GUID, and maybe any surrounding whitespace”. Sprache makes that pretty easy to express, because you can build more complex expressions out of simpler ones:

public static class CommandTextParser
{
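    // In the real class, ItemPath and ItemId must be declared above this field:
    // static field initialisers run in textual order, so they'd be null otherwise.
    public static Parser<ItemIdenitfier> PathOrID =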
        ItemPath
        .XOr(ItemId)
        .Token();
}

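We’ll get to defining the “ItemPath” and “ItemId” parser code in a bit. This expression binds them together with XOr(), which tries the first parser and only falls back to the second if the first one fails without consuming any input. That fits here, because the very first character (a slash or an opening curly brace) tells us which of the two applies. The Token() method is shorthand for telling Sprache “ignore any surrounding whitespace”.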
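To see how XOr() and Or() differ, here’s a minimal sketch. The two parsers in it (Abc and Abd) are made up for illustration and aren’t part of the project; they deliberately share a prefix so the difference shows up:

```csharp
using System;
using Sprache;

public static class XOrDemo
{
    // Two hypothetical alternatives that share the prefix "ab".
    public static readonly Parser<string> Abc = Parse.String("abc").Text();
    public static readonly Parser<string> Abd = Parse.String("abd").Text();

    public static void Main()
    {
        // Or() backtracks: Abc fails on the 'd', then Abd is tried from the start.
        Console.WriteLine(Abc.Or(Abd).TryParse("abd").WasSuccessful);

        // XOr() commits: Abc consumed "ab" before failing, so Abd is not tried.
        Console.WriteLine(Abc.XOr(Abd).TryParse("abd").WasSuccessful);
    }
}
```

For the path and ID parsers the alternatives never share a first character, so XOr() is a safe choice and should give clearer error messages than Or().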

The Item ID parser is the simpler of the two, so let’s start with that. A simplistic definition of a GUID is “an opening curly brace, followed by a string of letters, numbers or hyphens, and ending with a closing curly brace”. That’s not a particularly rigorous definition – but we can live with that for the moment. A parser for that can be written as:

public static Parser<ItemIdenitfier> ItemId =
    from openBrace in Parse.Char('{')
    from id in Parse.LetterOrDigit.Or(Parse.Char('-')).Repeat(36).Text()
    from closeBrace in Parse.Char('}')
    select new ItemIdenitfier() { Path = string.Empty, Id = Guid.Parse(id) };

This uses Linq syntax to find specific things and assign them to local variables. First we find the open curly, then we find the body of the GUID, followed by the closing curly brace. The body is a run of 36 characters, each of which can be only a letter, a digit or a hyphen. The final Text() method turns the IEnumerable<char> returned by the Repeat() into a string. Then finally we use the local variables we captured to populate one of our ItemIdentifier objects to return.

You could write a more rigorous version of this that makes sure that the right pattern of characters and dashes is presented – but I’ll leave that as an exercise for the reader…

It’s way more readable than the equivalent regular expression, eh?
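If you’d like a head start on the stricter version, something along these lines might do. It’s only a sketch: the names StrictGuidParser, HexDigit, Hex and BracedGuid are my own inventions, and it returns a plain Guid rather than the ItemIdenitfier type so that it stands on its own:

```csharp
using System;
using Sprache;

public static class StrictGuidParser
{
    // A single hexadecimal digit, upper or lower case.
    static readonly Parser<char> HexDigit =
        Parse.Chars("0123456789abcdefABCDEF".ToCharArray());

    // Hypothetical helper: exactly `count` hex digits, returned as a string.
    static Parser<string> Hex(int count)
    {
        return HexDigit.Repeat(count).Text();
    }

    // A braced GUID in the usual 8-4-4-4-12 layout.
    public static readonly Parser<Guid> BracedGuid =
        from open in Parse.Char('{')
        from a in Hex(8)
        from dash1 in Parse.Char('-')
        from b in Hex(4)
        from dash2 in Parse.Char('-')
        from c in Hex(4)
        from dash3 in Parse.Char('-')
        from d in Hex(4)
        from dash4 in Parse.Char('-')
        from e in Hex(12)
        from close in Parse.Char('}')
        select Guid.Parse(a + "-" + b + "-" + c + "-" + d + "-" + e);

    public static void Main()
    {
        var good = BracedGuid.TryParse("{582ccf36-b6e4-49f0-9c35-2d8e40b5ef3d}");
        var bad = BracedGuid.TryParse("{------------------------------------}");
        Console.WriteLine(good.WasSuccessful);
        Console.WriteLine(bad.WasSuccessful);
    }
}
```

The second input is 36 hyphens in braces, which the simple version above would accept (and then blow up inside Guid.Parse), but which this one rejects as a normal parse failure.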

Parsing a path could be done by saying “A path is any string made up of slashes and valid item name characters” but that probably allows paths with double slashes, which don’t seem relevant to a Sitecore solution. Instead, the definition is probably that a path is “One or more sets of ‘A slash followed by some valid item name characters'”. Following the pattern of building up complex behaviour from simpler parsers, we can define an item name as:

public static IEnumerable<char> InvalidNameCharacters = new List<char> { '\\', '/', ':', '?', '"', '<', '>', '|', '[', ']', ' ', '!', '(', ')', '%', '#', '@', '£', '$', '^', '&', ';', '~' };

public static Parser<string> ItemName =
    Parse.CharExcept(InvalidNameCharacters).AtLeastOnce().Text().Token();

We declare a set of characters that aren’t valid in names (your ideal set may vary – that’s just an example) and we say that a name is “a string containing anything but these characters”.

From there, we can say that one segment of a path is a slash (leaning either way, so both “/” and “\” are accepted) followed by a name:

public static Parser<string> PathSegment =
    from slash in Parse.Chars(new char[] { '\\', '/' })
    from name in ItemName
    select name;

And then we can define an entire path in the following form:

public static Parser<ItemIdenitfier> ItemPath =
    from parts in (
        from firstSegment in PathSegment
        from otherSegments in PathSegment.Many()
        select firstSegment.Concatenate(otherSegments)
    )
    from trailingSlash in Parse.Chars(new char[] { '\\', '/' }).Optional()
    select new ItemIdenitfier() { Id=Guid.Empty, Path = "/" + string.Join("/", parts) };

This says “First match one path segment. Then match zero or more further path segments” and it then returns a list of the names in the path segments found. Optionally there may be a trailing slash on the end of the whole thing. And the result of the parsing is an ItemIdentifier where the Path is built up by joining those item names back together again.
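One thing the snippet above relies on is a Concatenate() extension method, which isn’t defined anywhere in this post. As far as I know it isn’t part of Sprache’s public API either, so treat the following as a guess at its shape (the class name EnumerableExtensions is made up): a helper that sticks a single item on the front of a sequence.

```csharp
using System.Collections.Generic;

public static class EnumerableExtensions
{
    // Assumed shape of the helper: yields `first`, then every item in `rest`.
    public static IEnumerable<T> Concatenate<T>(this T first, IEnumerable<T> rest)
    {
        yield return first;
        foreach (var item in rest)
        {
            yield return item;
        }
    }
}
```

With that in scope, "a".Concatenate(new[] { "b", "c" }) gives the sequence a, b, c. For the path parser specifically, PathSegment.AtLeastOnce() would probably express the same “one or more segments” idea without needing a helper at all.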

And with that code in place, the two tests defined above should now pass.

Building on that to parse the delete operation

Now we can parse a path, we can move on to parsing the delete operation. We’re going to want a different output from the parser here – we’re going to need a class to represent a deletion. However, in the future we’re going to want to create a single parsing method that can parse any of our three commands – so the delete command parser will need to be defined in terms of a more generic “any command” type. To me this smells like a situation for the Command pattern, as later we’re going to need this class to know how to process the deletion.

So we can define a Command pattern interface:

public interface ICommand
{
    string Execute();
}

And we can then define our delete command object in terms of that:

public class DeleteCommand : ICommand
{
    public ItemIdenitfier Item { get; set; }

    public string Execute()
    {
        // fill this in later
        throw new NotImplementedException();
    }
}

Parsing the data for that is now pretty easy:

public static Parser<ICommand> DeleteCommand =
    from cmd in Parse.IgnoreCase("delete").Token()
    from item in PathOrID
    select new DeleteCommand() { Item = item };

The command always starts with “delete” and we don’t care about case or white space. And it’s always followed by an identifier which might be a GUID or a path. From there we can return a DeleteCommand with that item in it, ready for processing…

And we can check this with some further tests. Things like:

[TestMethod]
public void DeleteParsesWithPath()
{
    var result = CommandTextParser.DeleteCommand.TryParse(@"delete \abc\def\x");

    Assert.IsTrue(result.WasSuccessful, result.Message);

    var cmd = result.Value as DeleteCommand;

    Assert.AreEqual("/abc/def/x", cmd.Item.Path);
}

[TestMethod]
public void DeleteParsesWithID()
{
    var result = CommandTextParser.DeleteCommand.TryParse(@"delete {7f700de5-9f24-4a30-a4f7-ed3421abb563}");

    Assert.IsTrue(result.WasSuccessful, result.Message);

    var cmd = result.Value as DeleteCommand;

    Assert.AreEqual(Guid.Parse("{7f700de5-9f24-4a30-a4f7-ed3421abb563}"), cmd.Item.Id);
}

And with all the code above in place these will pass too.

Now to actually delete…

Writing code in Sitecore to delete an item isn’t hard – but sticking to our pattern of having tests to validate the code is less easy. Our saviour here comes in the form of the Sitecore.FakeDb project. This allows you to run an “in-RAM” version of the Sitecore data APIs in your test project.

It’s very easy to install: All you need is the NuGet package added to your test project, a few DLLs from a valid Sitecore install (which can’t go in NuGet for licensing reasons) and your Sitecore license file.

Once you have that in place, you’re free to write tests against Sitecore item data like this:

[TestMethod]
public void ValidDeleteWorks()
{
    CommandProcessor cp = new CommandProcessor();
    using (Db db = new Db())
    {
        var f1 = new DbItem("Folder1");
        var p1 = new DbItem("Page");
        f1.Add(p1);
        db.Add(f1);

        Assert.AreEqual(1, f1.Children.Count);

        DeleteCommand cmd = new DeleteCommand();
        cmd.Item = new ItemIdenitfier() { Path="/sitecore/content/Folder1/Page" };

        var result = cmd.Execute();

        Assert.IsTrue(result.StartsWith("deleted", StringComparison.InvariantCultureIgnoreCase));
        Assert.AreEqual(0, f1.Children.Count);
    }
}

It creates an instance of the “fake” database and then adds a folder containing an item. It creates a DeleteCommand which specifies the item under the test folder, and then it runs the command.

What do we need to make that test pass? Just the implementation of the Execute() method from our Command pattern above. Something like:

public class DeleteCommand : ICommand
{
    public ItemIdenitfier Item { get; set; }

    public string Execute()
    {
        // Assumes ItemIdenitfier overrides ToString() to return its ID or path
        var itm = Sitecore.Context.Database.GetItem(Item.ToString());
        if (itm == null)
        {
            throw new ArgumentException("The item " + Item.ToString() + " was not found", "cmd.Item");
        }

        itm.Delete();

        return "Deleted " + Item.ToString();
    }
}

Note how the code can refer to Sitecore.Context.Database and FakeDb magically replaces that with the in-memory database for the purposes of this test.
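The Execute() code above calls Item.ToString() and passes the result to GetItem(), but the ItemIdenitfier class as shown earlier has no ToString() override, so out of the box it would only return the type name. Here’s a sketch of what that override might look like. It assumes the ID should win when both are set, and that GetItem() is happy with a braced ID string in place of a path; neither is spelled out in this post:

```csharp
using System;

public class ItemIdenitfier
{
    public string Path { get; set; }
    public Guid Id { get; set; }

    // Assumption: prefer the ID when one was parsed, otherwise use the path.
    public override string ToString()
    {
        // "B" formats the GUID with surrounding braces.
        return Id != Guid.Empty ? Id.ToString("B") : Path;
    }
}
```

That keeps the command classes simple, since they never need to care which kind of identifier the parser found.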

Easy, huh?

Moving items

Creating the move command is pretty similar – it just needs to parse out two item identifiers because you need the item and the target folder:

public static Parser<ICommand> MoveCommand =
    from cmd in Parse.IgnoreCase("move").Token()
    from item in PathOrID
    from to in Parse.IgnoreCase("to").Token()
    from newLocation in PathOrID
    select new MoveCommand() { Item = item, NewLocation = newLocation };

And the MoveCommand class is similarly easy:

public class MoveCommand : ICommand
{
    public ItemIdenitfier Item { get; set; }
    public ItemIdenitfier NewLocation { get; set; }

    public string Execute()
    {
        var folder = Sitecore.Context.Database.GetItem(NewLocation.ToString());
        if (folder == null)
        {
            throw new ArgumentException("The item " + NewLocation.ToString() + " was not found", "cmd.NewLocation");
        }

        var item = Sitecore.Context.Database.GetItem(Item.ToString());
        if (item == null)
        {
            throw new ArgumentException("The item " + Item.ToString() + " was not found", "cmd.Item");
        }

        item.MoveTo(folder);

        return "Moved " + Item.ToString();
    }
}

Again we can write tests to validate both of these. (More of that later)

Creating items

The create command is a bit more complex. It needs to parse out a template identifier, a location and a name. Then it has to optionally parse any other field values that need setting. The parsing code for that is correspondingly longer:

public static Parser<ICommand> CreateCommand =
    from cmd in Parse.IgnoreCase("create").Token()
    from template in PathOrID
    from named in Parse.IgnoreCase("named").Token()
    from itemName in ItemName.Token()
    from under in Parse.IgnoreCase("under").Token()
    from location in PathOrID
    from fieldValues in (
        from with in Parse.IgnoreCase("with").Token()
        from fields in (
            from first in Field
            from rest in Parse.Char(',').Token().Then(_ => Field).Many()
            select first.Concatenate(rest))
        select fields
        ).Optional()
    select new CreateCommand() { Template = template, Name = itemName, Location = location, Fields = fieldValues.GetOrElse(new List<Field>()) };

The first few bits, which parse out the type of command, the template, the name and path are all much as before. The more complex bit comes from parsing the optional set of field values.

Firstly, the from clause for the “with name=value[,etc]” part of the command is wrapped up with an Optional() method to mark that this whole part of the command may not be there.

Within that, it repeats the “one or more of something” pattern of parsing out the first field value, followed by zero or more of “a comma, then another field”. And then it pulls together all of the field definitions into a list, which is used in the select clause to initialise the CreateCommand. Note the use of the GetOrElse() extension method. This says “if the optional part wasn’t there, return an empty list of fields instead”. This is required because Optional() wraps its result in an IOption<T>: fieldValues itself is never null, but when the “with” clause is missing it’s an empty option with no value to read. This is an easy pattern to use in these Linq-style expressions, but take care to unwrap the option before you try to use what’s inside it.
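A cut-down sketch of the Optional() and GetOrElse() pairing may make that clearer. The LetterThenDigits parser here is invented purely for the demonstration and has nothing to do with the command grammar:

```csharp
using System;
using Sprache;

public static class OptionalDemo
{
    // A letter, optionally followed by a run of digits, e.g. "x42" or just "x".
    public static readonly Parser<string> LetterThenDigits =
        from letter in Parse.Letter
        from digits in Parse.Number.Optional()   // digits is an IOption<string>
        select letter + ":" + digits.GetOrElse("none");

    public static void Main()
    {
        Console.WriteLine(LetterThenDigits.Parse("x42")); // prints "x:42"
        Console.WriteLine(LetterThenDigits.Parse("x"));   // prints "x:none"
    }
}
```

In the second case the digits parser matched nothing, so the option is empty and GetOrElse() hands back the fallback value.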

The Field parser needs to deal with two different scenarios. It has to be able to parse a FieldName="value" string where there are no spaces in the name of the field. But some fields do have spaces, hence it needs to be able to parse a quoted field name in the format "Field Name"="value" too. As above, we can do that by building up smaller parsers:

public static Parser<string> QuotedFieldName =
    from openQuote in Parse.Char('"')
    from value in Parse.CharExcept('"').Many().Text()
    from closeQuote in Parse.Char('"')
    select value;

public static Parser<string> UnquotedFieldName =
    // AtLeastOnce() rather than Many(), so that an empty field name is rejected
    from name in Parse.CharExcept(new char[] { '=', ' ' }).AtLeastOnce().Text()
    select name;

public static Parser<Field> Field =
    from name in QuotedFieldName.XOr(UnquotedFieldName)
    from equalSign in Parse.Char('=').Token()
    from openQuote in Parse.Char('"')
    from value in Parse.CharExcept('"').Many().Text()
    from closeQuote in Parse.Char('"') 
    select new Field() { Name = name, Value = value };

The Field type here just stores a name/value pair to describe the field:

public class Field
{
    public string Name { get; set; }
    public string Value { get; set; }
}

The CreateCommand class follows the same patterns as the other commands, but it’s a bit more complex:

public class CreateCommand : ICommand
{
    public ItemIdenitfier Template { get; set; }
    public string Name { get; set; }
    public ItemIdenitfier Location { get; set; }
    public IEnumerable<Field> Fields { get; set; }

    public string Execute()
    {
        TemplateID tid;
        if (Template.Id != Guid.Empty)
        {
            tid = new TemplateID(new ID(Template.Id));
        }
        else
        {
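            var ti = Sitecore.Context.Database.GetTemplate(Template.Path);
            if (ti == null)
            {
                throw new ArgumentException("The template " + Template.Path + " was not found", "cmd.Template");
            }
            tid = new TemplateID(ti.ID);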
        }

        var folder = Sitecore.Context.Database.GetItem(Location.ToString());
        if (folder == null)
        {
            throw new ArgumentException("The item " + Location.ToString() + " was not found", "cmd.Location");
        }

        var item = folder.Add(Name, tid);

        if (Fields != null)
        {
            item.Editing.BeginEdit();
            foreach (var field in Fields)
            {
                item[field.Name] = field.Value;
            }
            item.Editing.EndEdit();
        }

        return "Created " + item.Paths.Path;
    }
}

And as before, tests can run against all these bits of code to verify they work correctly…

Finally, parse any command

Now we’ve got parsers for each of the individual commands, it’s trivial to parse “any command”:

public static Parser<ICommand> Any =
    CreateCommand
    .XOr(MoveCommand)
    .XOr(DeleteCommand);

Any string it parses must match exactly one of our command parsers. And you can write tests to validate this works correctly, like:

[TestMethod]
public void AnyCanParseCreate()
{
    var result = CommandTextParser.Any.TryParse("create \\12 named alpha under {582ccf36-b6e4-49f0-9c35-2d8e40b5ef3d} with a=\"2\"");

    Assert.IsTrue(result.WasSuccessful, result.Message);
    Assert.IsInstanceOfType(result.Value, typeof(CreateCommand));

    var cmd = result.Value as CreateCommand;

    Assert.AreEqual("/12", cmd.Template.Path);
    Assert.AreEqual(Guid.Parse("582ccf36-b6e4-49f0-9c35-2d8e40b5ef3d"), cmd.Location.Id);
    Assert.AreEqual("alpha", cmd.Name);
    Assert.AreEqual(1, cmd.Fields.Count());
    Assert.AreEqual("a", cmd.Fields.First().Name);
}
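The original idea was to take a whole file of these commands and process each one in turn, which the code so far doesn’t quite do. A rough sketch of that last step might look like the following. ScriptRunner and Run are hypothetical names, and it’s written generically over any Sprache parser so that it stands alone; it also assumes one command per line, which the SEO file format never actually promised:

```csharp
using System;
using System.Collections.Generic;
using Sprache;

public static class ScriptRunner
{
    // Parses each non-blank line with `parser` and passes the value to `execute`.
    // Lines that fail to parse produce an error message instead of stopping the run.
    public static IEnumerable<string> Run<T>(string script, Parser<T> parser, Func<T, string> execute)
    {
        var lines = script.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        // End() makes sure nothing is left over after the command on each line.
        var wholeLine = parser.End();

        foreach (var line in lines)
        {
            if (string.IsNullOrWhiteSpace(line))
            {
                continue;
            }

            var result = wholeLine.TryParse(line);
            yield return result.WasSuccessful
                ? execute(result.Value)
                : "Could not parse '" + line + "': " + result.Message;
        }
    }
}
```

With the classes from this post, the call would be something like ScriptRunner.Run(text, CommandTextParser.Any, cmd => cmd.Execute()), with the returned strings giving a simple log of what happened. Execute() can still throw for missing items, so a real version would want to catch that too.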

Wrapping up…

That was the point at which I figured I’d got as much learning out of this example as I could, and moved on to other things.

It’s a contrived example, but hopefully it shows that when you do need to parse text into structured data, there are better choices than regular expressions for C# developers. While I’ll admit that working out how to use the Sprache API can be a bit challenging to begin with, the end result is much more usable, testable and readable than anything a regex will give you. And as a side effect, hopefully the code shows just how easy and helpful the Sitecore.FakeDb project is for enabling unit testing of your API-level code.

In case you’re interested in running the tests yourself, or delving through this code in more detail, I’ve posted the whole test project to GitHub. Feel free to clone and mess about…
