More learning about Kubernetes for Sitecore

Last time out I was thinking about some choices around setting up Sitecore in Kubernetes. Since then, I’ve moved on to the more practical task of trying to get the setup to work. And I doubt you’ll be surprised to hear that I’ve met a few new issues… Maybe they’ll save you a bit of time and frustration?

Setting up AKS and ACR

I started off the setup process by taking advantage of the scripts Bart Plasmeijer published after his recent Symposium presentation. I figured they would be a quick way to get myself a working instance of AKS so I could experiment a bit.

But the deployment failed at step four. When it tried to run “helm install nginx-ingress ingress-nginx/ingress-nginx” it timed out, and the script window reported a fairly generic error. It took me a while to work out how to get some details back from AKS about what happened, but eventually I found the “kubectl describe pod <pod-name>” command and got this:

Events:
Type     Reason                  Age    From                   Message
----     ------                  ----   ----                   -------
Normal   Scheduled               39m    default-scheduler      Successfully assigned ingress-basic/nginx-ingress-ingress-nginx-admission-create-x7nc7 to akswin000000
Warning  FailedCreatePodSandBox  39m    kubelet, akswin000000  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod
                                                               "nginx-ingress-ingress-nginx-admission-create-x7nc7": Error response from daemon: container
                                                               0db3bde3668a18bd9bf25a9ce127a094b3b4845461e24950fa3bb31d293a26df encountered an error during
                                                               hcsshim::System::CreateProcess: failure in a Windows system call: The user name or password is incorrect.
                                                               (0x52e) extra info: {"CommandLine":"cmd /S /C pauseloop.exe", "User":"2000", "WorkingDirectory":"C:\\",
                                                               "Environment":{"PATH":"c:\\Windows\\System32;c:\\Windows"}, "CreateStdInPipe":true, "CreateStdOutPipe":true,
                                                               "CreateStdErrPipe":true, "ConsoleSize":[0,0]}

Initially I was worried that I’d broken something in the scripts here, since I’d made a few edits to make them match my needs more closely. So I spent quite a lot of time trying to find helpful stuff in Google, and bugging people on Sitecore Slack. But eventually I realised that this was not actually the first error. Somehow I’d missed another error earlier in the process (at step two) that had scrolled off my console screen when I wasn’t looking. And this turned out to be the important one:

--- Linking AKS to ACR ---
az : ForbiddenError: The client 'my.user@company.com' with object id 'xx56b90b-xxxx-xxxx-xxxx-98ea3aa7xxxx' does not have
authorization to perform action 'Microsoft.Authorization/roleAssignments/write' over scope
'/subscriptions/xxd1df1a-xxxx-xxxx-xxxx-dfdbd355xxxx/resourceGroups/client-k8s/providers/Microsoft.ContainerRegistry/registries/client/providers/Microsoft.Authorization/roleAssignments/xx5f2920-xxxx-xxxx-xxxx-9a8efa47xxxx'
or the scope is invalid. If access was recently granted, please refresh your credentials.
At C:\Users\JDavis\Downloads\Sitecore-Symposium-2020-Containers-AKS-main\2.CreateAKS.ps1:53 char:1
+ az role assignment create --assignee $clientID `
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (ForbiddenError:...ur credentials.:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

It’s saying it cannot connect my AKS instance to my ACR instance because I don’t have sufficient permissions to create the role assignment that links them. So cue a load more Googling and pestering on Slack, and JF Larente pointed out the important thing I’d missed: Microsoft’s Azure documentation says you need to have Owner rights for either the subscription or the ACR itself for this task to work.

Because I was doing this in the company Azure subscription, my account was fairly tightly controlled and I didn’t have Owner rights to anything. So the script failed to connect AKS to my ACR, and hence the Nginx deployment failed for some reason related to this.

Once I got Owner rights applied to my account, tidied up my resource group and ran the scripts again, these errors disappeared.

Pay attention to your node specs!

Another mistake was a classic case of “read other people’s scripts properly before you run them!”. When the scripts got to this part of step two:

# Add windows server nodepool
Write-Host "--- Creating Windows Server Node Pool ---" -ForegroundColor Blue
az aks nodepool add --resource-group $ResourceGroup `
    --cluster-name $AksName `
    --os-type Windows `
    --name win `
    --node-vm-size Standard_D8s_v3 `
    --node-count 1 
Write-Host "--- Complete: Windows Server Node Pool Created ---" -ForegroundColor Green

it failed with an Azure error I’d never seen before: “Operation could not be completed as it results in exceeding approved Total Regional Cores quota”. It turns out that Azure applies default core quotas per region, to stop you spinning up a big pile of expensive VMs and hammering your credit card.

What I’d failed to notice was that the bit of script above adds a “D8s_v3” VM, which has eight cores. Because I already had four cores running in this region, adding this big VM as my new node would have pushed me over the regional quota, so Azure said no.

So lesson learned: Look at the script and check what you’re creating before you start. I changed the VM size being created and went back to work…
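The arithmetic behind that quota error can be sketched in a few lines. The four cores already in use and the eight-core node come from the situation described above; the quota of 10 is purely an assumed figure for illustration, because the real limit depends on your subscription and region. The function name is hypothetical too.

```python
def cores_over_quota(cores_in_use: int, cores_requested: int, regional_quota: int) -> int:
    """Return how many cores a request would exceed the regional quota by (0 if it fits)."""
    return max(0, cores_in_use + cores_requested - regional_quota)


# Four cores already running, plus one eight-core Standard_D8s_v3 node.
# ASSUMPTION: a regional quota of 10 cores. Check the real value for your subscription.
print(cores_over_quota(cores_in_use=4, cores_requested=8, regional_quota=10))  # prints 2

# A four-core VM size would fit under the same assumed quota.
print(cores_over_quota(cores_in_use=4, cores_requested=4, regional_quota=10))  # prints 0
```

To get the real numbers to plug in, something like “az vm list-usage --location <your-region> --output table” should list the current usage and limit for each quota in that region.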

Helm trips me up…

With those issues sorted I was able to get AKS up and running, so I turned my attention to the Helm Charts to get my site deployed to AKS. The first thing that tripped me up here was Solr. For the Docker-hosted development instances of this site I have a customised SolrCloud container which has a setup script that will create all the collections at start-up if required. What I only realised after a load of “Solr isn’t working” pain is that this start-up script was being run by a custom entrypoint in my Docker Compose file. And that meant it needed something similar for Helm so that Kubernetes would run my custom script too.

After a few rounds of googling and a certain amount of trial and error, I worked out that the custom entrypoint:

  solr:
    image: ${REGISTRY}${COMPOSE_PROJECT_NAME}-k8s-xp1-solrcloud:${VERSION:-latest}
    build:
      context: ./containers/build/solrcloud
      args:
        BASE_IMAGE: ${SITECORE_DOCKER_REGISTRY}sitecore-xp1-solr:${SITECORE_VERSION}
    mem_limit: 1GB
    entrypoint: powershell -Command "& C:\Cloud\StartCloud.ps1 c:\solr c:\data"
    volumes:
      - ${LOCAL_DATA_PATH}\solr:c:\data

needs to get replaced with this for Kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: solr
  labels:
    app: solr
spec:
  replicas: 1
  selector:
    matchLabels:
      app: solr
  template:
    metadata:
      labels:
        app: solr
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: solr
        image: client.azurecr.io/my-k8s-xp1-solrcloud:latest
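        # "command" overrides the image's ENTRYPOINT and "args" overrides its CMD,
        # so together they do the job of the "entrypoint" line in the compose file.
        command: ["powershell.exe"]
        args: ["-Command", "& C:\\Cloud\\StartCloud.ps1 c:\\solr c:\\data"]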
        ports:
        - containerPort: 8983
        env:
        - name: SOLR_MODE
          value: solrcloud
      imagePullSecrets:
      - name: sitecore-docker-registry

But then I crashed into a whole new issue. With Solr and SQL services up and running, I tried to run the SQL initialisation job in step seven of Bart’s script. Its job is to create all the databases in SQL. But what I actually got was another failure. See the next section for info about how I found the actual error, but it was complaining about being unable to log in as the SQL “sa” account to create the databases.

This confused me a lot. Kubernetes uses “secrets” as a way to store things like passwords – so surely the SQL service was being started up using the same value for the SA password that the init job was trying to log in with? They are both reading the same secret after all…

The Sitecore deployment config for Kubernetes gives you a folder full of text files for your secrets. The SQL username and password come from “secrets/sitecore-databaseusername.txt” and “secrets/sitecore-databasepassword.txt”. You publish these secrets into your AKS cluster, and the contents of these files get put into the secret store for use by your containers.

When you download them, the username file contains “sa” by default, and the password one is empty. So I’d filled in a suitable password. Opening my copies in a text editor looked fine at first glance:

But after tearing my hair out for a couple of hours I spotted the length field at the bottom left of that image: 3 characters… For a two-character username… So I made whitespace visible in the editor:

And that extra line-feed was the clue to the issue. I went through all my secrets files and got rid of any accidental trailing whitespace, and then pushed the secrets files up to my AKS instance again. With that done, I deleted and recreated SQL, and this time the init job succeeded.
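Checking every secrets file by eye gets tedious, so here is a minimal sketch of how the check and the clean-up could be scripted. It assumes the secrets are the “.txt” files in a single folder, as in the Sitecore deployment config, and that no secret legitimately ends in a space, tab or line break. The function names are my own invention, not part of any Sitecore or Kubernetes tooling.

```python
from pathlib import Path

# Bytes treated as accidental trailing whitespace.
# ASSUMPTION: no real secret value ends with one of these.
WHITESPACE = b" \t\r\n"


def files_with_trailing_whitespace(folder):
    """Return {file name: trailing bytes} for each .txt file that ends in whitespace."""
    found = {}
    for path in sorted(Path(folder).glob("*.txt")):
        raw = path.read_bytes()
        cleaned = raw.rstrip(WHITESPACE)
        if cleaned != raw:
            # Keep the offending bytes so an invisible b"\n" shows up when printed.
            found[path.name] = raw[len(cleaned):]
    return found


def strip_trailing_whitespace(folder):
    """Rewrite the affected files in place and return their names."""
    affected = files_with_trailing_whitespace(folder)
    for name in affected:
        path = Path(folder) / name
        path.write_bytes(path.read_bytes().rstrip(WHITESPACE))
    return list(affected)
```

Calling files_with_trailing_whitespace("secrets") first only reports the problem files, which is a safer starting point than rewriting them straight away. The files are read as raw bytes on purpose: the contents go into the secret exactly as they sit on disk, so a stray line-feed ends up as part of the username or password.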

 
I’m getting closer to a working build-and-deploy process, but I suspect I’ll have some more things to write up in the future…
