Cleaning up Azure Container Registry

Goal: remove unwanted images from the container registry to save space

Problem: how do we determine whether an image is safe to delete?

Tag strategy: in our case all images are tagged as 2.1.[build-num]-[commit-sha]-[branch-name], so an image may have tags like 2.1.23-9c4b0a9-feature1 for the feature1 branch or 2.1.24-3e158f8-main for the main branch
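
Since everything below relies on this tag format, here is a minimal sketch of parsing it (the regex is an assumption based on the examples above):

$tag = '2.1.23-9c4b0a9-feature1'
# hypothetical parser: captures build number, short commit sha and branch name
if ($tag -match '^2\.1\.(?<build>\d+)-(?<sha>[0-9a-f]+)-(?<branch>.+)$') {
    $Matches.build  # 23
    $Matches.sha    # 9c4b0a9
    $Matches.branch # feature1
}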

In our case we are going to keep:

  • images with latest tag
  • images used in any cluster
  • last 10 images for main branch
  • last 10 images for any branch
  • images younger than 3 months

Here is a blueprint for the script

$registry = 'mactemp' # mactemp.azurecr.io

# Step 1: retrieve used images from all clusters
# kubectl get all -A -o json | jq -r '.. | .image? // empty' | sort | uniq | wc -l
$used = @()
foreach($cluster in @('dev', 'test', 'prod')) {
    kubectx $cluster
    foreach($item in kubectl get all -A -o json | ConvertFrom-Json | Select-Object -ExpandProperty items) { # -A to cover all namespaces, as in the jq one-liner above
        # grep every "image" property from the resource json; -AllMatches so resources with several containers contribute all their images
        $used += $item | ConvertTo-Json -Depth 100 | Select-String -Pattern '"image": "(.+)"' -AllMatches | Select-Object -ExpandProperty Matches | Select-Object -ExpandProperty Groups | Where-Object Name -EQ 1 | Select-Object -ExpandProperty Value
    }
}

$used = $used | Sort-Object -Unique
$used = $used | Where-Object { $_ -match "$($registry).azurecr.io/" }
$used = $used | Where-Object { $_ -notmatch '#{' } # had a few broken deployments where Octopus variables like '#{Octopus.Action.Package[my-app].PackageVersion}' were not substituted
$used.Count
if ($used.Count -lt 600) {
    Write-Host "Unexpected low number of images $($used.Count)" -ForegroundColor Red
    exit 1
}
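
Side note: if images of running pods are enough for you (an assumption - this ignores pod templates of scaled-to-zero workloads), a lighter alternative to the regex dance is kubectl's jsonpath output:

# space-separated images of all running pods, split and deduplicated
$used = kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' | ForEach-Object { $_ -split '\s+' } | Sort-Object -Unique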

# Step 2: cleanup container registry
# Restrictions:
# - keep used
# - keep latest
# - keep last 10 tags for any branch
# - keep last 10 tags for main branch
# - keep last 3 months
$repositories = az acr repository list -n $registry | ConvertFrom-Json
foreach($repository in $repositories) {
    $tags = az acr repository show-tags -n $registry --repository $repository --orderby time_desc --detail | ConvertFrom-Json

    $lastTenTagsAnyBranch = $tags | Select-Object -First 10 -ExpandProperty name -Unique
    $lastTenTagsMainBranch = $tags | Where-Object { ($_.name -match '-main$') -or ($_.name -match '-master$') } | Select-Object -First 10 -ExpandProperty name -Unique
    $tagsToKeep = @($lastTenTagsAnyBranch) + @($lastTenTagsMainBranch) # wrap in @() so a single result stays an array instead of concatenating as strings

    foreach($tag in $tags) {
        if ($tag.name.Contains('latest')) {
            Write-Host "$($repository):$($tag.name) - keep latest" -ForegroundColor Cyan
            continue
        }
        if ("$($registry).azurecr.io/$($repository):$($tag.name)" -in $used) {
            Write-Host "$($repository):$($tag.name) - keep used" -ForegroundColor Cyan
            continue
        }
        if ($tag.name -in $tagsToKeep) {
            Write-Host "$($repository):$($tag.name) - keep last 10" -ForegroundColor Cyan
            continue
        }
        $age = [int]((Get-Date).Subtract([DateTime]$tag.createdTime).Days) # explicit cast in case createdTime comes back as a string
        if ($age -lt 90) {
            Write-Host "$($repository):$($tag.name) - keep young ($($age) days old)" -ForegroundColor Cyan
            continue
        }

        try {
            # az acr repository delete -n $registry --image "$($repository):$($tag.name)" --yes
            Write-Host "$($repository):$($tag.name) - deleted" -ForegroundColor Green
        } catch {
            Write-Host "$($repository):$($tag.name) - failed" -ForegroundColor Red
        }
    }
}
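
One caveat worth knowing: az acr repository delete --image removes the underlying manifest, so every other tag pointing at the same digest disappears with it (note also that the delete above is commented out, so the script is a dry run by default). If you want to remove just a single tag and keep the manifest, there is untag:

# removes only the tag, leaving the manifest and its other tags in place
az acr repository untag -n $registry --image "$($repository):$($tag.name)"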

So nothing fancy here; technically we may even run it as a scheduled GitHub action

But the question is: how can we get the list of all used images from outside the clusters?

There is an opportunity for yet another trick

We are going to deploy a small deployment backed by a PowerShell script (as described in the FastCGI notes)

It will run under an elevated service account, so it will be able to list all images in the cluster

And for authentication purposes we will allow tokens signed by either Azure or GitHub

Which means that from GitHub Actions we will request an OIDC token and just call our API in all clusters to receive all images from everywhere, without needing to be inside the cluster

And what is cool - no need to share/rotate/revoke any secrets - everything is "passwordless"

First of all we need a custom entrypoint to install FastCGI and enable it

#!/bin/sh

# install the CGI bridge (fcgiwrap + spawn-fcgi) and the tools our script needs
apk add fcgiwrap spawn-fcgi curl powershell

# download the latest stable kubectl (dl.k8s.io is the official endpoint, the old storage.googleapis.com bucket is deprecated)
curl -sLO "https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
mv ./kubectl /usr/bin/kubectl

# spawn fcgiwrap on the unix socket nginx will talk to
/usr/bin/spawn-fcgi -s /var/run/fcgiwrap.socket -M 766 /usr/bin/fcgiwrap

We are installing the wanted tools (powershell, kubectl, FastCGI itself) and spawning FastCGI

The idea behind all this is that we do not even need a dedicated Docker image, so there is nothing to build or publish; we assemble the whole thing by configuring existing tools in the right order

The next step is to configure nginx

server {
  listen 80;
  root   /usr/share/nginx/html;

  location / {
    index  index.ps1;
    fastcgi_pass unix:/var/run/fcgiwrap.socket;
    fastcgi_param HTTP_AUTHORIZATION $http_authorization;
    fastcgi_param SCRIPT_FILENAME /usr/share/nginx/html/index.ps1;
  }
}

Note how we manually pass the HTTP Authorization header to our script; it will be available as the corresponding environment variable, and our goal is to validate the incoming request's authorization before giving the response (read below)

And finally our script boilerplate

#!/usr/bin/env pwsh

Write-Host "Content-Type: text/plain"
Write-Host ""
Write-Host "Hello, auth: $($env:HTTP_AUTHORIZATION)"

Now, to deploy all this, we need to prepare the manifests

First of all we are going to need a service account that is allowed to list resources; this is needed to retrieve the images used by those resources

service-account.yml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: docker-images

cluster-role-binding.yml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: docker-images
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: ServiceAccount
    name: docker-images
    namespace: production

Note: do not forget to change the namespace and probably the actual role, I just used a prebuilt one
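
To verify that the binding actually grants what we need, kubectl can impersonate the service account (adjust the production namespace to yours):

kubectl auth can-i list pods --as=system:serviceaccount:production:docker-images
# yes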

Our deployment

deployment.yml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: docker-images
  labels:
    app: docker-images
spec:
  replicas: 1
  selector:
    matchLabels:
      app: docker-images
  template:
    metadata:
      labels:
        app: docker-images
    spec:
      serviceAccountName: docker-images
      nodeSelector:
        kubernetes.io/os: linux
      containers:
        - name: docker-images
          image: nginx:alpine
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 10m
              memory: 64Mi
            limits:
              cpu: 500m
              memory: 1536Mi
          volumeMounts:
            - name: docker-images
              mountPath: /docker-entrypoint.d/00-entrypoint.sh
              subPath: entrypoint.sh
            - name: docker-images
              mountPath: /etc/nginx/conf.d/default.conf
              subPath: default.conf
            - name: docker-images
              mountPath: /usr/share/nginx/html/index.ps1
              subPath: index.ps1
      volumes:
        - name: docker-images
          configMap:
            name: docker-images
            defaultMode: 0777

Nothing fancy here; the nginx image automatically runs all scripts under /docker-entrypoint.d, so we mount our entrypoint there, and the two other files are self-explanatory

service.yml

For demo purposes we are creating a LoadBalancer service, but in real life you would want ClusterIP and run it behind an ingress

apiVersion: v1
kind: Service
metadata:
  name: docker-images
spec:
  type: LoadBalancer
  selector:
    app: docker-images
  ports:
    - port: 80

And finally

kustomization.yml

---
resources:
  - service-account.yml
  - cluster-role-binding.yml
  - deployment.yml
  - service.yml
configMapGenerator:
  - name: docker-images
    files:
    - entrypoint.sh
    - default.conf
    - index.ps1
generatorOptions:
  disableNameSuffixHash: true
  labels:
    app: docker-images
commonLabels:
  app: docker-images
commonAnnotations:
  owner: [email protected]
  repository: https://github.com/mac2000/notes

After applying all this, we should be able to curl the service and it should respond with the hello message

kubectl apply -k .
kubectl get svc docker-images
curl -H "Authorization: Bearer ACME" http://20.103.153.129/whatever

and you should see

Hello, auth: Bearer ACME

Notes:

  • if it is not working, double check paths and file names everywhere; in my case I forgot to rename index.sh to index.ps1 here and there
  • if it still does not work, try running it locally like so docker run -it --rm -p 8080:80 -v "$PWD/entrypoint.sh:/docker-entrypoint.d/00-entrypoint.sh" -v "$PWD/default.conf:/etc/nginx/conf.d/default.conf" -v "$PWD/index.ps1:/usr/share/nginx/html/index.ps1" nginx:alpine

Now that the boilerplate is ready, the only thing left is to modify our index.ps1 so that it will:

  • verify the incoming authorization header
  • if everything is ok, respond with the used images

The second part is already done in the script at the beginning, so let's start from it:

#!/usr/bin/env pwsh

if (-not $env:HTTP_AUTHORIZATION) {
  Write-Host "Status: 401"
  Write-Host "Content-Type: text/plain"
  Write-Host ""
  Write-Host "Unauthorized"
  return
}
# TODO: verify $env:HTTP_AUTHORIZATION

Write-Host "Content-Type: application/json"
Write-Host ""
$used = @()
foreach($item in kubectl get all -A -o json | ConvertFrom-Json | Select-Object -ExpandProperty items) {
  # -AllMatches so resources with several containers contribute all their images
  $used += $item | ConvertTo-Json -Depth 100 | Select-String -Pattern '"image": "(.+)"' -AllMatches | Select-Object -ExpandProperty Matches | Select-Object -ExpandProperty Groups | Where-Object Name -EQ 1 | Select-Object -ExpandProperty Value
}
$used = $used | Sort-Object -Unique
$used | ConvertTo-Json | Out-Host

And here is where everything got screwed up

PowerShell is not only slow but also eats memory; just to give you an idea, even 1 GB of RAM is not enough, which is way too much for such a small script

My idea was to later introduce the following snippet of JWT verification

$token = $env:HTTP_AUTHORIZATION -replace '^Bearer ', '' # strip the scheme prefix, Split('.') below expects the raw token

# extract header as json object
$header = $token.Split('.')[0]
$header = $header.Replace('-', '+').Replace('_', '/')
$header = $header.PadRight($header.Length + (3 - (($header.Length + 3) % 4)), '=')
$header = [Convert]::FromBase64String($header)
$header = [System.Text.Encoding]::UTF8.GetString($header)
$header = ConvertFrom-Json($header)
# $header # {"typ":"JWT","alg":"RS256","x5t":"eBZ_cn3sXYAd0ch4THBKHIgOwOE","kid":"78167F727DEC5D801DD1C8784C704A1C880EC0E1"}

# extract payload as json object
$payload = $token.Split('.')[1]
$payload = $payload.Replace('-', '+').Replace('_', '/')
$payload = $payload.PadRight($payload.Length + (3 - (($payload.Length + 3) % 4)), '=')
$payload = [Convert]::FromBase64String($payload)
$payload = [System.Text.Encoding]::UTF8.GetString($payload)
$payload = ConvertFrom-Json($payload)
# $payload # {"iss":"https://token.actions.githubusercontent.com","sub":"repo:mac2000/token:ref:refs/heads/main"...

# signature bytes
$signature = $token.Split('.')[2]
$signature = $signature.Replace('-', '+').Replace('_', '/')
$signature = $signature.PadRight($signature.Length + (3 - (($signature.Length + 3) % 4)), '=')
$signature = [Convert]::FromBase64String($signature)
# $signature # array of bytes

# retrieve key info used to sign the token
$issuer = $payload.iss # https://sts.windows.net/695e64b5-2d13-4ea8-bb11-a6fda2d60c41/
$jwks = Invoke-RestMethod "$($issuer.TrimEnd('/'))/.well-known/openid-configuration" | Select-Object -ExpandProperty jwks_uri # https://login.windows.net/common/discovery/keys
$keys = Invoke-RestMethod $jwks | Select-Object -ExpandProperty keys
$key = $keys | Where-Object kid -EQ $header.kid # {"kty":"RSA","use":"sig","kid":"-KI3Q9nNR7bRofxmeZoXqbHZGew","n":"base64url(modulus)","e":"base64url(exponent)"...


$rsa = [System.Security.Cryptography.RSA]::Create() # RSACryptoServiceProvider is Windows-only, RSA.Create() also works on Linux
$modulus = $key.n
$modulus = $modulus.Replace('-', '+').Replace('_', '/')
$modulus = $modulus.PadRight($modulus.Length + (3 - (($modulus.Length + 3) % 4)), '=')
$modulus = [Convert]::FromBase64String($modulus)
$exponent = $key.e
$exponent = $exponent.Replace('-', '+').Replace('_', '/')
$exponent = $exponent.PadRight($exponent.Length + (3 - (($exponent.Length + 3) % 4)), '=')
$exponent = [Convert]::FromBase64String($exponent)
$p = New-Object System.Security.Cryptography.RSAParameters
$p.Modulus = $modulus
$p.Exponent = $exponent
$rsa.ImportParameters($p)

# verify signature - we check that the token signature was produced over "header.payload" by the issuer's private key; if it verifies, the token is valid
$valid = $rsa.VerifyData([System.Text.Encoding]::UTF8.GetBytes(($token.Split('.')[0] + '.' + $token.Split('.')[1])), $signature, [System.Security.Cryptography.HashAlgorithmName]::SHA256, [System.Security.Cryptography.RSASignaturePadding]::Pkcs1)

if ($valid) {
    Write-Output "JWT token signature is valid"
} else {
    Write-Output "JWT token signature is not valid"
}

This snippet can be used to verify incoming JWT tokens
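
By the way, the decode-base64url dance above is repeated five times (header, payload, signature, modulus, exponent), so if this snippet ever goes to production it is worth factoring it out into a helper, something like:

function ConvertFrom-Base64Url([string]$s) {
    # base64url uses '-' and '_' instead of '+' and '/' and drops the '=' padding
    $s = $s.Replace('-', '+').Replace('_', '/')
    $s = $s.PadRight($s.Length + (3 - (($s.Length + 3) % 4)), '=')
    [Convert]::FromBase64String($s)
}

$header = [System.Text.Encoding]::UTF8.GetString((ConvertFrom-Base64Url $token.Split('.')[0])) | ConvertFrom-Json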

The beauty of this approach is that we have full control, and it does not matter whether the token comes from Azure, Google, or, in our case, GitHub

Also we can describe as many rules as we want
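
A sketch of such rules (the allow-list values here are illustrative assumptions, adjust them to your tenants and repositories); besides the signature we should at least check expiry and who exactly is calling:

# signature alone is not enough - also check the standard claims
$now = [DateTimeOffset]::UtcNow.ToUnixTimeSeconds()
if ($payload.exp -lt $now) { throw 'token expired' }

# per-issuer rules, as many as we want
if ($payload.iss -eq 'https://token.actions.githubusercontent.com') {
    if ($payload.aud -ne 'docker-images') { throw 'unexpected audience' } # matches getIDToken('docker-images') below
    if ($payload.repository_owner -ne 'mac2000') { throw 'unknown repository owner' }
}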

As for GitHub Actions, it will be something like

# note: requesting an OIDC token requires `permissions: id-token: write` on the job
- uses: actions/github-script@v6
  with:
    script: |
      core.getIDToken('docker-images')
        .then(token => fetch('https://docker-images.mac-blog.org.ua', {headers: {Authorization: `Bearer ${token}`}}))
        .then(res => res.json())
        .then(images => console.log(images))

So we can talk to our cluster without sharing secrets and retrieve images

But in the current implementation the resource usage is not acceptable

An alternative approach may be to run kubectl proxy, something like this:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mactemp
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mactemp
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: ServiceAccount
    name: mactemp
    namespace: production
---
apiVersion: v1
kind: Pod
metadata:
  name: mactemp
  labels:
    app: mactemp
spec:
  serviceAccountName: mactemp
  nodeSelector:
    kubernetes.io/os: linux
  containers:
    - name: mactemp
      image: bitnami/kubectl
      command: ["kubectl", "proxy", "--address=0.0.0.0", "--port=8001", "--reject-methods='POST,PUT,PATCH,DELETE'"]
      ports:
        - containerPort: 8001
      resources:
        limits:
          cpu: 500m
          memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: mactemp
spec:
  type: LoadBalancer
  selector:
    app: mactemp
  ports:
    - port: 80
      targetPort: 8001

And it indeed works, but it exposes much more than needed: even though the access is read-only, a caller can now read the whole Kubernetes API (e.g. curl http://<service-ip>/api/v1/pods lists every pod in the cluster), which is much more than the wanted images

Another alternative would be to do all this with a good old bash script; the hardest part would be the token validation, while the image retrieval itself may be done as easily as

index.sh

#!/usr/bin/env bash
echo "Content-Type: text/plain"
echo ""
kubectl get all -A -o json | jq -r '.. | .image? // empty' | sort | uniq

but verification of the JWT token in bash is a nightmare

Yet another approach is to use oauth2-proxy; the downside here is that it supports only one provider, which makes it a one-way ticket, i.e. it will work only with GitHub tokens and there will be no option to call it with Azure tokens

So here is the deployment with an oauth2-proxy sidecar

apiVersion: apps/v1
kind: Deployment
metadata:
  name: docker-images
  labels:
    app: docker-images
spec:
  replicas: 1
  selector:
    matchLabels:
      app: docker-images
  template:
    metadata:
      labels:
        app: docker-images
    spec:
      serviceAccountName: docker-images
      nodeSelector:
        kubernetes.io/os: linux
      containers:
        - name: docker-images
          image: nginx:alpine
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 10m
              memory: 64Mi
            limits:
              cpu: 500m
              memory: 1024Mi
          volumeMounts:
            - name: docker-images
              mountPath: /docker-entrypoint.d/00-entrypoint.sh
              subPath: entrypoint.sh
            - name: docker-images
              mountPath: /etc/nginx/conf.d/default.conf
              subPath: default.conf
            - name: docker-images
              mountPath: /usr/share/nginx/html/index.sh
              subPath: index.sh
        - name: auth
          image: bitnami/oauth2-proxy
          imagePullPolicy: IfNotPresent
          args:
            # GITHUB
            # ------
            # # restrictions (repository_owner=rabotaua, repository=rabotaua/mactemp, ref=refs/heads/main, sha=616c..., repository_visibility=private, actor=mac2000, workflow=main)
            # - "--oidc-groups-claim=repository_owner"
            # - "--allowed-group=rabotaua"
            # # allowed audience
            # - "--client-id=acr"
            # # allowed issuer
            # - "--oidc-issuer-url=https://token.actions.githubusercontent.com"
            # # act as jwt verifier
            # - "--skip-jwt-bearer-tokens=true"
            # # respond on success
            # - "--upstream=http://localhost"
            # # default is 127.0.0.1:4180
            # - "--http-address=0.0.0.0:4180"
            # # rest is required but values does not matter
            # - "--standard-logging=true"
            # - "--auth-logging=true"
            # - "--request-logging=true"
            # - "--provider=oidc"
            # - "--client-secret=whatever"
            # - "--email-domain=*"
            # - "--cookie-secret=BFNF3nGJmIzVxojA8g68kbZwowEVQr9wKICF-LsTJTs="
            # - "--upstream-timeout=60s"
            # - "--force-json-errors=true"
            # AZURE
            # -----
            # restrictions (allow only members of the given Azure AD group)
            - "--oidc-groups-claim=groups"
            - "--allowed-group=82281f89-d39e-4203-b9b4-a388a1361ac7"
            # allowed audience
            - "--client-id=https://management.core.windows.net/"
            # allowed issuer
            - "--oidc-issuer-url=https://sts.windows.net/695e64b5-2d13-4ea8-bb11-a6fda2d60c41/"
            # act as jwt verifier
            - "--skip-jwt-bearer-tokens=true"
            # respond on success
            - "--upstream=http://127.0.0.1:80"
            # default is 127.0.0.1:4180
            - "--http-address=0.0.0.0:4180"
            # rest is required but values does not matter
            - "--standard-logging=true"
            - "--auth-logging=true"
            - "--request-logging=true"
            - "--provider=oidc"
            - "--client-secret=whatever"
            - "--email-domain=*"
            - "--cookie-secret=BFNF3nGJmIzVxojA8g68kbZwowEVQr9wKICF-LsTJTs="
            - "--upstream-timeout=60s"
            - "--force-json-errors=true"
          ports:
            - containerPort: 4180
          resources:
            requests:
              cpu: 10m
              memory: 64Mi
            limits:
              cpu: 500m
              memory: 128Mi
      volumes:
        - name: docker-images
          configMap:
            name: docker-images
            defaultMode: 0777

Note: for test purposes I configured Azure instead of GitHub, so I can test it like so

curl -H "Authorization: Bearer $(az account get-access-token --query=accessToken -o tsv)" http://51.105.108.100:4180/

Also a few small adjustments are required, like the allowed groups or repositories, and it is important to increase the default upstream timeout, which is not enough

But still, even with the plain bash script, in Prometheus I see spikes of up to 500 MB of RAM usage, which drives me crazy, arghhh...

But technically speaking the goal is achieved: from now on we can deploy this and create a GitHub action that will traverse all clusters with its token, receive the used images, and then delete the unwanted ones from the container registry

Wondering how it might look in Go, and whether it is possible to run a Go server without building an image