Google Anthos with Terraform and Kubestack

For a project, I'm currently evaluating Google Anthos. Since the client is a multi-national company, the idea is to build a multi-region and multi-cloud Kubernetes platform. But the sensible kind, no need to bring the pitchforks: different applications, each in one region and one cloud. Not the mythical one application in multiple regions and clouds, where complexity and necessity are often entirely disproportionate to each other. But that's not really the point here.

Google sales is pushing Anthos hard. And the client's team is open to the argument that an opinionated stack might help them move faster compared with having to first evaluate various alternatives for each part of the stack and then building the know-how to productize this custom stack. It's a fair argument to make.

Long story short, we're now evaluating Anthos with GKE and EKS clusters connected to it, because some application teams are drawn to AWS and some are drawn to Google Cloud with their respective workloads. Individual reasons for this are pretty diverse, ranging from data stickiness like terabytes of data already in S3 to quota/capacity limits of specific types of GPUs or preferring certain managed services from one provider over the other cloud provider's alternative.

I tend to agree that this kind of multi-cloud strategy makes a lot of sense. Yes, individual apps may still end up locked-in to one vendor. But at least it's not all eggs in one basket, which has real benefits both for blast radius and for pricing negotiations, if you're big enough.

I've been working on this evaluation for a couple of days now and thought I'd share my experience, because I couldn't find a lot of hands-on reports about Anthos with Terraform. Most content seemed primarily hypothetical, as if most writers hadn't actually gotten their hands properly dirty before writing about it. I already washed mine to calm down and make sure this doesn't end up in some obnoxious rant.

The first thing that really surprised me about Anthos, though, is that it does not provision the Kubernetes clusters for you. I totally expected it would. Instead, you have to provision the clusters yourself and then connect them to the Anthos hub. Which basically requires some IAM setup, an Anthos hub membership and running an agent inside each cluster.

Anthos leaves it up to you to provision clusters any way you like. But since I'm part of the project, it may not come as a surprise that in our case, infrastructure as code and Terraform are the plan of attack.

Now, Google does provide its own Anthos Terraform modules. But these are only for GKE, meaning for EKS we'd need to use modules from another source, leaving us to deal with different module conventions and update schedules.

But more importantly, Google's Terraform modules constantly shell out to the kubectl or gcloud CLIs. Which I consider a last resort that should be avoided at all costs in Terraform modules, because of long-term maintainability. Frankly, calling CLI commands like this has no place in declarative infrastructure as far as I'm concerned.

Unsurprisingly, my biased proposal is to use Kubestack to provision the GKE and EKS clusters leveraging the Kubestack framework's unified GKE and EKS modules, and to write a custom module to connect the resulting clusters to Anthos. The bespoke module would fully integrate the required IAM, Anthos and Kubernetes resources into the Terraform state and lifecycle instead of calling kubectl and gcloud like the official Google modules do.

Below is the current work-in-progress state of the experimental module and some of the challenges I've hit so far.
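For context, the snippets below use local.project_id and local.cluster_name. Here's a minimal sketch of how the experimental module could receive those values as inputs; the variable names are just my working assumption, not a finalized interface:

variable "project_id" {
  type        = string
  description = "Google Cloud project the cluster gets registered in"
}

variable "cluster_name" {
  type        = string
  description = "Name of the GKE or EKS cluster to attach to the Anthos hub"
}

locals {
  project_id   = var.project_id
  cluster_name = var.cluster_name
}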

The first requirement is an IAM identity and role for the agent inside the cluster. For GKE clusters, workload identities can be used, but for non-GKE clusters, EKS in our case, it seems shared credentials in the form of a service account key are the only option. Creating google_service_account, google_project_iam_member and google_service_account_key resources is easy enough. I'm sure this is overly simplified and I may have to add more roles as my evaluation continues.

resource "google_service_account" "current" {
  project = local.project_id
  account_id   = local.cluster_name
  display_name = "${local.cluster_name} gke-connect agent"
}

resource "google_project_iam_member" "current" {
  project = local.project_id
  role    = "roles/gkehub.connect"
  member  = "serviceAccount:${google_service_account.current.email}"
}

resource "google_service_account_key" "current" {
  service_account_id = google_service_account.current.name
}

The next step is to register the cluster as a member of the Anthos hub. Which means adding a google_gke_hub_membership resource.

resource "google_gke_hub_membership" "current" {
  provider = google-beta

  project = local.project_id
  membership_id = local.cluster_name
  description = "${local.cluster_name} hub membership"
}

Finally, the agent needs to be provisioned inside the cluster and set up to use the service account as its identity.

By default, joining the cluster to the hub and provisioning the Kubernetes resources of the agent on the cluster is done via the gcloud beta container hub memberships register CLI command. But the command has a --manifest-output-file parameter that allows writing the Kubernetes resources to a file instead of applying them to the cluster directly.
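For reference, generating the agent manifests looks roughly like the command below. The membership name, context and file paths are placeholders, and apart from --manifest-output-file the exact set of flags is from memory, so treat it as an assumption rather than a copy-paste recipe:

gcloud beta container hub memberships register eks-zero \
  --context=eks-zero \
  --kubeconfig=kubeconfig-eks-zero \
  --service-account-key-file=creds-gcp.json \
  --manifest-output-file=upstream_manifest/anthos.yaml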

To not also have to fall back to calling the gcloud register command from Terraform, I opted to write the manifests to a YAML file and use them as the base that I patch in a kustomization_overlay using my Kustomization provider.

This way, each individual Kubernetes resource of the Anthos agent is provisioned and tracked using Terraform, while at the same time I can use the attributes from my service account and service account key resources to configure the agent.

data "kustomization_overlay" "current" {
  namespace = "gke-connect"

  resources = [
      "${path.module}/upstream_manifest/anthos.yaml"
  ]

  secret_generator {
    name = "creds-gcp"
    type = "replace"
    literals = [
      "creds-gcp.json=${base64decode(google_service_account_key.current.private_key)}"
    ]
  }

  patches {
    # this breaks if the order of env vars in the upstream YAML changes
    patch = <<-EOF
      - op: replace
        path: /spec/template/spec/containers/0/env/6/value
        value: "//gkehub.googleapis.com/projects/xxxxxxxxxxxx/locations/global/memberships/${local.cluster_name}"
    EOF

    target = {
      group = "apps"
      version = "v1"
      kind = "Deployment"
      name = "gke-connect-agent-20210514-00-00"
      namespace = "gke-connect"
    }
  }
}
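For completeness, the data source only builds the manifests. To actually provision and track them, they still have to be looped over with kustomization_resource resources, following the provider's usual ids and manifests pattern; a sketch, not necessarily the final loop I'll end up with:

resource "kustomization_resource" "current" {
  for_each = data.kustomization_overlay.current.ids

  manifest = data.kustomization_overlay.current.manifests[each.value]
}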

The manifests gcloud writes to disk can't be committed to version control because they include a Kubernetes secret with the plaintext service account key embedded. The key file is unfortunately a required parameter of the hub memberships register command. So I had to delete this secret from the YAML file. And I have to remember to do this whenever I rerun the command to update my base with the latest upstream manifests.

In the kustomization_overlay data source, I then use a secret_generator to create a Kubernetes secret using the private key from the google_service_account_key resource.

Additionally, the agent has a number of environment variables set. The URL to the hub memberships resource is one of them and needs to be patched with the respective cluster name. Unfortunately, the environment variables are set directly in the pod template. So the patch will break if the number or order of environment variables changes. It would be better to change this to envFrom and set the environment variables dynamically in the overlay using a config_map_generator. But the downside of this is, again, that there's one more modification to the upstream YAML which has to be repeated every time it is updated.
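To illustrate that alternative: assuming the upstream Deployment were changed to read its environment from a ConfigMap via envFrom, the overlay could generate that ConfigMap roughly like this. Both the ConfigMap name and the variable key below are placeholders I made up, not the agent's actual configuration:

  # inside the kustomization_overlay data source shown above
  config_map_generator {
    # hypothetical name, has to match the envFrom reference added to the upstream YAML
    name = "gke-connect-agent-env"
    literals = [
      # placeholder key, the agent's real environment variable names differ
      "MEMBERSHIP_NAME=//gkehub.googleapis.com/projects/xxxxxxxxxxxx/locations/global/memberships/${local.cluster_name}"
    ]
  }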

While we're on the topic of updates: one thing that makes me suspicious is that the generated YAML has a date as part of its resource names, e.g. gke-connect-agent-20210514-00-00. Call me a pessimist, but I totally expect this to become a problem with updates in the future.

Ignoring that for now, the next step in my evaluation was to apply my Terraform configuration and hopefully have my clusters connected to Anthos.

Unfortunately, on the first try, that wasn't quite the case. The clusters did show up in the Anthos UI, but with a big red unreachable warning. As it turned out, this was due to the agent pod crash-looping with a permission denied error.

kubectl -n gke-connect logs gke-connect-agent-20210514-00-00-66d94cff9d-tzw5t
2021/07/02 11:40:33.277997 connect_agent.go:17: GKE Connect Agent. Log timestamps in UTC.
2021/07/02 11:40:33.298969 connect_agent.go:21: error creating tunnel: unable to retrieve namespace "kube-system" to be used as externalID: namespaces "kube-system" is forbidden: User "system:serviceaccount:gke-connect:connect-agent-sa" cannot get resource "namespaces" in API group "" in the namespace "kube-system"

Which is weird, because from reading the gcloud generated YAML I remember there being plenty of RBAC related resources included. Digging in, it turned out the generated YAML has a Role and a RoleBinding. And if you've followed carefully, you've probably guessed the issue already. Here's the respective part of the generated resources:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    hub.gke.io/project: "ie-gcp-poc"
    version: 20210514-00-00
  name: gke-connect-namespace-getter
  namespace: kube-system
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    hub.gke.io/project: "ie-gcp-poc"
    version: 20210514-00-00
  name: gke-connect-namespace-getter
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gke-connect-namespace-getter
subjects:
- kind: ServiceAccount
  name: connect-agent-sa
  namespace: gke-connect

Unless I'm terribly wrong here, this obviously can't work. Creating a namespaced Role and RoleBinding inside the kube-system namespace cannot grant permissions to get the kube-system namespace, because namespaces are not namespaced resources.

So I changed the Role to a ClusterRole and the RoleBinding to a ClusterRoleBinding and reapplied my Terraform configuration. And I now have a running agent that establishes a tunnel to the Anthos control plane and prints lots of log messages. I have yet to dig into what it actually does there.
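For reference, the cluster-scoped replacements look like this, with the same rules and subject as the generated resources above:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    hub.gke.io/project: "ie-gcp-poc"
    version: 20210514-00-00
  name: gke-connect-namespace-getter
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    hub.gke.io/project: "ie-gcp-poc"
    version: 20210514-00-00
  name: gke-connect-namespace-getter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gke-connect-namespace-getter
subjects:
- kind: ServiceAccount
  name: connect-agent-sa
  namespace: gke-connect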

With the RBAC fix, the generated YAML already requires three changes I have to maintain over time. I can't say I'm particularly excited about that. I also wonder if the generated YAML is only broken when the --manifest-output-file parameter is used, or if the RBAC configuration is also broken when applying the Kubernetes resources to the cluster directly using the gcloud CLI.

That's it for my evaluation of Google Anthos, Terraform and Kubestack so far. Maybe by sharing my findings, I can save somebody out there a bit of time in their own evaluation when they hit the same issues.

The next step for me is to look into provisioning the Anthos Service Mesh. It's not quite clear to me yet if that should be done via Terraform; the fact that Google has a kubectl based Terraform module for it may suggest so. But why wouldn't I do everything after the cluster is connected to Anthos using Anthos Config Management?
