review-deploy["review-deploy<br><br>Helm deploys the Review App using the Cloud<br/>Native images built by the CNG-mirror pipeline.<br><br>Cloud Native images are deployed to the `review-apps-ce` or `review-apps-ee`<br>Kubernetes (GKE) cluster, in the GCP `gitlab-review-apps` project."]
If [Review App Stability](https://gitlab.com/gitlab-org/quality/team-tasks/issues/93) dips, this may be a signal
that the `review-apps-ce/ee` cluster is unhealthy. Leading indicators may be health check failures leading to restarts, or a majority of Review App deployments failing.
The following items may help diagnose this:
- [Instance group CPU Utilization in GCP](https://console.cloud.google.com/compute/instanceGroups/details/us-central1-b/gke-review-apps-ee-preemp-n1-standard-8affc0f5-grp?project=gitlab-review-apps&tab=monitoring&graph=GCE_CPU&duration=P30D) - helpful to identify if nodes are problematic or the entire cluster is trending towards unhealthy
- [Instance Group size in GCP](https://console.cloud.google.com/compute/instanceGroups/details/us-central1-b/gke-review-apps-ee-preemp-n1-standard-8affc0f5-grp?project=gitlab-review-apps&tab=monitoring&graph=GCE_SIZE&duration=P30D) - aids in identifying load spikes on the cluster. Kubernetes will add nodes up to 220 based on total resource requests.
- `kubectl top nodes --sort-by=cpu` - can identify whether CPU spikes are common across nodes, or whether load is concentrated on specific nodes that may get rebalanced by the Kubernetes scheduler.
- `kubectl top pods --sort-by=cpu` - can identify which pods are consuming the most CPU across the cluster.
- [K9s](https://github.com/derailed/k9s) - a powerful command-line dashboard which allows you to filter by labels. This can help identify trends with apps exceeding the [review-app resource requests](https://gitlab.com/gitlab-org/gitlab-ce/blob/master/scripts/review_apps/base-config.yaml). Kubernetes schedules pods to nodes based on resource requests and allows CPU usage up to the limits.
  - In K9s you can sort or add filters by typing the `/` character
    - `-lrelease=<review-app-slug>` - filters down to all pods for a release. This aids in determining what is having issues in a single deployment.
    - `-lapp=<app>` - filters down to all pods for a specific app. This aids in determining resource usage by app.
  - You can scroll to a Kubernetes resource and hit `d` (describe), `s` (shell), or `l` (logs) for a deeper inspection.
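
As a complement to the commands above, the following sketch (a hypothetical helper that only assumes `kubectl` access to the cluster) prints each node's allocated resource summary, which helps spot over-committed nodes:

```shell
# Print the "Allocated resources" section of every node's description,
# showing how much CPU/memory is requested versus allocatable per node.
for node in $(kubectl get nodes -o name); do
  echo "== ${node} =="
  kubectl describe "${node}" | grep -A 5 'Allocated resources'
done
```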
In one past incident, we noticed that the `dns-gitlab-review-app-external-dns` Deployment was in a pending state,
effectively preventing all the Review Apps from getting a DNS record assigned,
making them unreachable via domain name.
This in turn prevented other components of the Review App from starting properly
(e.g. `gitlab-runner`).
After some digging, we found that new mounts were failing when performed
with transient scopes (e.g. for pods) by `systemd-mount`:
```
MountVolume.SetUp failed for volume "dns-gitlab-review-app-external-dns-token-sj5jm" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/06add1c3-87b4-11e9-80a9-42010a800107/volumes/kubernetes.io~secret/dns-gitlab-review-app-external-dns-token-sj5jm --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/06add1c3-87b4-11e9-80a9-42010a800107/volumes/kubernetes.io~secret/dns-gitlab-review-app-external-dns-token-sj5jm
Output: Failed to start transient scope unit: Connection timed out
```
This probably happened because the GitLab chart creates 67 resources, leading to
a lot of mount points being created on the underlying GCP node.
The [underlying issue seems to be a `systemd` bug](https://github.com/kubernetes/kubernetes/issues/57345#issuecomment-359068048)
that was fixed in `systemd` `v237`. Unfortunately, our GCP nodes are currently
using `v232`.
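
To check whether a given node is affected, you can look at its `systemd` version over SSH (a quick check; the version threshold comes from the issue linked above):

```shell
# On the node (over SSH): print the running systemd version.
# Versions below v237 are affected by the transient scope bug.
systemctl --version | head -n1
```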
For the record, the debugging steps to find out this issue were:
1. Switch kubectl context to review-apps-ce (we recommend using [kubectx](https://kubectx.dev/))
1. `kubectl get pods | grep dns`
1. `kubectl describe pod <pod name>` and confirm the exact error message
1. Web search for the exact error message, following the rabbit hole to [a relevant Kubernetes bug report](https://github.com/kubernetes/kubernetes/issues/57345)
1. Access the node over SSH via the GCP console (**Compute Engine > VM instances**)
- Get all pods running a given node: `kubectl get pods --field-selector=spec.nodeName=NODE_NAME`
- Get all the `Running` pods on a given node: `kubectl get pods --field-selector=spec.nodeName=NODE_NAME | grep Running`
- Get all the pods in a bad state on a given node: `kubectl get pods --field-selector=spec.nodeName=NODE_NAME | grep -v 'Running' | grep -v 'Completed'`
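
While on the node, the following sketch can help gauge whether kubelet mount points and transient scope units are piling up (the `run-` unit name pattern is an assumption and may vary across `systemd` versions):

```shell
# Count kubelet volume mount points on this node.
mount | grep -c '/var/lib/kubelet/pods'
# Count transient scope units, which systemd-run creates for each mount.
systemctl list-units --type=scope --all | grep -c 'run-'
```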
#### Solving the problem
To resolve the problem, we needed to (forcibly) drain some nodes:
1. Try a normal drain on the node where the `dns-gitlab-review-app-external-dns` pod runs, so that Kubernetes automatically re-schedules the pod elsewhere (see the sketch after this list)
1. If that doesn't work, you can also perform a forcible "drain" of the node by removing all of its pods: `kubectl delete pods --field-selector=spec.nodeName=NODE_NAME`
1. In the node:
- Perform `systemctl daemon-reload` to remove the dead/inactive units
- If that doesn't solve the problem, perform a hard reboot: `sudo systemctl reboot`
1. Uncordon any cordoned nodes: `kubectl uncordon NODE_NAME`
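
A sketch of the drain sequence described in the list above (`NODE_NAME` is a placeholder; the flags reflect common usage rather than anything prescribed here):

```shell
# Normal drain: cordons the node and evicts its pods so Kubernetes
# re-schedules them onto other nodes.
kubectl drain NODE_NAME --ignore-daemonsets --delete-local-data
# Fallback: forcibly delete every pod scheduled on the node.
kubectl delete pods --field-selector=spec.nodeName=NODE_NAME
# Once the node is healthy again, make it schedulable:
kubectl uncordon NODE_NAME
```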
In parallel, since most Review Apps were in a broken state, we deleted them to
clean up the list of non-`Running` pods.
Following is a command to delete Review Apps based on their last deployment date
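
A sketch of such a command, assuming Helm 2 (where `helm ls -d` sorts releases by last deployment date and `helm delete --purge` removes a release; the `"Jun  4"` date pattern is only an example):

```shell
# Delete all Review App releases last deployed on a given date.
# Adjust the date pattern to match the helm ls output you are targeting.
helm ls -d | grep 'Jun  4' | cut -f1 | xargs -n1 helm delete --purge
```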