---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Review Apps

Review Apps are automatically deployed by [the
pipeline](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/6665).

## How does it work?

### CI/CD architecture diagram

```mermaid
graph TD
  A["build-qa-image, compile-production-assets<br/>(canonical default refs only)"];
  B[review-build-cng];
  C[review-deploy];
  D[CNG-mirror];
  E[review-qa-smoke];

  A -->|once the `prepare` stage is done| B
  B -.->|triggers a CNG-mirror pipeline and waits for it to be done| D
  D -.->|polls until completed| B
  B -->|once the `review-build-cng` job is done| C
  C -->|once the `review-deploy` job is done| E

  subgraph "1. gitlab `prepare` stage"
  A
  end

  subgraph "2. gitlab `review-prepare` stage"
  B
  end

  subgraph "3. gitlab `review` stage"
  C["review-deploy<br><br>Helm deploys the Review App using the Cloud<br/>Native images built by the CNG-mirror pipeline.<br><br>Cloud Native images are deployed to the `review-apps`<br>Kubernetes (GKE) cluster, in the GCP `gitlab-review-apps` project."]
  end

  subgraph "4. gitlab `qa` stage"
  E[review-qa-smoke<br><br>gitlab-qa runs the smoke suite against the Review App.]
  end

  subgraph "CNG-mirror pipeline"
  D>Cloud Native images are built];
  end
```

### Detailed explanation

1. On every [pipeline](https://gitlab.com/gitlab-org/gitlab/pipelines/125315730) during the `prepare` stage, the
   [`compile-production-assets`](https://gitlab.com/gitlab-org/gitlab/-/jobs/641770154) job is automatically started.
   - Once it's done, the [`review-build-cng`](https://gitlab.com/gitlab-org/gitlab/-/jobs/467724808)
     job starts since the [`CNG-mirror`](https://gitlab.com/gitlab-org/build/CNG-mirror) pipeline triggered in the
     following step depends on it.
1. Once `compile-production-assets` is done, the [`review-build-cng`](https://gitlab.com/gitlab-org/gitlab/-/jobs/467724808)
   job [triggers a pipeline](https://gitlab.com/gitlab-org/build/CNG-mirror/pipelines/44364657)
   in the [`CNG-mirror`](https://gitlab.com/gitlab-org/build/CNG-mirror) project.
   - The `review-build-cng` job automatically starts only if your MR includes
     [CI or frontend changes](../pipelines.md#changes-patterns). In other cases, the job is manual.
   - The [`CNG-mirror`](https://gitlab.com/gitlab-org/build/CNG-mirror/pipelines/44364657) pipeline creates the Docker images of
     each component (e.g. `gitlab-rails-ee`, `gitlab-shell`, `gitaly` etc.)
     based on the commit from the [GitLab pipeline](https://gitlab.com/gitlab-org/gitlab/pipelines/125315730) and stores
     them in its [registry](https://gitlab.com/gitlab-org/build/CNG-mirror/container_registry).
   - We use the [`CNG-mirror`](https://gitlab.com/gitlab-org/build/CNG-mirror) project so that the `CNG` (Cloud
     Native GitLab) project's registry is not overloaded with a lot of transient Docker images.
   - Note that the official CNG images are built by the `cloud-native-image`
     job, which runs only for tags, and itself triggers a [`CNG`](https://gitlab.com/gitlab-org/build/CNG) pipeline.
1. Once `review-build-cng` is done, the [`review-deploy`](https://gitlab.com/gitlab-org/gitlab/-/jobs/467724810) job
   deploys the Review App using [the official GitLab Helm chart](https://gitlab.com/gitlab-org/charts/gitlab/) to
   the [`review-apps`](https://console.cloud.google.com/kubernetes/clusters/details/us-central1-b/review-apps?project=gitlab-review-apps)
   Kubernetes cluster on GCP.
   - The actual scripts used to deploy the Review App can be found at
     [`scripts/review_apps/review-apps.sh`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/scripts/review_apps/review-apps.sh).
   - These scripts are basically
     [our official Auto DevOps scripts](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/ci/templates/Auto-DevOps.gitlab-ci.yml) where the
     default CNG images are overridden with the images built and stored in the
     [`CNG-mirror` project's registry](https://gitlab.com/gitlab-org/build/CNG-mirror/container_registry).
   - Since we're using [the official GitLab Helm chart](https://gitlab.com/gitlab-org/charts/gitlab/), this means
     you get a dedicated environment for your branch that's very close to what
     it would look like in production.
1. Once the [`review-deploy`](https://gitlab.com/gitlab-org/gitlab/-/jobs/467724810) job succeeds, you should be able to
   use your Review App thanks to the direct link to it from the MR widget. To log
   into the Review App, see [Log into my Review App](#log-into-my-review-app) below.

**Additional notes:**

- If the `review-deploy` job keeps failing (note that we already retry it twice),
  please post a message in the `#g_qe_engineering_productivity` channel and/or create a `~"Engineering Productivity"` `~"ep::review apps"` `~bug`
  issue with a link to your merge request. Note that the deployment failure can
  reveal an actual problem introduced in your merge request (i.e. this isn't
  necessarily a transient failure)!
- If the `review-qa-smoke` job keeps failing (note that we already retry it twice),
  please check the job's logs: you could discover an actual problem introduced in
  your merge request. You can also download the artifacts to see screenshots of
  the page at the time the failures occurred. If you don't find the cause of the
  failure or if it seems unrelated to your change, please post a message in the
  `#quality` channel and/or create a ~Quality ~bug issue with a link to your
  merge request.
- The manual `review-stop` job can be used to
  stop a Review App manually, and is also started by GitLab once a merge
  request's branch is deleted after being merged.
- The Kubernetes cluster is connected to the `gitlab` projects using the
  [GitLab Kubernetes integration](../../user/project/clusters/index.md). This
  provides a link to the Review App directly from the merge request widget.

### Auto-stopping of Review Apps

Review Apps are automatically stopped 2 days after the last deployment thanks to
the [Environment auto-stop](../../ci/environments/index.md#environments-auto-stop) feature.

If you need your Review App to stay up for a longer time, you can
[pin its environment](../../ci/environments/index.md#auto-stop-example) or retry the
`review-deploy` job to update the "latest deployed at" time.

The `review-cleanup` job, which runs automatically in scheduled
pipelines (and is manual in merge request pipelines), stops stale Review Apps after 5 days,
deletes their environment after 6 days, and cleans up any dangling Helm releases
and Kubernetes resources after 7 days.

The `review-gcp-cleanup` job, which runs automatically in scheduled pipelines
(and is manual in merge request pipelines), removes any dangling GCP network resources
that were not removed along with the Kubernetes resources.

## QA runs

On every [pipeline](https://gitlab.com/gitlab-org/gitlab/pipelines/125315730) in the `qa` stage (which comes after the
`review` stage), the `review-qa-smoke` job is automatically started and it runs
the QA smoke suite.

You can also manually start the `review-qa-all` job: it runs the full QA suite.

## Performance Metrics

On every [pipeline](https://gitlab.com/gitlab-org/gitlab/pipelines/125315730) in the `qa` stage, the
`review-performance` job is automatically started: this job does basic
browser performance testing using a
[Sitespeed.io Container](../../user/project/merge_requests/browser_performance_testing.md).

## Cluster configuration

### Node pools

The `review-apps` cluster is currently set up with
the following node pools:

- `e2-highcpu-16` (16 vCPU, 16 GB memory) pre-emptible nodes with autoscaling

Node pool image type must be `Container-Optimized OS (cos)`, not `Container-Optimized OS with Containerd (cos_containerd)`,
due to this [known issue on GitLab Runner Kubernetes executor](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4755).

### Helm

The Helm version used is defined in the
[`registry.gitlab.com/gitlab-org/gitlab-build-images:gitlab-helm3-kubectl1.14` image](https://gitlab.com/gitlab-org/gitlab-build-images/-/blob/master/Dockerfile.gitlab-helm3-kubectl1.14#L7)
used by the `review-deploy` and `review-stop` jobs.

## How to

### Get access to the GCP Review Apps cluster

You need to [open an access request (internal link)](https://gitlab.com/gitlab-com/access-requests/-/issues/new)
for the `gcp-review-apps-dev` GCP group and role.

This grants you the permissions needed for:

- [Retrieving pod logs](#dig-into-a-pods-logs). Granted by [Viewer (`roles/viewer`)](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles).
- [Running a Rails console](#run-a-rails-console). Granted by [Kubernetes Engine Developer (`roles/container.developer`)](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles), which includes the `container.pods.exec` permission.

### Log into my Review App

For GitLab Team Members only. If you want to sign in to the review app, review
the GitLab handbook information for the [shared 1Password account](https://about.gitlab.com/handbook/security/#1password-for-teams).

- The default username is `root`.
- The password can be found in the 1Password secure note named `gitlab-{ce,ee} Review App's root password`.

### Enable a feature flag for my Review App

1. Open your Review App and log in as documented above.
1. Create a personal access token.
1. Enable the feature flag using the [Feature flag API](../../api/features.md).
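
The last step can be scripted with `curl`. The following is a minimal bash sketch, not an official helper: the function names, the `REVIEW_APP_TOKEN` variable, and the example URL and flag name are all placeholders chosen for illustration. It assumes the token belongs to an administrator (such as `root`), since the Features API is restricted to administrators.

```shell
# Build the Features API endpoint for a flag on a given Review App.
# Both arguments are placeholders that you must supply.
feature_flag_url() {
  local review_app_url="$1" flag_name="$2"
  # Strip one trailing slash so the resulting URL has no double slash.
  printf '%s/api/v4/features/%s\n' "${review_app_url%/}" "$flag_name"
}

# Enable the flag. REVIEW_APP_TOKEN is a hypothetical variable holding the
# personal access token created in the previous step.
enable_feature_flag() {
  local review_app_url="$1" flag_name="$2"
  curl --fail --silent --show-error --request POST \
    --header "PRIVATE-TOKEN: ${REVIEW_APP_TOKEN:?set REVIEW_APP_TOKEN first}" \
    --data "value=true" \
    "$(feature_flag_url "$review_app_url" "$flag_name")"
}

# Usage (placeholder values):
#   REVIEW_APP_TOKEN=<your token> enable_feature_flag "https://<your-review-app-host>" my_feature_flag
```
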
### Find my Review App slug

1. Open the `review-deploy` job.
1. Look for `Checking for previous deployment of review-*`.
1. For instance for `Checking for previous deployment of review-qa-raise-e-12chm0`,
   your Review App slug would be `review-qa-raise-e-12chm0` in this case.
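
If you have saved the job log locally, the same lookup can be done from the command line. This is a small bash sketch; `extract_review_app_slug` and the `job.log` file name are hypothetical, and the pattern assumes the log line has exactly the wording shown above.

```shell
# Print the Review App slug found in a `review-deploy` job log read from stdin.
# Only the first matching line is used.
extract_review_app_slug() {
  sed -n 's/.*Checking for previous deployment of \(review-[A-Za-z0-9-]*\).*/\1/p' | head -n 1
}

# Usage (job.log is a placeholder for a downloaded job log):
#   extract_review_app_slug < job.log
```
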

### Run a Rails console

1. Make sure you [have access to the cluster](#get-access-to-the-gcp-review-apps-cluster) and the `container.pods.exec` permission first.
1. [Filter Workloads by your Review App slug](https://console.cloud.google.com/kubernetes/workload?project=gitlab-review-apps),
   e.g. `review-qa-raise-e-12chm0`.
1. Find and open the `task-runner` Deployment, e.g. `review-qa-raise-e-12chm0-task-runner`.
1. Click on the Pod in the "Managed pods" section, e.g. `review-qa-raise-e-12chm0-task-runner-d5455cc8-2lsvz`.
1. Click on the `KUBECTL` dropdown, then `Exec` -> `task-runner`.
1. Replace `-c task-runner -- ls` with `-it -- gitlab-rails console` in the
   default command, or:
   - Run `kubectl exec --namespace review-apps review-qa-raise-e-12chm0-task-runner-d5455cc8-2lsvz -it -- gitlab-rails console`, and
   - Replace `review-qa-raise-e-12chm0-task-runner-d5455cc8-2lsvz`
     with your Pod's name.
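
If you prefer to stay in a terminal, the Pod lookup and the `kubectl exec` call can be combined. The following bash sketch is an illustration only: the function names are hypothetical, and it assumes that Pods live in the `review-apps` namespace and follow the `<slug>-task-runner-<suffix>` naming shown in the examples above.

```shell
# Read `kubectl get pods -o name` output on stdin and print the first
# task-runner Pod that belongs to the given Review App slug.
pick_task_runner_pod() {
  local slug="$1"
  sed 's|^pod/||' | grep -- "^${slug}-task-runner-" | head -n 1
}

# Open a Rails console in that Pod (requires cluster access and `kubectl`).
rails_console() {
  local slug="$1" pod
  pod="$(kubectl get pods --namespace review-apps -o name | pick_task_runner_pod "$slug")"
  [ -n "$pod" ] || { echo "No task-runner Pod found for ${slug}" >&2; return 1; }
  kubectl exec --namespace review-apps "$pod" -it -- gitlab-rails console
}

# Usage:
#   rails_console review-qa-raise-e-12chm0
```
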

### Dig into a Pod's logs

1. Make sure you [have access to the cluster](#get-access-to-the-gcp-review-apps-cluster) and the `container.pods.getLogs` permission first.
1. [Filter Workloads by your Review App slug](https://console.cloud.google.com/kubernetes/workload?project=gitlab-review-apps),
   e.g. `review-qa-raise-e-12chm0`.
1. Find and open the `migrations` Deployment, e.g.
   `review-qa-raise-e-12chm0-migrations.1`.
1. Click on the Pod in the "Managed pods" section, e.g.
   `review-qa-raise-e-12chm0-migrations.1-nqwtx`.
1. Click on the `Container logs` link.

## Diagnosing unhealthy Review App releases

If [Review App Stability](https://app.periscopedata.com/app/gitlab/496118/Engineering-Productivity-Sandbox?widget=6690556&udv=785399)
dips, this may be a signal that the `review-apps-ce/ee` cluster is unhealthy.
Leading indicators may be health check failures leading to restarts, or a majority of Review App deployments failing.

The [Review Apps Overview dashboard](https://console.cloud.google.com/monitoring/classic/dashboards/6798952013815386466?project=gitlab-review-apps&timeDomain=1d)
helps identify load spikes on the cluster, and whether individual nodes are problematic or the entire cluster is trending towards unhealthy.

### Release failed with `ImagePullBackOff`

**Potential cause:**

If you see an `ImagePullBackOff` status, check for a missing Docker image.

**Where to look for further debugging:**

To check that the Docker images were created, run the following Docker command:

```shell
DOCKER_CLI_EXPERIMENTAL=enabled docker manifest inspect <repository>:<tag>
```

The output of this command indicates if the Docker image exists. For example:

```shell
DOCKER_CLI_EXPERIMENTAL=enabled docker manifest inspect registry.gitlab.com/gitlab-org/build/cng-mirror/gitlab-rails-ee:39467-allow-a-release-s-associated-milestones-to-be-edited-thro
```

If the Docker image does not exist:

- Verify that the `image.repository` and `image.tag` options in the `helm upgrade --install` command match the repository names used by the CNG-mirror pipeline.
- Look further in the corresponding downstream CNG-mirror pipeline in the `review-build-cng` job.
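
To check several components at once, the command above can be wrapped in a loop. This is a bash sketch under assumptions: the function names are hypothetical, the registry path is copied from the example above, and the component names are the ones mentioned earlier on this page. A "missing" result can also mean that Docker could not authenticate or reach the registry, so treat it as a hint rather than proof.

```shell
# Compose an image reference in the CNG-mirror registry.
cng_mirror_image() {
  local component="$1" tag="$2"
  printf 'registry.gitlab.com/gitlab-org/build/cng-mirror/%s:%s\n' "$component" "$tag"
}

# Report, for each given component, whether an image exists for TAG.
check_cng_images() {
  local tag="$1"; shift
  local component image
  for component in "$@"; do
    image="$(cng_mirror_image "$component" "$tag")"
    if DOCKER_CLI_EXPERIMENTAL=enabled docker manifest inspect "$image" >/dev/null 2>&1; then
      echo "found:   $image"
    else
      echo "missing: $image"
    fi
  done
}

# Usage (replace <tag> with the tag used by your Review App):
#   check_cng_images <tag> gitlab-rails-ee gitlab-shell gitaly
```
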

### Node count is always increasing (i.e. never stabilizing or decreasing)

**Potential cause:**

That could be a sign that the `review-cleanup` job is
failing to clean up stale Review Apps and Kubernetes resources.

**Where to look for further debugging:**

Look at the latest `review-cleanup` job log, and look for any
unexpected failure.

### p99 CPU utilization is at 100% for most of the nodes and/or many components

**Potential cause:**

This could be a sign that Helm is failing to deploy Review Apps. When Helm has a
lot of `FAILED` releases, CPU utilization seems to increase, probably
due to Helm or Kubernetes trying to recreate the components.

**Where to look for further debugging:**

Look at a recent `review-deploy` job log.

**Useful commands:**

```shell
# Identify whether load spikes affect all nodes or only specific ones (which the Kubernetes scheduler may rebalance)
kubectl top nodes | sort --key 3 --numeric

# Identify pods under heavy CPU load
kubectl top pods | sort --key 2 --numeric
```

### The `logging/user/events/FailedMount` chart is going up

**Potential cause:**

This could be a sign that there are too many stale secrets and/or configuration maps.

**Where to look for further debugging:**

Look at [the list of Configurations](https://console.cloud.google.com/kubernetes/config?project=gitlab-review-apps)
or `kubectl get secret,cm --sort-by='{.metadata.creationTimestamp}' | grep 'review-'`.

Any secrets or configuration maps older than 5 days are suspect and should be deleted.

**Useful commands:**

```shell
# List secrets and config maps ordered by created date
kubectl get secret,cm --sort-by='{.metadata.creationTimestamp}' | grep 'review-'

# Delete all secrets that are 5 to 9 days old
kubectl get secret --sort-by='{.metadata.creationTimestamp}' | grep '^review-' | grep '[5-9]d$' | cut -d' ' -f1 | xargs kubectl delete secret

# Delete all secrets that are 10 to 99 days old
kubectl get secret --sort-by='{.metadata.creationTimestamp}' | grep '^review-' | grep '[1-9][0-9]d$' | cut -d' ' -f1 | xargs kubectl delete secret

# Delete all config maps that are 5 to 9 days old
kubectl get cm --sort-by='{.metadata.creationTimestamp}' | grep 'review-' | grep -v 'dns-gitlab-review-app' | grep '[5-9]d$' | cut -d' ' -f1 | xargs kubectl delete cm

# Delete all config maps that are 10 to 99 days old
kubectl get cm --sort-by='{.metadata.creationTimestamp}' | grep 'review-' | grep -v 'dns-gitlab-review-app' | grep '[1-9][0-9]d$' | cut -d' ' -f1 | xargs kubectl delete cm
```
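
The `grep '[5-9]d$'` filters above rely on the `AGE` column being printed as a whole number of days. Some `kubectl` versions may print ages such as `5d3h`, which those patterns would skip. As an alternative, the following bash sketch parses the day count with `awk`. `older_than_days` is a hypothetical helper, and it assumes that the name is the first column and the age is the last column, as in the default `kubectl get` output for a single resource type.

```shell
# Read `kubectl get <type> --no-headers` output on stdin and print the names
# (first column) of `review-` resources whose AGE (last column) is at least
# MIN_DAYS days. Ages shorter than a day (e.g. "23h") never match.
older_than_days() {
  local min_days="$1"
  awk -v min="$min_days" '
    $1 ~ /^review-/ && $NF ~ /^[0-9]+d/ {
      days = $NF
      sub(/d.*/, "", days)   # "12d" -> "12", "5d3h" -> "5"
      if (days + 0 >= min) print $1
    }'
}

# Usage: review the list first, then pipe it to `xargs kubectl delete secret`.
#   kubectl get secret --no-headers | older_than_days 5
```

For configuration maps, you would still need to exclude `dns-gitlab-review-app` entries as the commands above do.
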

### Using K9s

[K9s](https://github.com/derailed/k9s) is a powerful command line dashboard which allows you to filter by labels. This can help identify trends with apps exceeding the [review-app resource requests](https://gitlab.com/gitlab-org/gitlab/-/blob/master/scripts/review_apps/base-config.yaml). Kubernetes schedules pods to nodes based on resource requests and allows for CPU usage up to the limits.

- In K9s you can sort or add filters by typing the `/` character:
  - `-lrelease=<review-app-slug>` - filters down to all pods for a release. This helps determine what is having issues in a single deployment.
  - `-lapp=<app>` - filters down to all pods for a specific app. This helps determine resource usage by app.
- You can scroll to a Kubernetes resource and hit `d` (describe), `s` (shell), or `l` (logs) for a deeper inspection.

![K9s](img/k9s.png)
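
The same label filters also work with plain `kubectl`. The following bash sketch wraps them in two hypothetical helper functions. It assumes that the Review Apps run in the `review-apps` namespace and that the `release` and `app` labels are set as in the K9s filters above.

```shell
# List the pods of one Review App release (same filter as `-lrelease=<slug>` in K9s).
# Extra arguments are passed through to kubectl.
pods_for_release() {
  local slug="$1"; shift
  kubectl get pods --namespace review-apps --selector "release=${slug}" "$@"
}

# Show CPU and memory usage of all pods of one app (same filter as `-lapp=<app>`).
usage_for_app() {
  local app="$1"; shift
  kubectl top pods --namespace review-apps --selector "app=${app}" "$@"
}

# Usage:
#   pods_for_release review-qa-raise-e-12chm0
#   usage_for_app <app> --containers
```
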

### Troubleshoot a pending `dns-gitlab-review-app-external-dns` Deployment

#### Finding the problem

[In the past](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/62834), it happened
that the `dns-gitlab-review-app-external-dns` Deployment was in a pending state,
effectively preventing all the Review Apps from getting a DNS record assigned,
making them unreachable via domain name.

This in turn prevented other components of the Review App from starting properly
(e.g. `gitlab-runner`).

After some digging, we found that new mounts were failing when being performed
with transient scopes (e.g. pods) of `systemd-mount`:

```plaintext
MountVolume.SetUp failed for volume "dns-gitlab-review-app-external-dns-token-sj5jm" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/06add1c3-87b4-11e9-80a9-42010a800107/volumes/kubernetes.io~secret/dns-gitlab-review-app-external-dns-token-sj5jm --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/06add1c3-87b4-11e9-80a9-42010a800107/volumes/kubernetes.io~secret/dns-gitlab-review-app-external-dns-token-sj5jm
Output: Failed to start transient scope unit: Connection timed out
```

This probably happened because the GitLab chart creates 67 resources, leading to
a lot of mount points being created on the underlying GCP node.

The [underlying issue seems to be a `systemd` bug](https://github.com/kubernetes/kubernetes/issues/57345#issuecomment-359068048)
that was fixed in `systemd` `v237`. Unfortunately, our GCP nodes are currently
using `v232`.

For the record, the debugging steps that uncovered this issue were:

1. Switch kubectl context to review-apps-ce (we recommend using [kubectx](https://github.com/ahmetb/kubectx/))
1. `kubectl get pods | grep dns`
1. `kubectl describe pod <pod name>` & confirm exact error message
1. Web search for exact error message, following rabbit hole to [a relevant Kubernetes bug report](https://github.com/kubernetes/kubernetes/issues/57345)
1. Access the node over SSH via the GCP console (**Compute Engine > VM
   instances** then click the "SSH" button for the node where the `dns-gitlab-review-app-external-dns` pod runs)
1. In the node: `systemctl --version` => `systemd 232`
1. Gather some more information:
   - `mount | grep kube | wc -l` => e.g. 290
   - `systemctl list-units --all | grep -i var-lib-kube | wc -l` => e.g. 142
1. Check how many pods are in a bad state:
   - Get all pods running on a given node: `kubectl get pods --field-selector=spec.nodeName=NODE_NAME`
   - Get all the `Running` pods on a given node: `kubectl get pods --field-selector=spec.nodeName=NODE_NAME | grep Running`
   - Get all the pods in a bad state on a given node: `kubectl get pods --field-selector=spec.nodeName=NODE_NAME | grep -v 'Running' | grep -v 'Completed'`
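
Instead of running one `grep` per state, the last step can be condensed into a per-status count. This is a small bash sketch; `pod_status_summary` is a hypothetical helper, and it assumes the default `kubectl get pods` column order, where `STATUS` is the third column.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the number
# of pods per STATUS (third column), sorted by status name.
pod_status_summary() {
  awk '{ count[$3]++ } END { for (s in count) print count[s], s }' | LC_ALL=C sort -k2
}

# Usage:
#   kubectl get pods --no-headers --field-selector=spec.nodeName=NODE_NAME | pod_status_summary
```
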

#### Solving the problem

To resolve the problem, we needed to (forcibly) drain some nodes:

1. Try a normal drain on the node where the `dns-gitlab-review-app-external-dns`
   pod runs so that Kubernetes automatically moves it to another node: `kubectl drain NODE_NAME`
1. If that doesn't work, you can also forcibly "drain" the node by removing all pods: `kubectl delete pods --field-selector=spec.nodeName=NODE_NAME`
1. In the node:
   - Perform `systemctl daemon-reload` to remove the dead/inactive units
   - If that doesn't solve the problem, perform a hard reboot: `sudo systemctl reboot`
1. Uncordon any cordoned nodes: `kubectl uncordon NODE_NAME`

In parallel, since most Review Apps were in a broken state, we deleted them to
clean up the list of non-`Running` pods.
The following command deletes Review Apps based on their last deployment date
(the current date was June 6th at the time):

```shell
helm ls -d | grep "Jun 4" | cut -f1 | xargs helm delete --purge
```
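
The command above uses Helm 2 syntax: the `--purge` flag no longer exists in Helm 3, where `helm uninstall` removes a release. A rough Helm 3 equivalent could look like the following bash sketch. `releases_updated_on` is a hypothetical helper, and the sketch assumes that `helm list` prints the release name in the first column and the update date in `YYYY-MM-DD` form, so check the output of `helm list` on your version before deleting anything.

```shell
# Read `helm list` output on stdin and print the names (first column) of
# releases whose row contains the given date string, skipping the header row.
releases_updated_on() {
  local date="$1"
  awk -v d="$date" 'NR > 1 && index($0, d) { print $1 }'
}

# Usage: review the list first, then pipe it to
# `xargs helm uninstall --namespace review-apps`.
#   helm list --namespace review-apps --date | releases_updated_on "2020-06-04"
```
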

#### Mitigation steps taken to avoid this problem in the future

We've created a new node pool with smaller machines to reduce the risk
that a machine reaches the "too many mount points" problem in the future.

## Frequently Asked Questions

**Isn't it too much to trigger CNG image builds on every test run? This creates
thousands of unused Docker images.**

> We have to start somewhere and improve later. Also, we're using the
> CNG-mirror project to store these Docker images so that we can just wipe out
> the registry at some point, and use a new fresh, empty one.

**How do we secure this from abuse? Apps are open to the world so we need to
find a way to limit it to only us.**

> This isn't enabled for forks.

## Other resources

- [Review Apps integration for CE/EE (presentation)](https://docs.google.com/presentation/d/1QPLr6FO4LduROU8pQIPkX1yfGvD13GEJIBOenqoKxR8/edit?usp=sharing)
- [Stability issues](https://gitlab.com/gitlab-org/quality/team-tasks/-/issues/212)

### Helpful command line tools

- [K9s](https://github.com/derailed/k9s) - provides a CLI dashboard across pods and enables filtering by labels
- [Stern](https://github.com/wercker/stern) - enables cross pod log tailing based on label/field selectors

---

[Return to Testing documentation](index.md)