---
stage: Enablement
group: Geo
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
type: howto
---

CAUTION: **Caution:**
This runbook is in **alpha**. For complete, production-ready documentation, see the
[disaster recovery documentation](index.md).

# Disaster Recovery (Geo) promotion runbooks **(PREMIUM ONLY)**

## Geo planned failover runbook 1

| Component   | Configuration   |
| ----------- | --------------- |
| PostgreSQL  | Omnibus-managed |
| Geo site    | Single-node     |
| Secondaries | One             |

This runbook will guide you through a planned failover of a single-node Geo site
with one secondary. The following general architecture is assumed:

```mermaid
graph TD
  subgraph main[Geo deployment]
    subgraph Primary[Primary site]
      Node_1[(GitLab node)]
    end
    subgraph Secondary1[Secondary site]
      Node_2[(GitLab node)]
    end
  end
```

This guide will result in the following:

1. An offline primary.
1. A promoted secondary that is now the new primary.

What is not covered:

1. Re-adding the old **primary** as a secondary.
1. Adding a new secondary.

### Preparation

NOTE: **Note:**
Before following any of these steps, make sure you have `root` access to the
**secondary** node in order to promote it, because there is no automated way to
promote a Geo replica and perform a failover.

On the **secondary** node, navigate to the **Admin Area > Geo** dashboard to
review its status. Replicated objects (shown in green) should be close to 100%,
and there should be no failures (shown in red). If a large proportion of
objects aren't yet replicated (shown in gray), consider giving the node more
time to complete.

![Replication status](img/replication-status.png)

If any objects are failing to replicate, this should be investigated before
scheduling the maintenance window. After a planned failover, anything that
failed to replicate will be **lost**.

You can use the
[Geo status API](../../../api/geo_nodes.md#retrieve-project-sync-or-verification-failures-that-occurred-on-the-current-node)
to review failed objects and the reasons for failure.
A common cause of replication failures is missing data on the
**primary** node. You can resolve these failures by restoring the data from backup,
or by removing references to the missing data.
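
For example, a minimal query of that endpoint from the command line might look like
the following sketch. The token and hostname are placeholders; check the linked API
page for the exact path and options available in your version:

```shell
curl --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://<secondary_host>/api/v4/geo_nodes/current/failures"
```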

The maintenance window won't end until Geo replication and verification are
completely finished. To keep the window as short as possible, you should
ensure these processes are as close to 100% as possible during active use.

If the **secondary** node is still replicating data from the **primary** node,
follow these steps to avoid unnecessary data loss:

1. Until a [read-only mode](https://gitlab.com/gitlab-org/gitlab/-/issues/14609)
   is implemented, you must manually prevent updates to the **primary** node.
   Note that your **secondary** node still needs read-only
   access to the **primary** node during the maintenance window.

1. At the scheduled time, using your cloud provider or your node's firewall, block
   all HTTP, HTTPS and SSH traffic to/from the **primary** node, **except** for your IP and
   the **secondary** node's IP.

   For instance, you can run the following commands on the **primary** node:

   ```shell
   # Allow SSH, HTTP, and HTTPS only from your IP and the secondary node; reject all other sources.
   sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 22 -j ACCEPT
   sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 22 -j ACCEPT
   sudo iptables -A INPUT -p tcp --destination-port 22 -j REJECT

   sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 80 -j ACCEPT
   sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 80 -j ACCEPT
   sudo iptables -A INPUT -p tcp --destination-port 80 -j REJECT

   sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 443 -j ACCEPT
   sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 443 -j ACCEPT
   sudo iptables -A INPUT -p tcp --destination-port 443 -j REJECT
   ```

   From this point, users will be unable to view their data or make changes on the
   **primary** node. They will also be unable to log in to the **secondary** node.
   However, existing sessions will work for the remainder of the maintenance period, and
   public data will be accessible throughout.

1. Verify the **primary** node is blocked to HTTP traffic by visiting it in a browser
   from another IP address. The server should refuse the connection.
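
   For instance, you can perform the same check from the command line on a machine whose
   IP is not in the allow list. The hostname below is a placeholder for your **primary**
   node's URL:

   ```shell
   curl --verbose "https://<primary_host>"
   # Expect "Connection refused" (or a timeout), not an HTTP response.
   ```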

1. Verify the **primary** node is blocked to Git over SSH traffic by attempting to pull an
   existing Git repository with an SSH remote URL. The server should refuse the
   connection.
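
   For example, from a working copy whose `origin` remote points at the **primary** node
   over SSH (the remote name is an assumption; adjust to your setup):

   ```shell
   git pull origin
   # Expect an SSH connection error such as "Connection refused", not a successful pull.
   ```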

1. On the **primary** node, disable non-Geo periodic background jobs by navigating
   to **Admin Area > Monitoring > Background Jobs > Cron**, clicking `Disable All`,
   and then clicking `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
   This job will re-enable several other cron jobs that are essential for planned
   failover to complete successfully.

1. Finish replicating and verifying all data:

   CAUTION: **Caution:**
   Not all data is automatically replicated. Read more about
   [what is excluded](planned_failover.md#not-all-data-is-automatically-replicated).

   1. If you are manually replicating any
      [data not managed by Geo](../replication/datatypes.md#limitations-on-replicationverification),
      trigger the final replication process now.
   1. On the **primary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
      and wait for all queues except those with `geo` in the name to drop to 0.
      These queues contain work that has been submitted by your users; failing over
      before it is completed will cause the work to be lost.
   1. On the **primary** node, navigate to **Admin Area > Geo** and wait for the
      following conditions to be true of the **secondary** node you are failing over to:

      - All replication meters reach 100% replicated, 0% failures.
      - All verification meters reach 100% verified, 0% failures.
      - Database replication lag is 0ms.
      - The Geo log cursor is up to date (0 events behind).
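
      As a cross-check, replication status can also be reviewed from the command line on
      the **secondary** node. This is a supplementary check, not part of the official
      runbook steps:

      ```shell
      sudo gitlab-rake geo:status
      # Look for 100% sync and verification figures and a log cursor that is 0 events behind.
      ```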

   1. On the **secondary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
      and wait for all the `geo` queues to drop to 0 queued and 0 running jobs.
   1. On the **secondary** node, use [these instructions](../../raketasks/check.md)
      to verify the integrity of CI artifacts, LFS objects, and uploads in file
      storage.
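
      For example, the linked page describes integrity checks that can be run as Rake
      tasks on the **secondary** node. Confirm the exact task names there for your
      version before relying on them:

      ```shell
      sudo gitlab-rake gitlab:artifacts:check
      sudo gitlab-rake gitlab:lfs:check
      sudo gitlab-rake gitlab:uploads:check
      ```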

   At this point, your **secondary** node will contain an up-to-date copy of everything the
   **primary** node has, meaning nothing will be lost when you fail over.

1. In this final step, you need to permanently disable the **primary** node.

   CAUTION: **Caution:**
   When the **primary** node goes offline, there may be data saved on the **primary** node
   that has not been replicated to the **secondary** node. This data should be treated
   as lost if you proceed.

   TIP: **Tip:**
   If you plan to [update the **primary** domain DNS record](index.md#step-4-optional-updating-the-primary-domain-dns-record),
   you may wish to lower the TTL now to speed up propagation.
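
   To see the TTL currently served for a record, you can query DNS directly. The domain
   below is a placeholder for your **primary** domain:

   ```shell
   dig +noall +answer gitlab.example.com
   # The second column of each answer line is the remaining TTL in seconds.
   ```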

   When performing a failover, we want to avoid a split-brain situation where
   writes can occur in two different GitLab instances. So to prepare for the
   failover, you must disable the **primary** node:

   - If you have SSH access to the **primary** node, stop and disable GitLab:

     ```shell
     sudo gitlab-ctl stop
     ```

     Prevent GitLab from starting up again if the server unexpectedly reboots:

     ```shell
     sudo systemctl disable gitlab-runsvdir
     ```

     NOTE: **Note:**
     (**CentOS only**) In CentOS 6 or older, there is no easy way to prevent GitLab from being
     started if the machine reboots (see [Omnibus GitLab issue #3058](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/3058)).
     It may be safest to uninstall the GitLab package completely with `sudo yum remove gitlab-ee`.

     NOTE: **Note:**
     (**Ubuntu 14.04 LTS**) If you are using an older version of Ubuntu
     or any other distribution based on the Upstart init system, you can prevent GitLab
     from starting if the machine reboots by running, as `root`:
     `initctl stop gitlab-runsvdir && echo 'manual' > /etc/init/gitlab-runsvdir.override && initctl reload-configuration`.

   - If you do not have SSH access to the **primary** node, take the machine offline and
     prevent it from rebooting. Since there are many ways you may prefer to accomplish
     this, we will avoid a single recommendation. You may need to:

     - Reconfigure the load balancers.
     - Change DNS records (for example, point the **primary** DNS record to the **secondary**
       node in order to stop usage of the **primary** node).
     - Stop the virtual servers.
     - Block traffic through a firewall.
     - Revoke object storage permissions from the **primary** node.
     - Physically disconnect a machine.

### Promoting the **secondary** node

Note the following when promoting a secondary:

- A new **secondary** should not be added at this time. If you want to add a new
  **secondary**, do this after you have completed the entire process of promoting
  the **secondary** to the **primary**.
- If you encounter an `ActiveRecord::RecordInvalid: Validation failed: Name has already been taken`
  error during this process, read
  [the troubleshooting advice](../replication/troubleshooting.md#fixing-errors-during-a-failover-or-when-promoting-a-secondary-to-a-primary-node).

To promote the secondary node:

1. SSH in to your **secondary** node and log in as root:

   ```shell
   sudo -i
   ```

1. Edit `/etc/gitlab/gitlab.rb` to reflect its new status as **primary** by
   removing any lines that enabled the `geo_secondary_role`:

   ```ruby
   ## In pre-11.5 documentation, the role was enabled as follows. Remove this line.
   geo_secondary_role['enable'] = true

   ## In 11.5+ documentation, the role was enabled as follows. Remove this line.
   roles ['geo_secondary_role']
   ```

1. Run the following command to list out all preflight checks and to automatically
   check that replication and verification are complete, ensuring the planned
   failover goes smoothly:

   ```shell
   gitlab-ctl promotion-preflight-checks
   ```

1. Promote the **secondary**:

   ```shell
   gitlab-ctl promote-to-primary-node
   ```

   If you have already run the [preflight checks](planned_failover.md#preflight-checks)
   or don't want to run them, you can skip them:

   ```shell
   gitlab-ctl promote-to-primary-node --skip-preflight-check
   ```

   You can also promote the secondary node to primary **without any further confirmation**,
   even when preflight checks fail:

   ```shell
   sudo gitlab-ctl promote-to-primary-node --force
   ```

1. Verify you can connect to the newly promoted **primary** node using the URL used
   previously for the **secondary** node.
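
   For example, an unauthenticated request to the sign-in page should now return an HTTP
   response rather than a connection error. The hostname below is a placeholder for your
   old **secondary** (now **primary**) URL:

   ```shell
   curl --head "https://<secondary_host>/users/sign_in"
   # Expect an HTTP status line such as "HTTP/2 200".
   ```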

If successful, the **secondary** node has now been promoted to the **primary** node.

### Next steps

To regain geographic redundancy as quickly as possible, you should
[add a new **secondary** node](../setup/index.md). To
do that, you can re-add the old **primary** as a new secondary and bring it back
online.