195 lines
7.9 KiB
Markdown
195 lines
7.9 KiB
Markdown
---
|
|
type: reference
|
|
---
|
|
|
|
# Working with the bundled Consul service **(PREMIUM ONLY)**
|
|
|
|
As part of its High Availability stack, GitLab Premium includes a bundled version of [Consul](https://www.consul.io/) that can be managed through `/etc/gitlab/gitlab.rb`.
|
|
|
|
A Consul cluster consists of multiple server agents, as well as client agents that run on other nodes which need to talk to the Consul cluster.
|
|
|
|
## Prerequisites
|
|
|
|
First, make sure to [download/install](https://about.gitlab.com/install/)
|
|
GitLab Omnibus **on each node**.
|
|
|
|
Choose an installation method, then make sure you complete steps:
|
|
|
|
1. Install and configure the necessary dependencies.
|
|
1. Add the GitLab package repository and install the package.
|
|
|
|
When installing the GitLab package, do not supply `EXTERNAL_URL` value.
|
|
|
|
## Configuring the Consul nodes
|
|
|
|
On each Consul node perform the following:
|
|
|
|
1. Make sure you collect [`CONSUL_SERVER_NODES`](database.md#consul-information), which are the IP addresses or DNS records of the Consul server nodes, for the next step, before executing the next step.
|
|
|
|
1. Edit `/etc/gitlab/gitlab.rb` replacing values noted in the `# START user configuration` section:
|
|
|
|
```ruby
|
|
# Disable all components except Consul
|
|
roles ['consul_role']
|
|
|
|
# START user configuration
|
|
# Replace placeholders:
|
|
#
|
|
# Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z
|
|
# with the addresses gathered for CONSUL_SERVER_NODES
|
|
consul['configuration'] = {
|
|
server: true,
|
|
retry_join: %w(Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z)
|
|
}
|
|
|
|
# Disable auto migrations
|
|
gitlab_rails['auto_migrate'] = false
|
|
#
|
|
# END user configuration
|
|
```
|
|
|
|
> `consul_role` was introduced with GitLab 10.3
|
|
|
|
1. [Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes
|
|
to take effect.
|
|
|
|
### Consul checkpoint
|
|
|
|
Before moving on, make sure Consul is configured correctly. Run the following
|
|
command to verify all server nodes are communicating:
|
|
|
|
```shell
|
|
/opt/gitlab/embedded/bin/consul members
|
|
```
|
|
|
|
The output should be similar to:
|
|
|
|
```plaintext
|
|
Node Address Status Type Build Protocol DC
|
|
CONSUL_NODE_ONE XXX.XXX.XXX.YYY:8301 alive server 0.9.2 2 gitlab_consul
|
|
CONSUL_NODE_TWO XXX.XXX.XXX.YYY:8301 alive server 0.9.2 2 gitlab_consul
|
|
CONSUL_NODE_THREE XXX.XXX.XXX.YYY:8301 alive server 0.9.2 2 gitlab_consul
|
|
```
|
|
|
|
If any of the nodes isn't `alive` or if any of the three nodes are missing,
|
|
check the [Troubleshooting section](#troubleshooting) before proceeding.
|
|
|
|
## Operations
|
|
|
|
### Checking cluster membership
|
|
|
|
To see which nodes are part of the cluster, run the following on any member in the cluster
|
|
|
|
```shell
|
|
$ /opt/gitlab/embedded/bin/consul members
|
|
Node Address Status Type Build Protocol DC
|
|
consul-b XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul
|
|
consul-c XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul
|
|
consul-c XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul
|
|
db-a XX.XX.X.Y:8301 alive client 0.9.0 2 gitlab_consul
|
|
db-b XX.XX.X.Y:8301 alive client 0.9.0 2 gitlab_consul
|
|
```
|
|
|
|
Ideally all nodes will have a `Status` of `alive`.
|
|
|
|
### Restarting the server cluster
|
|
|
|
NOTE: **Note:**
|
|
This section only applies to server agents. It is safe to restart client agents whenever needed.
|
|
|
|
If it is necessary to restart the server cluster, it is important to do this in a controlled fashion in order to maintain quorum. If quorum is lost, you will need to follow the Consul [outage recovery](#outage-recovery) process to recover the cluster.
|
|
|
|
To be safe, we recommend you only restart one server agent at a time to ensure the cluster remains intact.
|
|
|
|
For larger clusters, it is possible to restart multiple agents at a time. See the [Consul consensus document](https://www.consul.io/docs/internals/consensus.html#deployment-table) for how many failures it can tolerate. This will be the number of simultaneous restarts it can sustain.
|
|
|
|
## Upgrades for bundled Consul
|
|
|
|
Nodes running GitLab-bundled Consul should be:
|
|
|
|
- Members of a healthy cluster prior to upgrading the GitLab Omnibus package.
|
|
- Upgraded one node at a time.
|
|
|
|
NOTE: **NOTE:**
|
|
Running `curl http://127.0.0.1:8500/v1/health/state/critical` from any Consul node will identify existing health issues in the cluster. The command will return an empty array if the cluster is healthy.
|
|
|
|
Consul clusters communicate using the raft protocol. If the current leader goes offline, there needs to be a leader election. A leader node must exist to facilitate synchronization across the cluster. If too many nodes go offline at the same time, the cluster will lose quorum and not elect a leader due to [broken consensus](https://www.consul.io/docs/internals/consensus.html).
|
|
|
|
Consult the [troubleshooting section](#troubleshooting) if the cluster is not able to recover after the upgrade. The [outage recovery](#outage-recovery) may be of particular interest.
|
|
|
|
NOTE: **NOTE:**
|
|
GitLab only uses Consul to store transient data that is easily regenerated. If the bundled Consul was not used by any process other than GitLab itself, then [rebuilding the cluster from scratch](#recreate-from-scratch) is fine.
|
|
|
|
## Troubleshooting
|
|
|
|
### Consul server agents unable to communicate
|
|
|
|
By default, the server agents will attempt to [bind](https://www.consul.io/docs/agent/options.html#_bind) to '0.0.0.0', but they will advertise the first private IP address on the node for other agents to communicate with them. If the other nodes cannot communicate with a node on this address, then the cluster will have a failed status.
|
|
|
|
You will see messages like the following in `gitlab-ctl tail consul` output if you are running into this issue:
|
|
|
|
```plaintext
|
|
2017-09-25_19:53:39.90821 2017/09/25 19:53:39 [WARN] raft: no known peers, aborting election
|
|
2017-09-25_19:53:41.74356 2017/09/25 19:53:41 [ERR] agent: failed to sync remote state: No cluster leader
|
|
```
|
|
|
|
To fix this:
|
|
|
|
1. Pick an address on each node that all of the other nodes can reach this node through.
|
|
1. Update your `/etc/gitlab/gitlab.rb`
|
|
|
|
```ruby
|
|
consul['configuration'] = {
|
|
...
|
|
bind_addr: 'IP ADDRESS'
|
|
}
|
|
```
|
|
|
|
1. Run `gitlab-ctl reconfigure`
|
|
|
|
If you still see the errors, you may have to [erase the Consul database and reinitialize](#recreate-from-scratch) on the affected node.
|
|
|
|
### Consul agents do not start - Multiple private IPs
|
|
|
|
In the case that a node has multiple private IPs the agent be confused as to which of the private addresses to advertise, and then immediately exit on start.
|
|
|
|
You will see messages like the following in `gitlab-ctl tail consul` output if you are running into this issue:
|
|
|
|
```plaintext
|
|
2017-11-09_17:41:45.52876 ==> Starting Consul agent...
|
|
2017-11-09_17:41:45.53057 ==> Error creating agent: Failed to get advertise address: Multiple private IPs found. Please configure one.
|
|
```
|
|
|
|
To fix this:
|
|
|
|
1. Pick an address on the node that all of the other nodes can reach this node through.
|
|
1. Update your `/etc/gitlab/gitlab.rb`
|
|
|
|
```ruby
|
|
consul['configuration'] = {
|
|
...
|
|
bind_addr: 'IP ADDRESS'
|
|
}
|
|
```
|
|
|
|
1. Run `gitlab-ctl reconfigure`
|
|
|
|
### Outage recovery
|
|
|
|
If you lost enough server agents in the cluster to break quorum, then the cluster is considered failed, and it will not function without manual intervention.
|
|
|
|
#### Recreate from scratch
|
|
|
|
By default, GitLab does not store anything in the Consul cluster that cannot be recreated. To erase the Consul database and reinitialize
|
|
|
|
```shell
|
|
gitlab-ctl stop consul
|
|
rm -rf /var/opt/gitlab/consul/data
|
|
gitlab-ctl start consul
|
|
```
|
|
|
|
After this, the cluster should start back up, and the server agents rejoin. Shortly after that, the client agents should rejoin as well.
|
|
|
|
#### Recover a failed cluster
|
|
|
|
If you have taken advantage of Consul to store other data, and want to restore the failed cluster, please follow the [Consul guide](https://learn.hashicorp.com/consul/day-2-operations/outage) to recover a failed cluster.
|