219 lines
10 KiB
Markdown
219 lines
10 KiB
Markdown
---
|
|
stage: none
|
|
group: unassigned
|
|
comments: false
|
|
description: 'Improve scalability of GitLab CI/CD'
|
|
---
|
|
|
|
# CI/CD Scaling
|
|
|
|
## Summary
|
|
|
|
GitLab CI/CD is one of the most data and compute intensive components of GitLab.
|
|
Since its [initial release in November 2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/),
|
|
the CI/CD subsystem has evolved significantly. It was [integrated into GitLab in September 2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/)
|
|
and has become [one of the most beloved CI/CD solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
|
|
|
|
GitLab CI/CD has come a long way since the initial release, but the design of
|
|
the data storage for pipeline builds remains almost the same since 2012. We
|
|
store all the builds in PostgreSQL in `ci_builds` table, and because we are
|
|
creating more than 5 million builds each day on GitLab.com we are reaching
|
|
database limits that are slowing our development velocity down.
|
|
|
|
On February 1st, 2021, GitLab.com surpassed 1 billion CI/CD builds created. In
|
|
February 2022 we reached 2 billion of CI/CD build stored in the database. The
|
|
number of builds continues to grow exponentially.
|
|
|
|
The screenshot below shows our forecast created at the beginning of 2021, that
|
|
turned out to be quite accurate.
|
|
|
|
![CI builds cumulative with forecast](ci_builds_cumulative_forecast.png)
|
|
|
|
## Goals
|
|
|
|
**Enable future growth by making processing 20M builds in a day possible.**
|
|
|
|
## Challenges
|
|
|
|
The current state of CI/CD product architecture needs to be updated if we want
|
|
to sustain future growth.
|
|
|
|
### We were running out of the capacity to store primary keys: DONE
|
|
|
|
The primary key in `ci_builds` table is an integer value, generated in a sequence.
|
|
Historically, Rails used to use [integer](https://www.postgresql.org/docs/14/datatype-numeric.html)
|
|
type when creating primary keys for a table. We did use the default when we
|
|
[created the `ci_builds` table in 2012](https://gitlab.com/gitlab-org/gitlab/-/blob/046b28312704f3131e72dcd2dbdacc5264d4aa62/db/ci/migrate/20121004165038_create_builds.rb).
|
|
[The behavior of Rails has changed](https://github.com/rails/rails/pull/26266)
|
|
since the release of Rails 5. The framework is now using `bigint` type that is 8
|
|
bytes long, however we have not migrated primary keys for `ci_builds` table to
|
|
`bigint` yet.
|
|
|
|
In early 2021 we had estimated that would run out of the capacity of the integer
|
|
type to store primary keys in `ci_builds` table before December 2021. If it had
|
|
happened without a viable workaround or an emergency plan, GitLab.com would go
|
|
down. `ci_builds` was just one of many tables that were running out of the
|
|
primary keys available in Int4 sequence.
|
|
|
|
Before October 2021, our Database team had managed to migrate all the risky
|
|
tables' primary keys to big integers.
|
|
|
|
See the [related Epic](https://gitlab.com/groups/gitlab-org/-/epics/5657) for more details.
|
|
|
|
### Some CI/CD database tables are too large: IN PROGRESS
|
|
|
|
There is more than two billion rows in `ci_builds` table. We store many
|
|
terabytes of data in that table, and the total size of indexes is measured in
|
|
terabytes as well.
|
|
|
|
This amount of data contributes to a significant number of performance
|
|
problems we experience on our CI PostgreSQL database.
|
|
|
|
Most of the problems are related to how PostgreSQL database works internally,
|
|
and how it is making use of resources on a node the database runs on. We are at
|
|
the limits of vertical scaling of the CI primary database nodes and we
|
|
frequently see a negative impact of the `ci_builds` table on the overall
|
|
performance, stability, scalability and predictability of the CI database
|
|
GitLab.com depends on.
|
|
|
|
The size of the table also hinders development velocity because queries that
|
|
seem fine in the development environment may not work on GitLab.com. The
|
|
difference in the dataset size between the environments makes it difficult to
|
|
predict the performance of even the most simple queries.
|
|
|
|
Team members and the wider community members are struggling to contribute the
|
|
Verify area, because we restricted the possibility of extending `ci_builds`
|
|
even further. Our static analysis tools prevent adding more columns to this
|
|
table. Adding new queries is unpredictable because of the size of the dataset
|
|
and the amount of queries executed using the table. This significantly hinders
|
|
the development velocity and contributes to incidents on the production
|
|
environment.
|
|
|
|
We also expect a significant, exponential growth in the upcoming years.
|
|
|
|
One of the forecasts done using [Facebook's Prophet](https://facebook.github.io/prophet/)
|
|
shows that in the first half of 2024 we expect seeing 20M builds created on
|
|
GitLab.com each day. In comparison to around 5M we see created today. This is
|
|
10x growth from numbers we saw in 2021.
|
|
|
|
![CI builds daily forecast](ci_builds_daily_forecast.png)
|
|
|
|
**Status**: As of October 2021 we reduced the growth rate of `ci_builds` table
|
|
by writing build options and variables to `ci_builds_metadata` table. We are
|
|
also working on partitioning the largest CI/CD database tables using
|
|
[time decay pattern](../ci_data_decay/index.md).
|
|
|
|
### Queuing mechanisms were using the large table: DONE
|
|
|
|
Because of how large the table is, mechanisms that we used to build queues of
|
|
pending builds (there is more than one queue), were not very efficient. Pending
|
|
builds represented a small fraction of what we store in the `ci_builds` table,
|
|
yet we needed to find them in this big dataset to determine an order in which we
|
|
wanted to process them.
|
|
|
|
This mechanism was very inefficient, and it had been causing problems on the
|
|
production environment frequently. This usually resulted in a significant drop
|
|
of the CI/CD Apdex score, and sometimes even caused a significant performance
|
|
degradation in the production environment.
|
|
|
|
There were multiple other strategies that we considered to improve performance and
|
|
reliability. We evaluated using [Redis queuing](https://gitlab.com/gitlab-org/gitlab/-/issues/322972), or
|
|
[a separate table that would accelerate SQL queries used to build queues](https://gitlab.com/gitlab-org/gitlab/-/issues/322766).
|
|
We decided to proceed with the latter.
|
|
|
|
In October 2021 we finished shipping the new architecture of builds queuing
|
|
[on GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908).
|
|
We then made the new architecture [generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954).
|
|
|
|
### Moving big amounts of data is challenging: IN PROGRESS
|
|
|
|
We store a significant amount of data in `ci_builds` table. Some of the columns
|
|
in that table store a serialized user-provided data. Column `ci_builds.options`
|
|
stores more than 600 gigabytes of data, and `ci_builds.yaml_variables` more
|
|
than 300 gigabytes (as of February 2021).
|
|
|
|
It is a lot of data that needs to be reliably moved to a different place.
|
|
Unfortunately, right now, our [background migrations](../../../development/database/background_migrations.md)
|
|
are not reliable enough to migrate this amount of data at scale. We need to
|
|
build mechanisms that will give us confidence in moving this data between
|
|
columns, tables, partitions or database shards.
|
|
|
|
Effort to improve background migrations will be owned by our Database Team.
|
|
|
|
**Status**: In progress. We plan to ship further improvements that will be
|
|
described in a separate architectural blueprint.
|
|
|
|
## Proposal
|
|
|
|
Below you can find the original proposal made in early 2021 about how we want
|
|
to move forward with CI Scaling effort:
|
|
|
|
> Making GitLab CI/CD product ready for the scale we expect to see in the
|
|
> upcoming years is a multi-phase effort.
|
|
>
|
|
> First, we want to focus on things that are urgently needed right now. We need
|
|
> to fix primary keys overflow risk and unblock other teams that are working on
|
|
> database partitioning and sharding.
|
|
>
|
|
> We want to improve known bottlenecks, like
|
|
> builds queuing mechanisms that is using the large table, and other things that
|
|
> are holding other teams back.
|
|
>
|
|
> Extending CI/CD metrics is important to get a better sense of how the system
|
|
> performs and to what growth should we expect. This will make it easier for us
|
|
> to identify bottlenecks and perform more advanced capacity planning.
|
|
>
|
|
> Next step is to better understand how we can leverage strong time-decay
|
|
> characteristic of CI/CD data. This might help us to partition CI/CD dataset to
|
|
> reduce the size of CI/CD database tables.
|
|
|
|
## Iterations
|
|
|
|
Work required to achieve our next CI/CD scaling target is tracked in the
|
|
[CI/CD Scaling](https://gitlab.com/groups/gitlab-org/-/epics/5745) epic.
|
|
|
|
1. ✓ Migrate primary keys to big integers on GitLab.com.
|
|
1. ✓ Implement the new architecture of builds queuing on GitLab.com.
|
|
1. ✓ [Make the new builds queuing architecture generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954).
|
|
1. [Partition CI/CD data using time-decay pattern](../ci_data_decay/index.md).
|
|
|
|
## Status
|
|
|
|
Created at 21.01.2021, approved at 26.04.2021.
|
|
|
|
Status: In progress.
|
|
|
|
## Who
|
|
|
|
Proposal:
|
|
|
|
<!-- vale gitlab.Spelling = NO -->
|
|
|
|
| Role | Who
|
|
|------------------------------|-------------------------|
|
|
| Author | Grzegorz Bizon |
|
|
| Architecture Evolution Coach | Kamil Trzciński |
|
|
| Engineering Leader | Cheryl Li |
|
|
| Product Manager | Jackie Porter |
|
|
| Domain Expert / Verify | Fabio Pitino |
|
|
| Domain Expert / Database | Jose Finotto |
|
|
| Domain Expert / PostgreSQL | Nikolay Samokhvalov |
|
|
|
|
DRIs:
|
|
|
|
| Role | Who
|
|
|------------------------------|------------------------|
|
|
| Leadership | Cheryl Li |
|
|
| Product | Jackie Porter |
|
|
| Engineering | Grzegorz Bizon |
|
|
|
|
Domain experts:
|
|
|
|
| Area | Who
|
|
|------------------------------|------------------------|
|
|
| Domain Expert / Verify | Fabio Pitino |
|
|
| Domain Expert / Verify | Marius Bobin |
|
|
| Domain Expert / Database | Jose Finotto |
|
|
| Domain Expert / PostgreSQL | Nikolay Samokhvalov |
|
|
|
|
<!-- vale gitlab.Spelling = YES -->
|