235 lines
11 KiB
Markdown
235 lines
11 KiB
Markdown
---
|
||
type: reference, dev
|
||
stage: none
|
||
group: Verify
|
||
---
|
||
|
||
# Contribute to Verify stage codebase
|
||
|
||
## What are we working on in Verify?
|
||
|
||
Verify stage is working on a comprehensive Continuous Integration platform
|
||
integrated into the GitLab product. Our goal is to empower our users to make
|
||
great technical and business decisions, by delivering a fast, reliable, secure
|
||
platform that verifies assumptions that our users make, and check them against
|
||
the criteria defined in CI/CD configuration. They could be unit tests, end-to-end
|
||
tests, benchmarking, performance validation, code coverage enforcement, and so on.
|
||
|
||
Feedback delivered by GitLab CI/CD makes it possible for our users to make well
|
||
informed decisions about technological and business choices they need to make
|
||
to succeed. Why is Continuous Integration a mission critical product?
|
||
|
||
GitLab CI/CD is our platform to deliver feedback to our users and customers.
|
||
|
||
They contribute their continuous integration configuration files
|
||
`.gitlab-ci.yml` to describe the questions they want to get answers for. Each
|
||
time someone pushes a commit or triggers a pipeline we need to find answers for
|
||
very important questions that have been asked in CI/CD configuration.
|
||
|
||
Failing to answer these questions or, what might be even worse, providing false
|
||
answers, might result in a user making a wrong decision. Such wrong decisions
|
||
can have very severe consequences.
|
||
|
||
## Core principles of our CI/CD platform
|
||
|
||
Data produced by the platform should be:
|
||
|
||
1. Accurate.
|
||
1. Durable.
|
||
1. Accessible.
|
||
|
||
The platform itself should be:
|
||
|
||
1. Reliable.
|
||
1. Secure.
|
||
1. Deterministic.
|
||
1. Trustworthy.
|
||
1. Fast.
|
||
1. Simple.
|
||
|
||
Since the inception of GitLab CI/CD, we have lived by these principles,
|
||
and they serve us and our users well. Some examples of these principles are that:
|
||
|
||
- The feedback delivered by GitLab CI/CD and data produced by the platform should be accurate.
|
||
If a job fails and we notify a user that it was successful, it can have severe negative consequences.
|
||
- Feedback needs to be available when a user needs it and data can not disappear unexpectedly when engineers need it.
|
||
- It all doesn’t matter if the platform is not secure and we
|
||
are leaking credentials or secrets.
|
||
- When a user provides a set of preconditions in a form of CI/CD configuration, the result should be deterministic each time a pipeline runs, because otherwise the platform might not be trustworthy.
|
||
- If it is fast, simple to use and has a great UX it will serve our users well.
|
||
|
||
## Building things in Verify
|
||
|
||
### Measure before you optimize, and make data-informed decisions
|
||
|
||
It is very difficult to optimize something that you can not measure. How would you
|
||
know if you succeeded, or how significant the success was? If you are working on
|
||
a performance or reliability improvement, make sure that you measure things before
|
||
you optimize them.
|
||
|
||
The best way to measure stuff is to add a Prometheus metric. Counters, gauges, and
|
||
histograms are great ways to quickly get approximated results. Unfortunately this
|
||
is not the best way to measure tail latency. Prometheus metrics, especially histograms,
|
||
are usually approximations.
|
||
|
||
If you have to measure tail latency, like how slow something could be or how
|
||
large a request payload might be, consider adding custom application logs and
|
||
always use structured logging.
|
||
|
||
It's useful to use profiling and flamegraphs to understand what the code execution
|
||
path truly looks like!
|
||
|
||
### Strive for simple solutions, avoid clever solutions
|
||
|
||
It is sometimes tempting to use a clever solution to deliver something more
|
||
quickly. We want to avoid shipping clever code, because it is usually more
|
||
difficult to understand and maintain in the long term. Instead, we want to
|
||
focus on boring solutions that make it easier to evolve the codebase and keep the
|
||
contribution barrier low. We want to find solutions that are as simple as
|
||
possible.
|
||
|
||
### Do not confuse boring solutions with easy solutions
|
||
|
||
Boring solutions are sometimes confused with easy solutions. Very often the
|
||
opposite is true. An easy solution might not be simple - for example, a complex
|
||
new library can be included to add a very small functionality that otherwise
|
||
could be implemented quickly - it is easier to include this library than to
|
||
build this thing, but it would bring a lot of complexity into the product.
|
||
|
||
On the other hand, it is also possible to over-engineer a solution when a simple,
|
||
well tested, and well maintained library is available. In that case using the
|
||
library might make sense. We recognize that we are constantly balancing simple
|
||
and easy solutions, and that finding the right balance is important.
|
||
|
||
### "Simple" is not mutually exclusive with "flexible"
|
||
|
||
Building simple things does not mean that more advanced and flexible solutions
|
||
will not be available. A good example here is an expanding complexity of
|
||
writing `.gitlab-ci.yml` configuration. For example, you can use a simple
|
||
method to define an environment name:
|
||
|
||
```yaml
|
||
deploy:
|
||
environment: production
|
||
script: cap deploy
|
||
```
|
||
|
||
But the `environment` keyword can be also expanded into another level of
|
||
configuration that can offer more flexibility.
|
||
|
||
```yaml
|
||
deploy:
|
||
environment:
|
||
name: review/$CI_COMMIT_REF_SLUG
|
||
url: https://prod.example.com
|
||
script: cap deploy
|
||
```
|
||
|
||
This kind of approach shields new users from the complexities of the platform,
|
||
but still allows them to go deeper if they need to. This approach can be
|
||
applied to many other technical implementations.
|
||
|
||
### Make things observable
|
||
|
||
GitLab is a DevOps platform. We popularize DevOps because it helps companies
|
||
be more efficient and achieve better results. One important component of
|
||
DevOps culture is to take ownership over features and code that you are
|
||
building. It is very difficult to do that when you don’t know how your features
|
||
perform and behave in the production environment.
|
||
|
||
This is why we want to make our features and code observable. It
|
||
should be written in a way that an author can understand how well or how poorly
|
||
the feature or code behaves in the production environment. We usually accomplish
|
||
that by introducing the proper mix of Prometheus metrics and application
|
||
loggers.
|
||
|
||
**TODO** document when to use Prometheus metrics, when to use loggers. Write a
|
||
few sentences about histograms and counters. Write a few sentences highlighting
|
||
importance of metrics when doing incremental rollouts.
|
||
|
||
### Protect customer data
|
||
|
||
Making data produced by our CI/CD platform durable is important. We recognize that
|
||
data generated in the CI/CD by users and customers is
|
||
something important and we must protect it. This data is not only important
|
||
because it can contain important information, we also do have compliance and
|
||
auditing responsibilities.
|
||
|
||
Therefore we must take extra care when we are writing migrations
|
||
that permanently removes data from our database, or when we are define
|
||
new retention policies.
|
||
|
||
As a general rule, when you are writing code that is supposed to remove
|
||
data from the database, file system, or object storage, you should get an extra pair
|
||
of eyes on your changes. When you are defining a new retention policy, you
|
||
should double check with PMs and EMs.
|
||
|
||
### Get your changes reviewed
|
||
|
||
When your merge request is ready for reviews you must assign
|
||
reviewers and then maintainers. Depending on the complexity of a change, you
|
||
might want to involve the people that know the most about the codebase area you are
|
||
changing. We do have many domain experts in Verify and it is absolutely acceptable to
|
||
ask them to review your code when you are not certain if a reviewer or
|
||
maintainer assigned by the Reviewer Roulette has enough context about the
|
||
change.
|
||
|
||
The reviewer roulette offers useful suggestions, but as assigning the right
|
||
reviewers is important it should not be done automatically every time. It might
|
||
not make sense to assign someone who knows nothing about the area you are
|
||
updating, because their feedback might be limited to code style and syntax.
|
||
Depending on the complexity and impact of a change, assigning the right people
|
||
to review your changes might be very important.
|
||
|
||
If you don’t know who to assign, consult `git blame` or ask in the `#verify`
|
||
Slack channel (GitLab team members only).
|
||
|
||
### Incremental rollouts
|
||
|
||
After your merge request is merged by a maintainer, it is time to release it to
|
||
users and the wider community. We usually do this with feature flags.
|
||
While not every merge request needs a feature flag, most merge
|
||
requests in Verify should have [feature flags](https://about.gitlab.com/handbook/product-development-flow/feature-flag-lifecycle/#when-to-use-feature-flags).
|
||
|
||
If you already follow the advice on this page, you probably already have a
|
||
few metrics and perhaps a few loggers added that make your new code observable
|
||
in the production environment. You can now use these metrics to incrementally
|
||
roll out your changes!
|
||
|
||
A typical scenario involves enabling a few features in a few internal projects
|
||
while observing your metrics or loggers. Be aware that there might be a
|
||
small delay involved in ingesting logs in Elastic or Kibana. After you confirm
|
||
the feature works well with internal projects you can start an
|
||
incremental rollout for other projects.
|
||
|
||
Avoid using "percent of time" incremental rollouts. These are error prone,
|
||
especially when you are checking feature flags in a few places in the codebase
|
||
and you have not memoized the result of a check in a single place.
|
||
|
||
### Do not cause our Universe to implode
|
||
|
||
During one of the first GitLab Contributes events we had a discussion about the importance
|
||
of keeping CI/CD pipeline, stage, and job statuses accurate. We considered a hypothetical
|
||
scenario relating to a software being built by one of our [early customers](https://about.gitlab.com/blog/2016/11/23/gitlab-adoption-growing-at-cern/)
|
||
|
||
> What happens if software deployed to the [Large Hadron Collider (LHC)](https://en.wikipedia.org/wiki/Large_Hadron_Collider),
|
||
> breaks because of a bug in GitLab CI/CD that showed that a pipeline
|
||
> passed, but this data was not accurate and the software deployed was actually
|
||
> invalid? A problem like this could cause the LHC to malfunction, which
|
||
> could generate a new particle that would then cause the universe to implode.
|
||
|
||
That would be quite an undesirable outcome of a small bug in GitLab CI/CD status
|
||
processing. Please take extra care when you are working on CI/CD statuses,
|
||
we don’t want to implode our Universe!
|
||
|
||
This is an extreme and unlikely scenario, but presenting data that is not accurate
|
||
can potentially cause a myriad of problems through the
|
||
[butterfly effect](https://en.wikipedia.org/wiki/Butterfly_effect).
|
||
There are much more likely scenarios that
|
||
can have disastrous consequences. GitLab CI/CD is being used by companies
|
||
building medical, aviation, and automotive software. Continuous Integration is
|
||
a mission critical part of software engineering.
|
||
|
||
When you are working on a subsystem for pipeline processing and transitioning
|
||
CI/CD statuses, request an additional opinion on the design from a domain expert
|
||
as early as possible and hold others accountable for doing the same.
|