396 lines
20 KiB
Markdown
396 lines
20 KiB
Markdown
---
|
|
status: accepted
|
|
creation-date: "2022-09-08"
|
|
authors: [ "@grzesiek", "@marshall007", "@fabiopitino", "@hswimelar" ]
|
|
coach: "@andrewn"
|
|
approvers: [ "@sgoldstein" ]
|
|
owning-stage: "~devops::enablement"
|
|
participating-stages: []
|
|
---
|
|
|
|
# Next Rate Limiting Architecture
|
|
|
|
## Summary
|
|
|
|
Introducing reasonable application limits is a very important step in any SaaS
|
|
platform scaling strategy. The more users a SaaS platform has, the more
|
|
important it is to introduce sensible rate limiting and policies enforcement
|
|
that will help to achieve availability goals, reduce the problem of noisy
|
|
neighbours for users and ensure that they can keep using a platform
|
|
successfully.
|
|
|
|
This is especially true for GitLab.com. Our goal is to have a reasonable and
|
|
transparent strategy for enforcing application limits, which will become a
|
|
definition of a responsible usage, to help us with keeping our availability and
|
|
user satisfaction at a desired level.
|
|
|
|
We've been introducing various application limits for many years already, but
|
|
we've never had a consistent strategy for doing it. What we want to build now is
|
|
a consistent framework used by engineers and product managers, across entire
|
|
application stack, to define, expose and enforce limits and policies.
|
|
|
|
Lack of consistency in defining limits, not being able to expose them to our
|
|
users, support engineers and satellite services, has negative impact on our
|
|
productivity, makes it difficult to introduce new limits and eventually
|
|
prevents us from enforcing responsible usage on all layers of our application
|
|
stack.
|
|
|
|
This blueprint has been written to consolidate our limits and to describe the
|
|
vision of our next rate limiting and policies enforcement architecture.
|
|
|
|
## Goals
|
|
|
|
**Implement a next architecture for rate limiting and policies definition.**
|
|
|
|
## Challenges
|
|
|
|
- We have many ways to define application limits, in many different places.
|
|
- It is difficult to understand what limits have been applied to a request.
|
|
- It is difficult to introduce new limits, even more to define policies.
|
|
- Finding what limits are defined requires performing a codebase audit.
|
|
- We don't have a good way to expose limits to satellite services like Registry.
|
|
- We enforce a number of different policies via opaque external systems
|
|
(Pipeline Validation Service, Bouncer, Watchtower, Cloudflare, HAProxy).
|
|
- There is not standardized way to define policies in a way consistent with defining limits.
|
|
- It is difficult to understand when a user is approaching a limit threshold.
|
|
- There is no way to automatically notify a user when they are approaching thresholds.
|
|
- There is no single way to change limits for a namespace / project / user / customer.
|
|
- There is no single way to monitor limits through real-time metrics.
|
|
- There is no framework for hierarchical limit configuration (instance / namespace / subgroup / project).
|
|
- We allow disabling rate-limiting for some marquee SaaS customers, but this
|
|
increases a risk for those same customers. We should instead be able to set
|
|
higher limits.
|
|
|
|
## Opportunity
|
|
|
|
We want to build a new framework, making it easier to define limits, quotas and
|
|
policies, and to enforce / adjust them in a controlled way, through robust
|
|
monitoring capabilities.
|
|
|
|
<!-- markdownlint-disable MD029 -->
|
|
|
|
1. Build a framework to define and enforce limits in GitLab Rails.
|
|
2. Build an API to consume limits in satellite service and expose them to users.
|
|
3. Extract parts of this framework into a dedicated GitLab Limits Service.
|
|
|
|
<!-- markdownlint-enable MD029 -->
|
|
|
|
The most important opportunity here is consolidation happening on multiple
|
|
levels:
|
|
|
|
1. Consolidate on the application limits tooling used in GitLab Rails.
|
|
1. Consolidate on the process of adding and managing application limits.
|
|
1. Consolidate on the behavior of hierarchical cascade of limits and overrides.
|
|
1. Consolidate on the application limits tooling used across entire application stack.
|
|
1. Consolidate on the policies enforcement tooling used across entire company.
|
|
|
|
Once we do that we will unlock another opportunity: to ship the new framework /
|
|
tooling as a GitLab feature to unlock these consolidation benefits for our
|
|
users, customers and entire wider community audience.
|
|
|
|
### Limits, quotas and policies
|
|
|
|
This document aims to describe our technical vision for building the next rate
|
|
limiting architecture for GitLab.com. We refer to this architectural evolution
|
|
as "the next rate limiting architecture", but this is a mental shortcut,
|
|
because we actually want to build a better framework that will make it easier
|
|
for us to manage not only rate limits, but also quotas and policies.
|
|
|
|
Below you can find a short definition of what we understand by a limit, by a
|
|
quota and by a policy.
|
|
|
|
- **Limit:** A constraint on application usage, typically used to mitigate
|
|
risks to performance, stability, and security.
|
|
- _Example:_ API calls per second for a given IP address
|
|
- _Example:_ `git clone` events per minute for a given user
|
|
- _Example:_ maximum artifact upload size of 1 GB
|
|
- **Quota:** A global constraint in application usage that is aggregated across an
|
|
entire namespace over the duration of their billing cycle.
|
|
- _Example:_ 400 CI/CD minutes per namespace per month
|
|
- _Example:_ 10 GB transfer per namespace per month
|
|
- **Policy:** A representation of business logic that is decoupled from application
|
|
code. Decoupled policy definitions allow logic to be shared across multiple services
|
|
and/or "hot-loaded" at runtime without releasing a new version of the application.
|
|
- _Example:_ decode and verify a JWT, determine whether the user has access to the
|
|
given resource based on the JWT scopes and claims
|
|
- _Example:_ deny access based on group-level constraints
|
|
(such as IP allowlist, SSO, and 2FA) across all services
|
|
|
|
Technically, all of these are limits, because rate limiting is still
|
|
"limiting", quota is usually a business limit, and policy limits what you can
|
|
do with the application to enforce specific rules. By referring to a "limit" in
|
|
this document we mean a limit that is defined to protect business, availability
|
|
and security.
|
|
|
|
### Framework to define and enforce limits
|
|
|
|
First we want to build a new framework that will allow us to define and enforce
|
|
application limits, in the GitLab Rails project context, in a more consistent
|
|
and established way. In order to do that, we will need to build a new
|
|
abstraction that will tell engineers how to define a limit in a structured way
|
|
(presumably using YAML or Cue format) and then how to consume the limit in the
|
|
application itself.
|
|
|
|
We already do have many limits defined in the application, we can use them to
|
|
triangulate to find a reasonable abstraction that will consolidate how we
|
|
define, use and enforce limits.
|
|
|
|
We envision building a simple Ruby library here (we can add it to LabKit) that
|
|
will make it trivial for engineers to check if a certain limit has been
|
|
exceeded or not.
|
|
|
|
```yaml
|
|
name: my_limit_name
|
|
actors: user
|
|
context: project, group, pipeline
|
|
type: rate / second
|
|
group: pipeline::execution
|
|
limits:
|
|
warn: 2B / day
|
|
soft: 100k / s
|
|
hard: 500k / s
|
|
```
|
|
|
|
```ruby
|
|
Gitlab::Limits::RateThreshold.enforce(:my_limit_name) do |threshold|
|
|
actor = current_user
|
|
context = current_project
|
|
|
|
threshold.available do |limit|
|
|
# ...
|
|
end
|
|
|
|
threshold.approaching do |limit|
|
|
# ...
|
|
end
|
|
|
|
threshold.exceeded do |limit|
|
|
# ...
|
|
end
|
|
end
|
|
```
|
|
|
|
In the example above, when `my_limit_name` is defined in YAML, engineers will
|
|
be check the current state and execute appropriate code block depending on the
|
|
past usage / resource consumption.
|
|
|
|
Things we want to build and support by default:
|
|
|
|
1. Comprehensive dashboards showing how often limits are being hit.
|
|
1. Notifications about the risk of hitting limits.
|
|
1. Automation checking if limits definitions are being enforced properly.
|
|
1. Different types of limits - time bound / number per resource etc.
|
|
1. A panel that makes it easy to override limits per plan / namespace.
|
|
1. Logging that will expose limits applied in Kibana.
|
|
1. An automatically generated documentation page describing all the limits.
|
|
|
|
### API to expose limits and policies
|
|
|
|
Once we have an established a consistent way to define application limits we
|
|
can build a few API endpoints that will allow us to expose them to our users,
|
|
customers and other satellite services that may want to consume them.
|
|
|
|
Users will be able to ask the API about the limits / thresholds that have been
|
|
set for them, how often they are hitting them, and what impact those might have
|
|
on their business. This kind of transparency can help them with communicating
|
|
their needs to customer success team at GitLab, and we will be able to
|
|
communicate how the responsible usage is defined at a given moment.
|
|
|
|
Because of how GitLab architecture has been built, GitLab Rails application, in
|
|
most cases, behaves as a central enterprise service bus (ESB) and there are a
|
|
few satellite services communicating with it. Services like Container Registry,
|
|
GitLab Runners, Gitaly, Workhorse, KAS could use the API to receive a set of
|
|
application limits those are supposed to enforce. This will still allow us to
|
|
define all of them in a single place.
|
|
|
|
We should, however, avoid the possible negative-feedback-loop, that will put
|
|
additional strain on the Rails application when there is a sudden increase in
|
|
usage happening. This might be a big customer starting a new automation that
|
|
traverses our API or a Denial of Service attack. In such cases, the additional
|
|
traffic will reach GitLab Rails and subsequently also other satellite services.
|
|
Then the satellite services may need to consult Rails again to obtain new
|
|
instructions / policies around rate limiting the increased traffic. This can
|
|
put additional strain on Rails application and eventually degrade performance
|
|
even more. In order to avoid this problem, we should extract the API endpoints
|
|
to separate service (see the section below) if the request rate to those
|
|
endpoints depends on the volume of incoming traffic. Alternatively we can keep
|
|
those endpoints in Rails if the increased traffic will not translate into
|
|
increase of requests rate or increase in resources consumption on these API
|
|
endpoints on the Rails side.
|
|
|
|
#### Decoupled Limits Service
|
|
|
|
At some point we may decide that it is time to extract a stateful backend
|
|
responsible for storing metadata around limits, all the counters and state
|
|
required, and exposing API, out of Rails.
|
|
|
|
It is impossible to make a decision about extracting such a decoupled limits
|
|
service yet, because we will need to ship more proof-of-concept work, and
|
|
concrete iterations to inform us better about when and how we should do that. We
|
|
will depend on the Evolution Architecture practice to guide us towards either
|
|
extracting Decoupled Limits Service or not doing that at all.
|
|
|
|
As we evolve this blueprint, we will document our findings and insights about
|
|
how this service should look like, in this section of the document.
|
|
|
|
### GitLab Policy Service
|
|
|
|
_Disclaimer_: Extracting a GitLab Policy Service might be out of scope
|
|
of the current workstream organized around implementing this blueprint.
|
|
|
|
Not all limits can be easily described in YAML. There are some more complex
|
|
policies that require a bit more sophisticated approach and a declarative
|
|
programming language used to enforce them. One example of such a language might be
|
|
[Rego](https://www.openpolicyagent.org/docs/latest/policy-language/) language.
|
|
It is a standardized way to define policies in
|
|
[OPA - Open Policy Agent](https://www.openpolicyagent.org/). At GitLab we are
|
|
already using OPA in some departments. We envision the need to additional
|
|
consolidation to not only consolidate on the tooling we are using internally at
|
|
GitLab, but to also transform the Next Rate Limiting Architecture into
|
|
something we can make a part of the product itself.
|
|
|
|
Today, we already do have a policy service we are using to decide whether a
|
|
pipeline can be created or not. There are many policies defined in
|
|
[Pipeline Validation Service](https://gitlab.com/gitlab-org/modelops/anti-abuse/pipeline-validation-service).
|
|
There is a significant opportunity here in transforming Pipeline Validation
|
|
Service into a general purpose GitLab Policy Service / GitLab Policy Agent that
|
|
will be well integrated into the GitLab product itself.
|
|
|
|
Generalizing Pipeline Validation Service into GitLab Policy Service can bring a
|
|
few interesting benefits:
|
|
|
|
1. Consolidate on our tooling across the company to improve efficiency.
|
|
1. Integrate our GitLab Rails limits framework to resolve policies using the policy service.
|
|
1. Do not struggle to define complex policies in YAML and hack evaluating them in Ruby.
|
|
1. Build a policy for GraphQL queries limiting using query execution cost estimation.
|
|
1. Make it easier to resolve policies that do not need "hierarchical limits" structure.
|
|
1. Make GitLab Policy Service part of the product and integrate it into the single application.
|
|
|
|
We envision using GitLab Policy Service to be place to define policies that do
|
|
not require knowing anything about the hierarchical structure of the limits.
|
|
There are limits that do not need this, like IP addresses allow-list, spam
|
|
checks, configuration validation etc.
|
|
|
|
We defined "Policy" as a stateless, functional-style, limit. It takes input
|
|
arguments and evaluates to either true or false. It should not require a global
|
|
counter or any other volatile global state to get evaluated. It may still
|
|
require to have a globally defined rules / configuration, but this state is not
|
|
volatile in a same way a rate limiting counter may be, or a megabytes consumed
|
|
to evaluate quota limit.
|
|
|
|
#### Policies used internally and externally
|
|
|
|
The GitLab Policy Service might be used in two different ways:
|
|
|
|
1. Rails limits framework will use it as a source of policies enforced internally.
|
|
1. The policy service feature will be used as a backend to store policies defined by users.
|
|
|
|
These are two slightly different use-cases: first one is about using
|
|
internally-defined policies to ensure the stability / availability of a GitLab
|
|
instance (GitLab.com or self-managed instance). The second use-case is about
|
|
making GitLab Policy Service a feature that users will be able to build on top
|
|
of.
|
|
|
|
Both use-cases are valid but we will need to make technical decision about how
|
|
to separate them. Even if we decide to implement them both in a single service,
|
|
we will need to draw a strong boundary between the two.
|
|
|
|
The same principle might apply to Decouple Limits Service described in one of
|
|
the sections of this document above.
|
|
|
|
#### The two limits / policy services
|
|
|
|
It is possible that GitLab Policy Service and Decoupled Limits Service can
|
|
actually be the same thing. It, however, depends on the implementation details
|
|
that we can't predict yet, and the decision about merging these services
|
|
together will need to be informed by subsequent iterations' feedback.
|
|
|
|
## Hierarchical limits
|
|
|
|
GitLab application aggregates users, projects, groups and namespaces in a
|
|
hierarchical way. This hierarchical structure has been designed to make it
|
|
easier to manage permissions, streamline workflows, and allow users and
|
|
customers to store related projects, repositories, and other artifacts,
|
|
together.
|
|
|
|
It is important to design the new rate limiting framework in a way that it
|
|
built on top of this hierarchical structure and engineers, customers, SREs and
|
|
other stakeholders can understand how limits are being applied, enforced and
|
|
overridden within the hierarchy of namespaces, groups and projects.
|
|
|
|
We want to reduce the cognitive load required to understand how limits are
|
|
being managed within the existing permissions structure. We might need to build
|
|
a simple and easy-to-understand formula for how our application decides which
|
|
limits and thresholds to apply for a given request and a given actor:
|
|
|
|
> GitLab will read default limits for every operation, all overrides configured
|
|
> and will choose a limit with the highest precedence configured. A limit
|
|
> precedence needs to be explicitly configured for every override, a default
|
|
> limit has precedence 100.
|
|
|
|
One way in which we can simplify limits management in general is to:
|
|
|
|
1. Have default limits / thresholds defined in YAML files with a default precedence 100.
|
|
1. Allow limits to be overridden through the API, store overrides in the database.
|
|
1. Every limit / threshold override needs to have an integer precedence value provided.
|
|
1. Build an API that will take an actor and expose limits applicable for it.
|
|
1. Build a dashboard showing actors with non-standard limits / overrides.
|
|
1. Build a observability around this showing in Kibana when non-standard limits are being used.
|
|
|
|
The points above represent an idea to use precedence score (or Z-Index for
|
|
limits), but there may be better solutions, like just defining a direction of
|
|
overrides - a lower limit might always override a limit defined higher in the
|
|
hierarchy. Choosing a proper solution will require a thoughtful research.
|
|
|
|
## Principles
|
|
|
|
1. Try to avoid building rate limiting framework in a tightly coupled way.
|
|
1. Build application limits API in a way that it can be easily extracted to a separate service.
|
|
1. Build application limits definition in a way that is independent from the Rails application.
|
|
1. Build tooling that produce consistent behavior and results across programming languages.
|
|
1. Build the new framework in a way that we can extend to allow self-managed administrators to customize limits.
|
|
1. Maintain consistent features and behavior across SaaS and self-managed codebase.
|
|
1. Be mindful about a cognitive load added by the hierarchical limits, aim to reduce it.
|
|
|
|
## Phases and iterations
|
|
|
|
1. **Compile examples of current most important application limits (Owning Team)**
|
|
- Owning Team (in collaboration with Stage Groups) compiles a list of the
|
|
most important application limits used in Rails today.
|
|
|
|
1. **Implement Rate Limiting Framework in Rails (Owning Team)**
|
|
- Triangulate rate limiting abstractions based on the data gathered in Phase 1.
|
|
- Develop YAML model for limits.
|
|
- Build Rails SDK.
|
|
- Create examples showcasing usage of the new rate limits SDK.
|
|
|
|
1. **Team fan out of Rails SDK (Stage Groups)**
|
|
- Individual stage groups begin using the SDK built in Phase 2 for new limit and policies.
|
|
- Stage groups begin replacing historical ad hoc limit implementations with the SDK.
|
|
- (Owning team) Provides means to monitor and observe the progress of the replacement effort. Ideally this is broken down to the `feature_category` level to drive group-level buy-in.
|
|
|
|
1. **Enable Satellite Services to Use the Rate Limiting Framework (Owning Team)**
|
|
- Determine if the goals of Phase 4 are best met by either:
|
|
- Extracting the Rails rate limiting service into a decoupled service.
|
|
- Implementing a separate Go library which uses the same backend (for example, Redis) for rate limiting.
|
|
|
|
1. **SDK for Satellite Services (Owning Team)**
|
|
- Build Go SDK.
|
|
- Create examples showcasing usage of the new rate limits SDK.
|
|
|
|
1. **Team fan out for Satellite Services (Stage Groups)**
|
|
- Individual stage groups begin using the SDK built in Phase 5 for new limit and policies.
|
|
- Stage groups begin replacing historical ad hoc limit implementations with the SDK.
|
|
|
|
## Status
|
|
|
|
Request For Comments.
|
|
|
|
## Timeline
|
|
|
|
- 2022-04-27: [Rate Limit Architecture Working Group](https://about.gitlab.com/company/team/structure/working-groups/rate-limit-architecture/) started.
|
|
- 2022-06-07: Working Group members [started submitting technical proposals](https://gitlab.com/gitlab-org/gitlab/-/issues/364524) for the next rate limiting architecture.
|
|
- 2022-06-15: We started [scoring proposals](https://docs.google.com/spreadsheets/d/1DFHU1kSdTnpydwM5P2RK8NhVBNWgEHvzT72eOhB8F9E) submitted by Working Group members.
|
|
- 2022-07-06: A fourth, [consolidated proposal](https://gitlab.com/gitlab-org/gitlab/-/issues/364524#note_1017640650), has been submitted.
|
|
- 2022-07-12: Started working on the design document following [Architecture Evolution Workflow](https://about.gitlab.com/handbook/engineering/architecture/workflow/).
|
|
- 2022-09-08: The initial version of the blueprint has been merged.
|