---
stage: Platforms
group: Scalability
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Rails request SLIs (service level indicators)

> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4

NOTE:
This SLI is used for service monitoring, but by default not for
[error budgets for stage groups](../stage_group_observability/index.md#error-budget).
You can [opt in](#error-budget-attribution-and-ownership).

The request Apdex SLI and the error rate SLI are [SLIs defined in the application](index.md).

The request Apdex measures the duration of successful requests as an indicator for
application performance. This includes the REST and GraphQL API, and the
regular controller endpoints.

The error rate measures unsuccessful requests as an indicator for
server misbehavior. This includes the REST API and the
regular controller endpoints.

The SLIs are recorded in these counters:

1. `gitlab_sli_rails_request_apdex_total`: This counter is
   incremented for every request that did not result in a response
   with a `5xx` status code. Excluding `5xx` responses ensures slow failures are not
   counted twice, because those requests are already counted in the error SLI.

1. `gitlab_sli_rails_request_apdex_success_total`: This counter is
   incremented for every successful request that performed faster than
   the [defined target duration depending on the endpoint's urgency](#adjusting-request-urgency).

1. `gitlab_sli_rails_request_error_total`: This counter is
   incremented for every request that resulted in a response
   with a `5xx` status code.

1. `gitlab_sli_rails_request_total`: This counter is
   incremented for every request.

These counters are labeled with:

1. `endpoint_id`: The identification of the Rails controller or the
   Grape API endpoint.

1. `feature_category`: The feature category specified for that
   controller or API endpoint.

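The counting rules above can be illustrated with a minimal sketch. This is not the GitLab implementation: the `RequestSliSketch` class, its `record` and `total` methods, and the `example_category` label value are hypothetical, and the counters are plain in-memory integers rather than real metrics. Only the counter names and the two labels come from this page.

```ruby
# Hypothetical in-memory stand-in for the counters described above.
# Counter names mirror the metric names; everything else is illustrative.
class RequestSliSketch
  def initialize
    # Keys are [metric_name, labels] pairs; missing keys start at 0.
    @counters = Hash.new(0)
  end

  # status: HTTP status code, duration and target: seconds.
  def record(endpoint_id:, feature_category:, status:, duration:, target:)
    labels = { endpoint_id: endpoint_id, feature_category: feature_category }

    # Every request counts towards the error SLI total.
    @counters[[:gitlab_sli_rails_request_total, labels]] += 1

    if (500..599).cover?(status)
      # Failures are counted in the error SLI only, never in the Apdex.
      @counters[[:gitlab_sli_rails_request_error_total, labels]] += 1
    else
      @counters[[:gitlab_sli_rails_request_apdex_total, labels]] += 1
      if duration < target
        @counters[[:gitlab_sli_rails_request_apdex_success_total, labels]] += 1
      end
    end
  end

  # Sums one metric across all label sets.
  def total(metric)
    @counters.sum { |(name, _labels), value| name == metric ? value : 0 }
  end
end

sli = RequestSliSketch.new
common = {
  endpoint_id: 'Projects::RawController#show',
  feature_category: 'example_category', # placeholder label value
  target: 1.0
}
sli.record(**common, status: 200, duration: 0.3) # fast success
sli.record(**common, status: 200, duration: 2.5) # slow success
sli.record(**common, status: 503, duration: 4.0) # slow failure
```

After these three calls, the error SLI counts one error out of three requests, and the Apdex counts one success out of two measured requests. The slow `503` response is excluded from the Apdex, so it is not counted twice.
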
## Request Apdex SLO

The Apdex counters can be combined into a success ratio:
`gitlab_sli_rails_request_apdex_success_total` divided by
`gitlab_sli_rails_request_apdex_total`. The objective for
this ratio is defined in the service catalog per service. For this SLI to meet SLO,
the ratio recorded must be higher than:

- [Web: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/web.jsonnet#L19)
- [API: 0.995](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/api.jsonnet#L19)
- [Git: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/git.jsonnet#L22)

For example, for the web service, we want at least 99.8% of requests
to be faster than their target duration.

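As a rough illustration of the arithmetic, the sketch below computes the success ratio from two counter values and compares it with the objectives listed above. The helper names `apdex_ratio` and `meets_apdex_slo?` and the counter values are assumptions made for this example. In practice, the ratio is evaluated by the monitoring stack, not by application code.

```ruby
# SLO thresholds copied from the list above.
APDEX_SLO = { web: 0.998, api: 0.995, git: 0.998 }.freeze

# Returns the Apdex success ratio, or nil when no requests were measured.
def apdex_ratio(success_total, apdex_total)
  return nil if apdex_total.zero?

  success_total.fdiv(apdex_total)
end

# True when the recorded ratio is higher than the SLO for the service.
def meets_apdex_slo?(service, success_total, apdex_total)
  ratio = apdex_ratio(success_total, apdex_total)
  !ratio.nil? && ratio > APDEX_SLO.fetch(service)
end

# Hypothetical counter values for one time window.
web_ok  = meets_apdex_slo?(:web, 99_850, 100_000) # ratio 0.9985
web_bad = meets_apdex_slo?(:web, 99_700, 100_000) # ratio 0.997
api_ok  = meets_apdex_slo?(:api, 99_700, 100_000) # ratio 0.997
```

With these made-up numbers, a ratio of 0.997 satisfies the API objective but not the web objective, which shows why the same endpoint behavior can be acceptable on one fleet and alert on another.
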
We use these targets for alerting and service monitoring. Set durations with
these targets in mind, so that they do not cause alerts. The goal, however, is to
set the urgency to a target that satisfies our users.

Both successful measurements and unsuccessful ones affect the
error budget for stage groups.

## Adjusting request urgency

Not all endpoints perform the same type of work, so it is possible to
define different urgency levels for different endpoints. An endpoint with a
lower urgency can have a longer request duration than endpoints with high urgency.

Long-running requests are more expensive for our infrastructure. While serving
one request, the thread remains occupied for the duration of that request. The thread
can handle nothing else. Due to Ruby's Global VM Lock, the thread might keep the
lock and stall other requests handled by the same Puma worker
process. The request is, in fact, a noisy neighbor for other requests
handled by the worker. We cap the upper bound for a target duration at 5 seconds
for this reason.

## Decreasing the urgency (setting a higher target duration)

You can decrease the urgency on an existing endpoint on
a case-by-case basis. Take the following into account:

1. Apdex is about perceived performance. If a user is actively waiting
   for the result of a request, waiting 5 seconds might not be
   acceptable. However, if the endpoint is used by an automation
   requiring a lot of data, 5 seconds could be acceptable.

   A product manager can help to identify how an endpoint is used.

1. The workload for some endpoints can differ greatly
   depending on the parameters specified by the caller. The urgency
   needs to accommodate those differences. In some cases, you could
   define a separate [application SLI](index.md#defining-a-new-sli)
   for what the endpoint is doing.

   When the endpoints in certain cases turn into no-ops, making them
   very fast, we should ignore these fast requests when setting the
   target. For example, if the `MergeRequests::DraftsController` is
   hit for every merge request being viewed, but rarely renders
   anything, then we should pick the target that
   would still accommodate the endpoint performing work.

1. Consider the dependent resources consumed by the endpoint. If the endpoint
   loads a lot of data from Gitaly or the database, and this causes
   unsatisfactory performance, consider optimizing the
   way the data is loaded rather than increasing the target duration
   by lowering the urgency.

   In these cases, it might be appropriate to temporarily decrease
   urgency to make the endpoint meet SLO, if this is bearable for the
   infrastructure. In such cases, create a code comment linking to an issue.

   If the endpoint consumes a lot of CPU time, we should also consider
   this: these requests are the kind of noisy neighbors we
   should try to keep as short as possible.

1. Traffic characteristics should also be taken into account. If the
   traffic to the endpoint sometimes bursts, like CI traffic spinning up a
   big batch of jobs hitting the same endpoint, then having these
   endpoints take five seconds is unacceptable from an infrastructure point of
   view. We cannot scale up the fleet fast enough to accommodate
   the incoming slow requests alongside the regular traffic.

When lowering the urgency for an existing endpoint, please involve a
[Scalability team member](https://about.gitlab.com/handbook/engineering/infrastructure/team/scalability/#team-members)
in the review. We can use request rates and durations available in the
logs to come up with a recommendation. You can pick a threshold
using the same process as for
[increasing urgency](#increasing-urgency-setting-a-lower-target-duration),
picking a target duration that is higher than the duration at the percentile
matching the SLO for the service.

We shouldn't set the longest durations on endpoints in the merge
request that introduces them, because we don't yet have data to support
the decision.

## Increasing urgency (setting a lower target duration)

When increasing the urgency, we must make sure the endpoint
still meets SLO for the fleet that handles the request. You can use the
information in the logs to check:

1. Open [this table in Kibana](https://log.gprd.gitlab.net/goto/bbb6465c68eb83642269e64a467df3df).

1. The table loads information for the busiest endpoints by
   default. To speed up the response, add both:

   - A filter for `json.meta.caller_id.keyword`.
   - The identifier you're interested in, for example:

     ```ruby
     Projects::RawController#show
     ```

     or:

     ```plaintext
     GET /api/:version/projects/:id/snippets/:snippet_id/raw
     ```

1. Check the [appropriate percentile duration](#request-apdex-slo) for
   the service handling the endpoint. The overall duration should
   be lower than your intended target.

1. If the overall duration is below the intended target, check the peaks over time
   in [this graph](https://log.gprd.gitlab.net/goto/9319c4a402461d204d13f3a4924a89fc)
   in Kibana. Here, the percentile in question should not peak above
   the target duration we want to set.

Because decreasing a threshold too much could result in alerts for
Apdex degradation, please also involve a Scalability team member in
the merge request.

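The percentile check in the steps above can be sketched in plain Ruby, assuming you have exported a list of request durations (in seconds) for the endpoint. The helpers `percentile_duration` and `safe_to_set_target?` are hypothetical and use a simple nearest-rank percentile, which might differ slightly from the approximation Kibana reports. The sketch covers only the overall check, not the peaks over time.

```ruby
# Nearest-rank percentile: the smallest sample such that at least the
# given fraction of all samples is less than or equal to it.
# `fraction` is the SLO ratio, for example 0.998 for the web service.
def percentile_duration(durations, fraction)
  raise ArgumentError, 'no samples' if durations.empty?

  sorted = durations.sort
  # Round before taking the ceiling to absorb floating point noise.
  rank = (fraction * sorted.size).round(6).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical helper: true when the percentile matching the service SLO
# stays below the target duration we intend to set.
def safe_to_set_target?(durations, slo:, target:)
  percentile_duration(durations, slo) < target
end

# Illustrative sample: 1,000 requests, 997 fast ones and 3 slower ones.
durations = Array.new(997, 0.2) + [0.4, 0.45, 0.9]

medium_ok = safe_to_set_target?(durations, slo: 0.998, target: 0.5)
high_ok   = safe_to_set_target?(durations, slo: 0.998, target: 0.25)
```

In this made-up sample, the duration at the 99.8th percentile is 0.4 seconds, so a 0.5 second target would still meet the web SLO, while a 0.25 second target would not.
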
## How to adjust the urgency

You can specify the urgency in a similar way to how endpoints
[get a feature category](../feature_categorization/index.md). Endpoints without a
specific target use the default urgency: 1s duration. These configurations
are available:

| Urgency    | Duration in seconds | Notes                                  |
|------------|---------------------|----------------------------------------|
| `:high`    | 0.25s               |                                        |
| `:medium`  | 0.5s                |                                        |
| `:default` | 1s                  | The default when nothing is specified. |
| `:low`     | 5s                  |                                        |

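To make the table concrete, here is a minimal sketch of how an urgency could be resolved to a target duration and used to judge a single request. The constant `URGENCY_DURATIONS` and the helpers `target_duration_for` and `satisfactory?` are hypothetical names for this example and are not taken from the GitLab codebase. The durations are copied from the table.

```ruby
# Target durations in seconds, copied from the table above.
URGENCY_DURATIONS = {
  high: 0.25,
  medium: 0.5,
  default: 1.0,
  low: 5.0
}.freeze

# Hypothetical helper: resolves an urgency name to its target duration,
# falling back to :default when nothing is specified.
def target_duration_for(urgency = nil)
  URGENCY_DURATIONS.fetch(urgency || :default) do |name|
    raise ArgumentError, "unknown urgency: #{name.inspect}"
  end
end

# Hypothetical helper: a request satisfies the Apdex when it is successful
# (not a 5xx response) and faster than the target for its urgency.
def satisfactory?(status:, duration:, urgency: nil)
  status < 500 && duration < target_duration_for(urgency)
end

default_target = target_duration_for
medium_request = satisfactory?(status: 200, duration: 0.4, urgency: :medium)
high_request   = satisfactory?(status: 200, duration: 0.4, urgency: :high)
```

The same 0.4 second response is satisfactory for a `:medium` endpoint but not for a `:high` one, which is the trade-off to weigh when changing an endpoint's urgency.
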
### Rails controller

An urgency can be specified for all actions in a controller:

```ruby
class Boards::ListsController < ApplicationController
  urgency :high
end
```

To specify the urgency for only certain actions in a controller:

```ruby
class Boards::ListsController < ApplicationController
  urgency :high, [:index, :show]
end
```

A custom RSpec matcher is available to check an endpoint's request urgency in the controller specs:

```ruby
specify do
  expect(get(:index, params: request_params)).to have_request_urgency(:medium)
end
```

### Grape endpoints

To specify the urgency for an entire API class:

```ruby
module API
  class Issues < ::API::Base
    urgency :low
  end
end
```

To specify the urgency for only certain actions in an API class:

```ruby
module API
  class Issues < ::API::Base
    urgency :medium, [
      '/groups/:id/issues',
      '/groups/:id/issues_statistics'
    ]
  end
end
```

Or, we can specify the urgency per endpoint:

```ruby
get 'client/features', urgency: :low do
  # endpoint logic
end
```

The custom RSpec matcher also works in specs for Grape endpoints:

```ruby
specify do
  expect(get(api('/avatar'), params: { email: 'public@example.com' })).to have_request_urgency(:medium)
end
```

WARNING:
We can't specify the urgency at the namespace level. The directive is ignored when doing so.

### Error budget attribution and ownership

This SLI is used for service level monitoring. It feeds into the
[error budget for stage groups](../stage_group_observability/index.md#error-budget).

For more information, read the epic for
[defining custom SLIs and incorporating them into error budgets](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525).
The endpoints for the SLI feed into a group's error budget based on the
[feature category declared on them](../feature_categorization/index.md).

To know which endpoints are included for your group, you can see the
request rates on the
[group dashboard for your group](https://dashboards.gitlab.net/dashboards/f/stage-groups/stage-groups).
In the **Budget Attribution** row, the **Puma Apdex** log link shows you
how many requests are not meeting a 1s or 5s target.

For more information about the content of the dashboard, see
[Dashboards for stage groups](../stage_group_observability/index.md). For more information
about our exploration of the error budget itself, see
[issue 1365](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1365).