---
stage: Platforms
group: Scalability
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Rails request SLIs (service level indicators)

> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4.

NOTE:
This SLI is used for service monitoring, but not for [error budgets for stage groups](../stage_group_observability/index.md#error-budget)
by default. You can [opt in](#error-budget-attribution-and-ownership).

The request Apdex SLI and the error rate SLI are [SLIs defined in the application](index.md).

The request Apdex measures the duration of successful requests as an indicator for
application performance. This includes the REST and GraphQL API, and the
regular controller endpoints.

The error rate measures unsuccessful requests as an indicator for
server misbehavior. This includes the REST API, and the
regular controller endpoints.

1. `gitlab_sli_rails_request_apdex_total`: This counter gets
   incremented for every request that did not result in a response
   with a `5xx` status code. Excluding `5xx` responses ensures slow failures are not
   counted twice, because those requests are already counted in the error SLI.

1. `gitlab_sli_rails_request_apdex_success_total`: This counter gets
   incremented for every successful request that performed faster than
   the [defined target duration depending on the endpoint's urgency](#adjusting-request-urgency).

1. `gitlab_sli_rails_request_error_total`: This counter gets
   incremented for every request that resulted in a response
   with a `5xx` status code.

1. `gitlab_sli_rails_request_total`: This counter gets
   incremented for every request.

These counters are labeled with:

1. `endpoint_id`: The identification of the Rails Controller or the
   Grape-API endpoint.

1. `feature_category`: The feature category specified for that
   controller or API endpoint.

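As a rough illustration of how the four counters relate, the following sketch decides which counters a single request would increment. It is not the application's implementation: `counters_for` and `TARGET_DURATIONS` are hypothetical names, the durations are copied from the table in [How to adjust the urgency](#how-to-adjust-the-urgency), and the strict "faster than" comparison is an assumption based on the descriptions above.

```ruby
# Target durations in seconds per urgency (copied from the urgency table in this document).
TARGET_DURATIONS = { high: 0.25, medium: 0.5, default: 1.0, low: 5.0 }.freeze

# Hypothetical helper: returns the names of the counters that one request
# would increment, given its response status, duration, and endpoint urgency.
def counters_for(status:, duration_s:, urgency: :default)
  # Every request is counted in the total.
  counters = ['gitlab_sli_rails_request_total']

  if status >= 500
    # 5xx responses count towards the error SLI only, so a slow failure
    # is not also counted as an Apdex miss.
    counters << 'gitlab_sli_rails_request_error_total'
  else
    counters << 'gitlab_sli_rails_request_apdex_total'
    # Assumption: "faster than the target" means strictly below it.
    if duration_s < TARGET_DURATIONS.fetch(urgency)
      counters << 'gitlab_sli_rails_request_apdex_success_total'
    end
  end

  counters
end

p counters_for(status: 200, duration_s: 0.3)                 # met the default 1s target
p counters_for(status: 200, duration_s: 0.3, urgency: :high) # missed the 0.25s target
p counters_for(status: 503, duration_s: 6.0)                 # error, not part of Apdex
```

In this sketch, a slow `5xx` response only affects the error rate, which mirrors the "not counted twice" behavior described for `gitlab_sli_rails_request_apdex_total`.
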
## Request Apdex SLO

These counters can be combined into a success ratio. The objective for
this ratio is defined in the service catalog per service. For this SLI to meet the SLO,
the ratio recorded must be higher than:

- [Web: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/web.jsonnet#L19)
- [API: 0.995](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/api.jsonnet#L19)
- [Git: 0.998](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/services/git.jsonnet#L22)

For example: for the web-service, we want at least 99.8% of requests
to be faster than their target duration.
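
To make the arithmetic concrete, here is a minimal sketch of the success ratio and the SLO comparison. The helper names (`apdex_ratio`, `meets_slo?`) and the request counts are made up for illustration. The thresholds are the ones listed above, and treating a window without traffic as meeting the target is an assumption.

```ruby
# SLO thresholds per service, as listed in this section.
APDEX_SLO = { web: 0.998, api: 0.995, git: 0.998 }.freeze

# Ratio of measured requests that met their target duration.
# Both arguments are counter increases over the same time window:
#   success_total -> gitlab_sli_rails_request_apdex_success_total
#   apdex_total   -> gitlab_sli_rails_request_apdex_total
def apdex_ratio(success_total, apdex_total)
  # Assumption: a window without traffic is treated as meeting the target.
  return 1.0 if apdex_total.zero?

  success_total.to_f / apdex_total
end

# The ratio must be higher than the objective for the service.
def meets_slo?(service, success_total, apdex_total)
  apdex_ratio(success_total, apdex_total) > APDEX_SLO.fetch(service)
end

# Made-up numbers: 100,000 measured web requests.
puts meets_slo?(:web, 99_850, 100_000) # 150 slow requests, ratio 0.9985
puts meets_slo?(:web, 99_700, 100_000) # 300 slow requests, ratio 0.997
```

With these numbers, 150 slow requests out of 100,000 stay above the 0.998 objective for the web service, while 300 slow requests fall below it.
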

We use these targets for alerting and service monitoring. Set durations taking
these targets into account, so we don't cause alerts. The goal, however, is to
set the urgency to a target that satisfies our users.

Both successful measurements and unsuccessful ones affect the
error budget for stage groups.

## Adjusting request urgency

Not all endpoints perform the same type of work, so it is possible to
define different urgency levels for different endpoints. An endpoint with a
lower urgency can have a longer request duration than endpoints with high urgency.

Long-running requests are more expensive for our infrastructure. While serving
one request, the thread remains occupied for the duration of that request. The thread
can handle nothing else. Due to Ruby's Global VM Lock, the thread might keep the
lock and stall other requests handled by the same Puma worker
process. The request is, in fact, a noisy neighbor for other requests
handled by the worker. We cap the upper bound for a target duration at 5 seconds
for this reason.

## Decreasing the urgency (setting a higher target duration)

You can decrease the urgency on an existing endpoint on
a case-by-case basis. Take the following into account:

1. Apdex is about perceived performance. If a user is actively waiting
   for the result of a request, waiting 5 seconds might not be
   acceptable. However, if the endpoint is used by an automation
   requiring a lot of data, 5 seconds could be acceptable.

   A product manager can help to identify how an endpoint is used.

1. The workload for some endpoints can sometimes differ greatly
   depending on the parameters specified by the caller. The urgency
   needs to accommodate those differences. In some cases, you could
   define a separate [application SLI](index.md#defining-a-new-sli)
   for what the endpoint is doing.

   When an endpoint turns into a no-op in certain cases, making those
   requests very fast, we should ignore these fast requests when setting the
   target. For example, if the `MergeRequests::DraftsController` is
   hit for every merge request being viewed, but rarely renders
   anything, then we should pick the target that
   would still accommodate the endpoint performing work.

1. Consider the dependent resources consumed by the endpoint. If the endpoint
   loads a lot of data from Gitaly or the database, and this causes
   unsatisfactory performance, consider optimizing the
   way the data is loaded rather than increasing the target duration
   by lowering the urgency.

   In these cases, it might be appropriate to temporarily decrease
   urgency to make the endpoint meet SLO, if this is bearable for the
   infrastructure. In such cases, create a code comment linking to an issue.

   If the endpoint consumes a lot of CPU time, we should also consider
   this: these requests are the noisy neighbors we
   should try to keep as short as possible.

1. Traffic characteristics should also be taken into account. If the
   traffic to the endpoint sometimes bursts, like CI traffic spinning up a
   big batch of jobs hitting the same endpoint, then having these
   endpoints take five seconds is unacceptable from an infrastructure point of
   view. We cannot scale up the fleet fast enough to accommodate
   the incoming slow requests alongside the regular traffic.

When lowering the urgency for an existing endpoint, involve a
[Scalability team member](https://about.gitlab.com/handbook/engineering/infrastructure/team/scalability/#team-members)
in the review. We can use request rates and durations available in the
logs to come up with a recommendation. You can pick a threshold
using the same process as for
[increasing urgency](#increasing-urgency-setting-a-lower-target-duration),
picking a duration that is higher than the SLO for the service.

We shouldn't set the longest durations on endpoints in the merge
request that introduces them, because we don't yet have data to support
the decision.
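
As an illustration of the "code comment linking to an issue" advice above, a temporary decrease could be annotated as in the following sketch. The `ApplicationController` here is a small stand-in so the example runs on its own, `Projects::ExportsController` is a hypothetical controller, and the issue link is deliberately left as a placeholder.

```ruby
# Stand-in for the Rails base class so this sketch is self-contained.
# In the application, `urgency` is provided by the real ApplicationController.
class ApplicationController
  # Records declarations as { action (or :all) => urgency }.
  def self.urgency(level, actions = [:all])
    declared_urgencies.merge!(actions.to_h { |action| [action, level] })
  end

  def self.declared_urgencies
    @declared_urgencies ||= {}
  end
end

module Projects; end

# Hypothetical controller, used only for illustration.
class Projects::ExportsController < ApplicationController
  # Temporarily lowered urgency: this action loads a lot of data and
  # does not yet meet the default target. Restore the default once the
  # data loading is optimized.
  # Follow-up issue: <link to the tracking issue>
  urgency :low, [:show]
end

p Projects::ExportsController.declared_urgencies
```

The comment keeps the reason and the follow-up work next to the declaration, so the lower urgency is easier to revisit later.
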

## Increasing urgency (setting a lower target duration)

When increasing the urgency, we must make sure the endpoint
still meets the SLO for the fleet that handles the request. You can use the
information in the logs to check:

1. Open [this table in Kibana](https://log.gprd.gitlab.net/goto/bbb6465c68eb83642269e64a467df3df).

1. The table loads information for the busiest endpoints by
   default. To speed the response, add both:

   - A filter for `json.meta.caller_id.keyword`.
   - The identifier you're interested in, for example:

     ```ruby
     Projects::RawController#show
     ```

     or:

     ```plaintext
     GET /api/:version/projects/:id/snippets/:snippet_id/raw
     ```

1. Check the [appropriate percentile duration](#request-apdex-slo) for
   the service handling the endpoint. The overall duration should
   be lower than your intended target.

1. If the overall duration is below the intended target, check the peaks over time
   in [this graph](https://log.gprd.gitlab.net/goto/9319c4a402461d204d13f3a4924a89fc)
   in Kibana. Here, the percentile in question should not peak above
   the target duration we want to set.

Because decreasing a threshold too much could result in alerts for
Apdex degradation, also involve a Scalability team member in
the merge request.
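
The percentile check in the steps above can be sketched as follows. This only mirrors the "overall duration" step, not the peaks-over-time check, and it is not how Kibana computes percentiles: the nearest-rank `percentile` helper, `feasible_urgencies`, and the sample durations are all assumptions for illustration. The target durations are copied from the table in [How to adjust the urgency](#how-to-adjust-the-urgency).

```ruby
# Target durations in seconds per urgency (copied from the urgency table in this document).
TARGETS = { high: 0.25, medium: 0.5, default: 1.0, low: 5.0 }.freeze

# Nearest-rank percentile of a list of durations in seconds.
def percentile(durations, pct)
  sorted = durations.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Urgency levels whose target duration is above the observed percentile,
# most urgent first.
def feasible_urgencies(durations, pct)
  observed = percentile(durations, pct)
  TARGETS.select { |_urgency, target| observed < target }.keys
end

# Made-up sample of 1,000 request durations: mostly fast, a few slow.
SAMPLE_DURATIONS = (Array.new(990, 0.12) + Array.new(9, 0.4) + [0.9]).freeze

# The web service objective is 0.998, so look at the 99.8th percentile.
p percentile(SAMPLE_DURATIONS, 99.8)
p feasible_urgencies(SAMPLE_DURATIONS, 99.8)
```

For this sample, the 99.8th percentile is 0.4 seconds, so `:high` (0.25 seconds) would not be met, while `:medium` and lower urgencies would.
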

## How to adjust the urgency

You can specify urgency similar to how endpoints
[get a feature category](../feature_categorization/index.md). Endpoints without a
specific target use the default urgency, which has a target duration of 1 second. These configurations
are available:

| Urgency    | Duration in seconds | Notes                                  |
|------------|---------------------|----------------------------------------|
| `:high`    | 0.25                |                                        |
| `:medium`  | 0.5                 |                                        |
| `:default` | 1                   | The default when nothing is specified. |
| `:low`     | 5                   |                                        |

### Rails controller

An urgency can be specified for all actions in a controller:

```ruby
class Boards::ListsController < ApplicationController
  urgency :high
end
```

To specify the urgency for specific actions in a controller:

```ruby
class Boards::ListsController < ApplicationController
  urgency :high, [:index, :show]
end
```

A custom RSpec matcher is available to check an endpoint's request urgency in the controller specs:

```ruby
specify do
  expect(get(:index, params: request_params)).to have_request_urgency(:medium)
end
```

### Grape endpoints

To specify the urgency for an entire API class:

```ruby
module API
  class Issues < ::API::Base
    urgency :low
  end
end
```

To specify the urgency for specific endpoints in an API class:

```ruby
module API
  class Issues < ::API::Base
    urgency :medium, [
      '/groups/:id/issues',
      '/groups/:id/issues_statistics'
    ]
  end
end
```

Or, we can specify the urgency per endpoint:

```ruby
get 'client/features', urgency: :low do
  # endpoint logic
end
```

The custom RSpec matcher also works in Grape endpoint specs:

```ruby
specify do
  expect(get(api('/avatar'), params: { email: 'public@example.com' })).to have_request_urgency(:medium)
end
```

WARNING:
We can't specify the urgency at the namespace level. The directive is ignored if you do so.

### Error budget attribution and ownership

This SLI is used for service level monitoring. It feeds into the
[error budget for stage groups](../stage_group_observability/index.md#error-budget).

For more information, read the epic for
[defining custom SLIs and incorporating them into error budgets](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525).

The endpoints for the SLI feed into a group's error budget based on the
[feature category declared on it](../feature_categorization/index.md).

To know which endpoints are included for your group, you can see the
request rates on the
[group dashboard for your group](https://dashboards.gitlab.net/dashboards/f/stage-groups/stage-groups).
In the **Budget Attribution** row, the **Puma Apdex** log link shows you
how many requests are not meeting a 1s or 5s target.

For more information about the content of the dashboard, see
[Dashboards for stage groups](../stage_group_observability/index.md). For more information
about our exploration of the error budget itself, see
[issue 1365](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1365).