The current architecture has several points of coupling between concerns.
Coupling reduces opportunities for abstraction (for example, community-supported
plugins) and increases complexity, making the code harder to understand,
test, maintain, and extend.
A primary design decision will be which concerns to externalize to the plugin
and which should remain with the runner system. The current implementation
already has several internal abstractions that could serve as cut points for a
new abstraction.
For example, the [`Build`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/build.go#L125)
type uses the [`GetExecutorProvider`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L171)
function to look up an executor provider, dispatching on an executor name string.
Various executor types register with the system by being imported and calling
a registration function during initialization. Here the abstractions are the [`ExecutorProvider`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L80)
and [`Executor`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/executor.go#L59)
interfaces.
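The overall pattern is a small registry keyed by executor name. The following
self-contained sketch illustrates it; the names mirror the linked interfaces, but
the methods are trimmed down, so treat it as an illustration rather than the real API.

```go
package common

import "fmt"

// Executor is a concrete environment prepared and used for a single job
// (illustrative subset of the real interface).
type Executor interface {
	Prepare() error
	Run() error
	Cleanup()
}

// ExecutorProvider produces executors on demand.
type ExecutorProvider interface {
	Create() Executor
}

var executorProviders = map[string]ExecutorProvider{}

// RegisterExecutorProvider is called from an executor package's init
// function, so importing the package is enough to make it available.
func RegisterExecutorProvider(name string, provider ExecutorProvider) {
	executorProviders[name] = provider
}

// GetExecutorProvider dispatches on the executor name string from the
// runner configuration (for example "docker" or "shell").
func GetExecutorProvider(name string) (ExecutorProvider, error) {
	provider, ok := executorProviders[name]
	if !ok {
		return nil, fmt.Errorf("executor %q is not registered", name)
	}
	return provider, nil
}
```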
Within the `docker+machine` executor the [`machineExecutor`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/executors/docker/machine/machine.go#L19)
type has a [`Machine`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/helpers/docker/machine.go#L7)
interface which it uses to acquire a VM during the common [`Prepare`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/executors/docker/machine/machine.go#L71)
phase. This abstraction primarily creates, accesses and deletes VMs.
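Reduced to the operations mentioned above (create, access, delete), the shape of
that abstraction is roughly the following. This is an illustrative subset with
simplified signatures, not the full interface.

```go
package machine

// Machine is an illustrative subset of the VM lifecycle abstraction used
// by the docker+machine executor; method names and signatures are
// simplified for this sketch.
type Machine interface {
	// Create provisions a new VM with the given Docker Machine driver.
	Create(driver, name string, opts ...string) error
	// CanConnect reports whether the VM is reachable and ready for use.
	CanConnect(name string) bool
	// Credentials returns the connection details needed to run a job on
	// the VM.
	Credentials(name string) (Credentials, error)
	// Remove deletes the VM.
	Remove(name string) error
}

// Credentials is a placeholder for the per-VM connection details type.
type Credentials struct {
	Host      string
	CertPath  string
	TLSVerify bool
}
```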
There is no current abstraction for the VM autoscaling logic. It is tightly
coupled with the VM lifecycle and job routing logic. Creating idle capacity
happens as a side-effect of calling [`Acquire`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/executors/docker/machine/provider.go#L449) on the `machineProvider` while binding a job to a VM.
There is also no current abstraction for in-VM job execution. VM-specific
commands are generated by the Runner Manager using the [`GenerateShellScript`](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/build.go#L336)
function and [injected](https://gitlab.com/gitlab-org/gitlab-runner/-/blob/267f40d871cd260dd063f7fbd36a921fedc62241/common/build.go#L373)
into the VM as the manager drives the job execution stages.
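To make that split concrete, the toy sketch below shows a manager generating a
shell script per job stage outside the VM and then injecting it for execution.
The stage names and the injection step are placeholders; the real stage list and
transport are defined by GitLab Runner, not by this sketch.

```go
package main

import "fmt"

// stage is a placeholder for a job execution stage driven by the manager.
type stage string

const (
	prepareStage stage = "prepare_script"
	buildStage   stage = "build_script"
	cleanupStage stage = "cleanup"
)

// generateShellScript stands in for the manager-side script generation.
func generateShellScript(s stage) string {
	return fmt.Sprintf("#!/bin/sh\necho running %s\n", s)
}

// injectAndRun stands in for streaming the script into the VM and
// executing it there; here it only prints the script locally.
func injectAndRun(script string) error {
	fmt.Print(script)
	return nil
}

func main() {
	// The manager drives the job execution stages from outside the VM.
	for _, s := range []stage{prepareStage, buildStage, cleanupStage} {
		if err := injectAndRun(generateShellScript(s)); err != nil {
			return
		}
	}
}
```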
### Design principles
Our goal is to design a GitLab Runner plugin system interface that is flexible
and simple for the wider community to consume. As we cannot build plugins for
all cloud platforms, we want to ensure a low entry barrier for anyone who needs
to develop a plugin. We want to allow everyone to contribute.
To achieve this goal, we will follow a few critical design principles. These
principles will guide our development process for the new plugin system
abstraction.
#### General high-level principles
- Design the new auto-scaling architecture to provide more choice and
flexibility in the future, instead of imposing new constraints.
- Design the new auto-scaling architecture so that we can experiment with running
multiple jobs in parallel on a single machine.
- Design the new provisioning architecture to replace Docker Machine in a way
that the wider community can easily build on top of the new abstractions.
- The new auto-scaling method should become a core component of the GitLab Runner product so that
we can simplify maintenance and use the same tooling, test configuration, and Go language
setup as we do in our other main products.
- It should support multiple job execution environments, not only Docker containers
on the Linux operating system. The best design would deliver auto-scaling as a
feature wrapped around our current executors, such as Docker or Shell.
#### Principles for the new plugin system
- Make the entry barrier for writing a new plugin low.
- Developing a new plugin should be simple and require only basic knowledge of
a programming language and a cloud provider's API.
- Strive for a balance between the plugin system's simplicity and flexibility.
These are not mutually exclusive.
- Abstract away as many technical details as possible but do not hide them completely.
- Build an abstraction that serves our community well but allows us to ship it quickly.
- Invest in a flexible solution, avoid one-way-door decisions, foster iteration.
- When in doubt, err on the side of making things simpler for the wider community.
- Limit coupling between concerns to make the system simpler and more extensible.
- Concerns should live on one side of the plugin boundary or the other, not both;
splitting a concern across the boundary duplicates effort and increases coupling.
#### The most important technical details
- Favor gRPC communication between a plugin and GitLab Runner.
- Make it possible to version the communication interface and to support multiple versions.
- Make Go a primary language for writing plugins but accept other languages too.
- The autoscaling mechanism should be fully owned by GitLab.
Cloud provider autoscalers don't know which VM to delete when scaling down, so
they make sub-optimal decisions. Rather than teaching every autoscaler about GitLab
jobs, we prefer a single, GitLab-owned autoscaler that lives outside the plugin.
Owning it also ensures that we can shape the future of the mechanism and make decisions
that fit our needs and requirements.
## Plugin boundary proposals
The following are proposals for where to draw the plugin boundary. We will evaluate
these proposals, and others, against the design principles and technical constraints
described above. The terms below are used throughout:
- **[GitLab Runner](../../../development/documentation/styleguide/word_list.md#gitlab-runner)** - the software application that you can choose to install and manage, whose source code is hosted at `gitlab.com/gitlab-org/gitlab-runner`.
- **[runners](../../../development/documentation/styleguide/word_list.md#runner-runners)** - the runner is the agent that's responsible for running GitLab CI/CD jobs in an environment and reporting the results to a GitLab instance. It (1) retrieves jobs from GitLab, (2) configures a local or remote build environment, and (3) executes jobs within the provisioned environment, passing along log data and status updates to GitLab.
- **runner manager** - the runner process is often referred to as the `Runner Manager` because it manages multiple runners, which are the `[[runners]]` workers defined in the runner's `config.toml` file.
- **executor** - a concrete environment which can be prepared and used to run a job. A new executor is created for each job.
- **executor provider** - an implementation capable of providing executors on demand. Executor providers are registered on import and initialized once when a runner starts up.
- **custom executor** - works as an interface between GitLab Runner and a set of binaries or shell scripts with environment variable inputs that enable executing CI jobs in any host computing environment. New custom executors can be added to the system without making any changes to the GitLab Runner codebase.
- **custom executor provider** - a new abstraction, proposed under the custom provider heading in the plugin boundary proposal section above, which allows new executor providers to be created without modifying the GitLab Runner codebase. The protocol could be similar to custom executors or done over gRPC. This abstraction places all the mechanics of producing executors within the plugin, delegating autoscaling and lifecycle management concerns to each implementation.
- **taskscaler** - a new library, proposed under the taskscaler provider heading in the plugin boundary proposal section above, which is parameterized with a concrete executor provider and a fleeting provider. Taskscaler is responsible for the autoscaling concern and can be used to autoscale any executor provider using any VM shape. Taskscaler is also responsible for the runner-specific aspects of VM lifecycle and keeps track of how many jobs are using a given VM and how many times a VM has been used.
- **fleeting** - a new library proposed along with taskscaler which provides abstractions for cloud provider VMs.
- **fleeting instance group** - the abstraction that fleeting uses to represent a pool of like VMs. This would represent a GCP Instance Group Manager (IGM) or an AWS Auto Scaling Group (ASG), without their native autoscaling. Instance groups can be increased or decreased, and can provide connection details for a specific VM (see the sketch after this list).
- **fleeting plugin** - a concrete implementation of a fleeting instance group representing a specific IGM or ASG (when initialized). There will be N of these, one for each provider, each in its own project. We will own and maintain the core ones but some will be community supported. A new fleeting plugin can be created without making any changes to the runner, taskscaler or fleeting code bases. This makes it analogous to the custom executor provider in terms of self-service and decoupling, but along a different line of concerns.
- **fleeting plugin Google Compute** - the fleeting plugin which creates GCP instances. This lives in a separate project from the fleeting and taskscaler libraries.
- **fleeting plugin AWS** - the fleeting plugin which creates AWS instances. This lives in a separate project from the fleeting and taskscaler libraries.
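To illustrate the fleeting boundary described in this glossary, an instance group
abstraction could look roughly like the sketch below. This is a hypothetical shape
derived only from the operations listed above (increase, decrease, connection
details); it is not the actual fleeting API.

```go
package fleeting

import "context"

// InstanceGroup is a hypothetical sketch of the abstraction fleeting uses
// for a pool of like VMs (for example a GCP IGM or an AWS ASG). Method
// names and signatures are illustrative only.
type InstanceGroup interface {
	// Increase asks the cloud provider to add n instances to the group.
	Increase(ctx context.Context, n int) error
	// Decrease removes the specific instances identified by ids, so the
	// GitLab-owned autoscaler decides which VMs to delete.
	Decrease(ctx context.Context, ids []string) error
	// ConnectInfo returns the details needed to reach a specific VM.
	ConnectInfo(ctx context.Context, id string) (ConnectInfo, error)
}

// ConnectInfo is a placeholder for per-VM connection details.
type ConnectInfo struct {
	ExternalAddr string
	InternalAddr string
	Username     string
	Key          []byte
}
```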