---
status: proposed
creation-date: "2023-03-30"
authors: [ "@pks-gitlab" ]
coach: [ ]
approvers: [ ]
owning-stage: "~devops::systems"
participating-stages: [ "~devops::create" ]
---
# Iterate on the design of object pools
## Summary
Forking repositories is at the heart of many modern workflows for projects
hosted in GitLab. As most of the objects between a fork and its upstream project
will typically be the same, this opens up potential for optimizations:
- Creating forks can theoretically be lightning fast if we reuse large parts
of the upstream repository.
- We can save on storage space by deduplicating objects which are shared.
This architecture is currently implemented with object pools, which hold the
objects of the primary repository. But the design of object pools has grown
organically and is nowadays showing its limits.
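Concretely, the deduplication builds on Git's "alternates" mechanism: a member
repository's `objects/info/alternates` file points at the pool's object
directory, and Git then looks up objects there as well. A minimal sketch of how
such a link is established (all paths are made up for illustration):

```go
package objectpool

import (
	"os"
	"path/filepath"
)

// linkToPool makes the repository at repoPath borrow objects from the pool
// at poolPath by writing the pool's object directory into the member's
// objects/info/alternates file.
func linkToPool(repoPath, poolPath string) error {
	infoDir := filepath.Join(repoPath, "objects", "info")
	if err := os.MkdirAll(infoDir, 0o755); err != nil {
		return err
	}
	// The file lists one object directory per line; relative paths are
	// resolved from the member's own objects/ directory.
	content := filepath.Join(poolPath, "objects") + "\n"
	return os.WriteFile(filepath.Join(infoDir, "alternates"), []byte(content), 0o644)
}
```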
This blueprint explores how we can iterate on the design of object pools to fix
long-standing issues with it. Furthermore, the intent is to arrive at a design
that lets us iterate more readily on the exact implementation details of object
pools.
## Motivation
The current design of object pools is showing scalability problems in various
ways. For the most part, the problems come from the fact that object pools have
grown organically and that we learned as we went.
It is proving hard to fix the overall design of object pools because there is no
clear ownership. While Gitaly provides the low-level building blocks to make
them work, it does not have enough control over them to be able to iterate on
their implementation details.
There are thus two major goals: taking ownership of object pools so that it
becomes easier to iterate on the design, and fixing scalability issues once we
can iterate.
### Lifecycle ownership
While Gitaly provides the interfaces to manage object pools, their actual
lifecycle is controlled by the client. A typical lifecycle of an object pool
looks as follows (the resulting call sequence is sketched in code after the
list):
1. An object pool is created via `CreateObjectPool()`. The caller provides the
path where the object pool shall be created as well as the origin repository
from which the object pool shall be created.
1. The origin repository needs to be linked to the object pool explicitly by
calling `LinkRepositoryToObjectPool()`.
1. The object pool needs to be regularly updated via `FetchIntoObjectPool()`
that fetches all changes from the primary pool member into the object pool.
1. To create forks, the client needs to call `CreateFork()` followed by
`LinkRepositoryToObjectPool()`.
1. Repositories of forks are unlinked by calling `DisconnectGitAlternates()`.
This will reduplicate objects.
1. The object pool is deleted via `DeleteObjectPool()`.
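To illustrate how much of this the client has to orchestrate, here is the call
sequence in Go against a stand-in interface; the method names come from the
RPCs above, while all signatures are assumptions rather than the actual Gitaly
API:

```go
package objectpool

import "context"

// PoolClient is a stand-in for the Gitaly RPC surface described above;
// the signatures are illustrative only.
type PoolClient interface {
	CreateObjectPool(ctx context.Context, pool, origin string) error
	LinkRepositoryToObjectPool(ctx context.Context, repo, pool string) error
	FetchIntoObjectPool(ctx context.Context, origin, pool string) error
	CreateFork(ctx context.Context, source, fork string) error
	DisconnectGitAlternates(ctx context.Context, repo string) error
	DeleteObjectPool(ctx context.Context, pool string) error
}

// forkWithPool spells out every step a client performs today to create a
// deduplicated fork. FetchIntoObjectPool() must additionally be called on
// a regular schedule to keep the pool up to date.
func forkWithPool(ctx context.Context, c PoolClient, upstream, fork, pool string) error {
	if err := c.CreateObjectPool(ctx, pool, upstream); err != nil {
		return err
	}
	if err := c.LinkRepositoryToObjectPool(ctx, upstream, pool); err != nil {
		return err
	}
	if err := c.FetchIntoObjectPool(ctx, upstream, pool); err != nil {
		return err
	}
	if err := c.CreateFork(ctx, upstream, fork); err != nil {
		return err
	}
	return c.LinkRepositoryToObjectPool(ctx, fork, pool)
}
```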
This lifecycle is complex and leaks a lot of implementation details to the
caller. This was originally done in part to give the Rails side control and
management over Git object visibility. GitLab project visibility rules are
complex and not a Gitaly concern. By exposing these details Rails can control
when pool membership links are created and broken. It is not clear at the
current point in time how the complete system works, and its limits are not
explicitly documented.
In addition to the complexity of the lifecycle we also have multiple sources of
truth for pool membership. Gitaly never tracks the set of members of a pool
repository but can only tell for a specific repository that it is part of said
pool. Consequently, Rails is forced to maintain this information in a database,
but it is hard to keep that information from becoming stale.
### Repository maintenance
Related to the lifecycle ownership issues is the issue of repository
maintenance. As mentioned, keeping an object pool up to date requires regular
calls to `FetchIntoObjectPool()`. This is leaking implementation details to the
client, but was done to give the client control over syncing the primary
repository with its object pool. With this control, private repositories can be
prevented from syncing and consequently leaking objects to other repositories in
the fork network.
We have had good success with moving repository maintenance into Gitaly so that
clients do not need to know about on-disk details. Ideally, we would do the same
for repositories that are the primary member of an object pool: if we optimize
its on-disk state, we will also automatically update the object pool.
There are two issues that keep us from doing so:
- Gitaly does not know about the relationship between an object pool and its
members.
- Updating object pools is expensive.
By making Gitaly the single source of truth for object pool memberships we would
be in a position to fix both issues.
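To make the intended behavior concrete, the following sketch shows how
repository maintenance could cascade into the pool once Gitaly tracks
memberships; all types and helpers are hypothetical:

```go
package objectpool

import "context"

// Membership is hypothetical state that Gitaly would own once it is the
// single source of truth for pool memberships.
type Membership struct {
	Pool      string
	IsPrimary bool
}

// optimizeWithPool runs normal housekeeping and, if the repository is the
// primary member of a pool, transparently updates the pool as well.
func optimizeWithPool(
	ctx context.Context,
	repo string,
	lookupMembership func(repo string) (Membership, bool),
	optimize func(ctx context.Context, repo string) error,
	updatePool func(ctx context.Context, pool, origin string) error,
) error {
	if err := optimize(ctx, repo); err != nil {
		return err
	}
	// Only the primary member ever feeds the pool; secondary members must
	// not, so that forks cannot sneak objects into the upstream project.
	if m, ok := lookupMembership(repo); ok && m.IsPrimary {
		return updatePool(ctx, m.Pool, repo)
	}
	return nil
}
```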
### Fast forking
In the current implementation, Rails first invokes `CreateFork()` which results
in a complete `git-clone(1)` being performed to generate the fork repository.
This is followed by `LinkRepositoryToObjectPool()` to link the fork with the
object pool. It is not until housekeeping is performed on the fork repository
that objects are deduplicated. This is not only leaking implementation details
to clients, but it also keeps us from reaping the full potential benefit of
object pools.
In particular, creating forks is a lot slower than it could be since a clone is
always performed before linking. If the steps of creating the fork and linking
the fork to the pool repository were unified, the initial clone could be
avoided.
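As an illustration of what a unified step could look like on disk, a fork can
be cloned while borrowing the pool's objects via `git clone --reference`, which
records the pool in the fork's alternates file instead of copying objects. This
is one possible mechanism, not necessarily the one Gitaly would implement:

```go
package objectpool

import (
	"context"
	"os/exec"
)

// createLinkedFork clones the upstream repository into forkPath while
// borrowing all objects that already exist in the pool: `--reference`
// writes the pool into objects/info/alternates up front, so shared
// objects are never copied.
func createLinkedFork(ctx context.Context, upstream, pool, forkPath string) error {
	cmd := exec.CommandContext(ctx, "git", "clone", "--bare", "--reference", pool, upstream, forkPath)
	return cmd.Run()
}
```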
### Clustered object pools
The development of Gitaly Cluster and object pools overlapped, and consequently
they are known to not work well together. Praefect neither ensures that
repositories with object pools have their object pools present on all nodes,
nor does it ensure that object pools are in a known state. If object pools work
at all, they work by chance.
The current state has led to cases where object pools were missing or had
different contents per node. This can result in inconsistently observed state in
object pool members and writes that depend on the object pool's contents to
fail.
One way object pools might be handled for clustered Gitaly could be to have the
pool repositories duplicated on nodes that contain repositories dependent on
them. This would allow members of a fork network to exist on different nodes. To
make this work, repository replication would have to be aware of object pools
and know when it needs to duplicate them onto a particular node.
## Requirements
There are a set of requirements and invariants that must be given for any
particular solution.
### Private upstream repositories should not leak objects to forks
When a project has a visibility setting that is not public, the objects in the
repository should not be fetched into an object pool. An object pool should only
ever contain objects from the upstream repository that were at one point public.
This prevents private upstream repositories from having objects leaked to forks
through a shared object pool.
### Forks cannot sneak objects into upstream projects
It should not be possible to make objects uploaded in a fork repository
accessible in the upstream repository via a shared object pool. Otherwise
potentially unauthorized users would be able to "sneak in" objects into
repositories by simply forking them.
Besides leading to confusion, this could also serve as a mechanism to corrupt
upstream repositories by introducing objects that are known to be broken.
### Object pool lifetime exceeds upstream repository lifetime
If the upstream repository gets deleted, its object pool should remain in place
to provide continued deduplication of shared objects between the other
repositories in the fork network. Thus it can be said that the lifetime of the
object pool is longer than the lifetime of the upstream repository. An object
pool should only be deleted if there are no longer any repositories referencing
it.
### Object lifetime
By deduplicating objects in a fork network, repositories become dependent on the
object pool. Missing objects in the pooled repository could lead to corruption
of repositories in the fork network. Therefore, objects in the pooled repository
must continue to exist as long as there are repositories referencing them.
Without a mechanism to accurately determine whether a pooled object is
referenced by one or more repositories, all objects in the pooled repository
must remain. Only when there are no repositories referencing the object pool
can the pooled repository, and therefore all its objects, be removed.
### Object sharing
An object that is deduplicated will become accessible from all forks of a
particular repository, even if it has never been reachable in any of the forks.
The consequence is that any write to an object pool immediately influences all
of its members.
We need to be mindful of this property when repositories connected to an object
pool are replicated. As the user-observable state should be the same on all
replicas, we need to ensure that both the repository and its object pool are
consistent across the different nodes.
## Proposal
In the current design, management of object pools mostly happens on the client
side, as clients need to manage their complete lifecycle. This requires Rails to
store the object pool relationships in the Rails database, perform fine-grained
management of every single step of an object pool's life, and perform periodic
Sidekiq jobs to enforce state by calling idempotent Gitaly RPCs. This design
significantly increases complexity of an already-complex mechanism.
Instead of handling the full lifecycle of object pools on the client side, this
document proposes to encapsulate the object pool lifecycle management inside of
Gitaly. Rather than performing low-level actions to maintain object pools,
clients would only need to tell Gitaly about updated relationships between a
repository and its object pool.
This brings us multiple advantages:
- The inherent complexity of the lifecycle management is encapsulated in a
single place, namely Gitaly.
- Gitaly is in a better position to iterate on the low-level technical design of
object pools in case we find a better solution than "alternates" in the
future.
- We can ensure better interplay between Gitaly Cluster, object pools and
repository housekeeping.
- Gitaly becomes the single source of truth for object pool relationships and
can thus start to manage them better.
Overall, the goal is to raise the abstraction level so that clients need to
worry less about the technical details while Gitaly is in a better position to
iterate on them.
### Move lifecycle management of pools into Gitaly
The lifecycle management of object pools is leaking too many details to the
client, and by doing so makes things both hard to understand and inefficient.
The current solution relies on a set of fine-grained RPCs that manage the
relationship between repositories and their object pools. Instead, we are aiming
for a simplified approach that only exposes the high-level concept of forks to
the client. This will happen in the form of three RPCs (sketched in code after
the list):
- `ForkRepository()` will create a fork of a given repository. If the upstream
repository does not yet have an object pool, Gitaly will create it. It will
then create the new repository and automatically link it to the object pool.
The upstream repository will be recorded as the primary member of the object
pool, and the fork as a secondary member.
- `UnforkRepository()` will remove a repository from the object pool it is
connected to. This will stop deduplication of objects. For the primary object
pool member this also means that Gitaly will stop pulling new objects into the
object pool.
- `GetObjectPool()` returns the object pool for a given repository. The pool
description will contain information about the pool's primary object pool
member as well as all secondary object pool members.
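Sketched as an interface, the proposed surface could look as follows; the
message shapes are assumptions, as the blueprint intentionally leaves the
implementation details open:

```go
package objectpool

import "context"

// Pool describes an object pool and its members as GetObjectPool() would
// report it. Field names are illustrative only.
type Pool struct {
	Path        string
	Primary     string   // upstream repository feeding the pool
	Secondaries []string // forks deduplicating against the pool
}

// ForkService is a stand-in for the proposed high-level RPCs.
type ForkService interface {
	// ForkRepository creates the pool on demand, creates the fork, and
	// links both to the pool in a single call.
	ForkRepository(ctx context.Context, upstream, fork string) error
	// UnforkRepository reduplicates objects and detaches the repository
	// from its pool.
	UnforkRepository(ctx context.Context, repo string) error
	// GetObjectPool reports the pool and its members for a repository.
	GetObjectPool(ctx context.Context, repo string) (Pool, error)
}
```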
Furthermore, the following changes will be implemented:
- `RemoveRepository()` will remove the repository from its object pool. If it
was the last object pool member, the pool will be removed.
- `OptimizeRepository()`, when executed on the primary object pool member, will
also update and optimize the object pool.
- `ReplicateRepository()` needs to be aware of object pools and replicate them
correctly. Repositories shall be linked to and unlinked from object pools as
required. While this is a step towards fixing the Praefect world, which may
seem redundant given that we plan to deprecate Praefect anyway, this RPC call
is also used for other use cases like repository rebalancing.
With these changes, Gitaly will have much tighter control over the lifecycle of
object pools. Furthermore, as it starts to track the membership of repositories
in object pools it can become the single source of truth for fork networks.
### Fix inefficient maintenance of object pools
In order to update object pools, Gitaly performs a fetch of new objects from the
primary object pool member into the object pool. This fetch is inefficient as it
needlessly negotiates which objects are new in the primary object pool member.
But given that objects are already deduplicated in the primary object pool
member, its object database should only contain objects that do not yet exist in
the object pool. Consequently, we should be able to skip the negotiation
completely and instead link into the object pool all objects that exist in the
source repository (see the sketch below).
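A sketch of what linking objects into place could mean in the packfile case; a
real implementation would also cover loose objects and guard against concurrent
repacks:

```go
package objectpool

import (
	"os"
	"path/filepath"
	"strings"
)

// linkPacksIntoPool hard-links the primary member's packfiles into the
// pool's object database. Because the member's objects are already
// deduplicated against the pool, everything it has on disk is by
// definition new to the pool, and no negotiation is needed.
func linkPacksIntoPool(memberPath, poolPath string) error {
	srcDir := filepath.Join(memberPath, "objects", "pack")
	dstDir := filepath.Join(poolPath, "objects", "pack")
	entries, err := os.ReadDir(srcDir)
	if err != nil {
		return err
	}
	for _, entry := range entries {
		name := entry.Name()
		if !strings.HasSuffix(name, ".pack") && !strings.HasSuffix(name, ".idx") {
			continue
		}
		err := os.Link(filepath.Join(srcDir, name), filepath.Join(dstDir, name))
		if err != nil && !os.IsExist(err) {
			return err
		}
	}
	return nil
}
```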
In the current design, these objects are kept alive by creating references to
the just-fetched objects. If the fetch deleted references or force-updated any
references, then it may happen that previously-referenced objects become
unreferenced. Gitaly thus creates keep-around references so that they cannot
ever be deleted. Furthermore, those references are required in order to properly
replicate object pools as the replication is reference-based.
These two things can be solved in different ways:
- We can set the `preciousObjects` repository extension. This will instruct all
versions of Git which understand this extension to never delete any objects,
even if `git-prune(1)` or similar commands were executed. Versions of Git that
do not understand this extension would refuse to work in this repository. A
configuration sketch follows this list.
- Instead of replicating object pools via `git-fetch(1)`, we can instead
replicate them by sending over all objects part of the object database.
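For reference, the `preciousObjects` extension is plain repository
configuration. A sketch of how a pool could be marked, keeping in mind that
`extensions.*` settings are only honored with repository format version 1:

```go
package objectpool

import (
	"context"
	"os/exec"
)

// markPreciousObjects configures the pool so that conforming Git versions
// never delete objects, and non-conforming ones refuse to operate on it.
func markPreciousObjects(ctx context.Context, poolPath string) error {
	for _, kv := range [][2]string{
		{"core.repositoryFormatVersion", "1"},
		{"extensions.preciousObjects", "true"},
	} {
		cmd := exec.CommandContext(ctx, "git", "-C", poolPath, "config", kv[0], kv[1])
		if err := cmd.Run(); err != nil {
			return err
		}
	}
	return nil
}
```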
Taken together this means that we can stop writing references in object pools
altogether. This leads to efficient updates of object pools by simply linking
all new objects into place, and it fixes issues we have seen with unbounded
growth of references in object pools.
## Design and implementation details
<!--
This section intentionally left blank. I first want to reach consensus on the
bigger picture I'm proposing in this blueprint before I iterate and fill in the
lower-level design and implementation details.
-->
## Problems with the design
As mentioned before, object pools are not a perfect solution. This section goes
over the most important issues.
### Complexity of lifecycle management
Even though the lifecycle of object pools becomes easier to handle once it is
fully owned by Gitaly, it is still complex and needs to be considered in many
ways. Handling object pools in combination with their repositories is not an
atomic operation, as any action by necessity spans at least two different
resources.
### Performance issues
As object pools deduplicate objects, the end result is that object pool members
never have the full closure of objects in a single packfile. This is not
typically an issue for the primary object pool member, which by definition
cannot diverge from the object pool's contents. But secondary object pool
members can and often will diverge from the original contents of the upstream
repository.
This leads to two different sets of reachable objects in secondary object pool
members. Unfortunately, due to limitations in Git itself, this precludes the use
of a subset of optimizations:
- Packfiles cannot be reused as efficiently to serve already-deltified objects
when handling fetches. This requires Git to recompute deltas on the fly for
object pool members which have diverged from their object pool.
- Packfile bitmaps can only exist in object pools, as it is neither possible
nor easily feasible for these bitmaps to cover multiple object databases. This
requires Git to traverse larger parts of the object graph for many operations,
especially when serving fetches.
### Dependent writes across repositories
The design of object pools introduces significant complexity into the Raft world
where we use a write-ahead log for all changes to repositories. In the ideal
case, a Raft-based design would only need to care about the write-ahead log of a
single repository when considering requests. But with object pools, we are
forced to consider both reads and writes for a pooled repository to be dependent
on all writes in its object pool having been applied.
## Alternative Solutions
The proposed solution is not obviously the best choice as it has issues both
with complexity (management of the lifecycle) and performance (inefficiently
served fetches for pool members).
This section explores alternatives to object pools and why they have not been
chosen as the new target architecture.
### Stop using object pools altogether
An obvious way to avoid all of the complexity is to stop using object pools
altogether. While it is charming from an engineering point of view as we can
significantly simplify the architecture, it is not a viable approach from the
product perspective as it would mean that we cannot support efficient forking
workflows.
### Primary repository as object pool
Instead of creating an explicit object pool repository, we could just use the
upstream repository as an alternate object database of all forks. This avoids a
lot of complexity around managing the lifetime of the object pool, at least
superficially. Furthermore, it circumvents the issue of how to update object
pools as it will always match the contents of the upstream repository.
It has a number of downsides though:
- Normal repositories can now have different states, where some of the
repositories are allowed to prune objects and others aren't. This introduces a
source of uncertainty and makes it easy to accidentally delete objects in a
normal repository and thus corrupt its forks.
- When upstream repositories go private we must stop updating objects which are
supposed to be deduplicated across members of the fork network. This means
that we would ultimately still be forced to create object pools once this
happens in order to freeze the set of deduplicated objects at the point in
time where the repository goes private.
- Deleting repositories becomes more complex as we need to take into account
whether a repository is linked to by forks.
### Reference namespaces
With `gitnamespaces(7)`, Git provides a mechanism to partition references into
different sets of namespaces. This allows us to serve all forks from a single
repository that contains all objects.
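Git exposes this through the `GIT_NAMESPACE` environment variable (or
`git --namespace=...`), which transparently scopes all references under
`refs/namespaces/<name>/`. A sketch of serving one fork's fetch out of such a
shared repository:

```go
package objectpool

import (
	"context"
	"os"
	"os/exec"
)

// uploadPackForFork serves a fetch for a single fork out of the shared
// repository. GIT_NAMESPACE restricts every reference the client sees to
// refs/namespaces/<forkID>/ while the object database stays shared.
func uploadPackForFork(ctx context.Context, sharedRepo, forkID string) error {
	cmd := exec.CommandContext(ctx, "git", "upload-pack", sharedRepo)
	cmd.Env = append(os.Environ(), "GIT_NAMESPACE="+forkID)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}
```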
One neat property is that we have the global view of objects referenced by all
forks together in a single object database. We can thus easily perform shared
housekeeping across all forks at once, including deletion of objects that are
not used by any of the forks anymore. Regarding objects, this is likely to be
the most efficient solution we could potentially aim for.
There are again some downsides though:
- Calculating usage quotas must by necessity take actual reachability of objects
into account, which is expensive to compute. This is not a showstopper, but
something to keep in mind.
- One stated requirement is that it must not be possible to make objects
reachable in other repositories from forks. This property could theoretically be
enforced by only allowing access to reachable objects, so that an object can
only be accessed through the virtual repository if it is reachable from the
repository's references. But reachability checks are too compute-heavy for this
to be practical.
- Even though references are partitioned, large fork networks would still easily
end up with multiple millions of references. It is unclear what the impact on
performance would be.
- The blast radius for any repository-level attacks significantly increases as
you would not only impact your own repository, but also all forks.
- Custom hooks would have to be isolated for each of the virtual repositories.
Since the execution of Git hooks is controlled by Gitaly, it should be possible
to handle this for each of the namespaces.
### Filesystem-based deduplication
The idea of deduplicating objects on the filesystem level was floating around at
several points in time. While it would be nice if we could shift the burden of
this to another component, it is likely not easy to implement due to the nature
of how Git works.
The most important contributing factor to repository sizes is Git objects.
While it would be possible to store the objects in their loose representation
and thus deduplicate on that level, this is infeasible:
- Git would not be able to deltify objects, which is an extremely important
mechanism to reduce on-disk size. It is unlikely that the size reduction
caused by deduplication would outweigh the size reduction gained from the
deltification mechanism.
- Loose objects are significantly less efficient when accessing the repository.
- Serving fetches requires us to send a packfile to the client. Usually, Git is
able to reuse large parts of already-existing packfiles, which significantly
reduces the computational overhead.
Deduplicating on the loose-object level is thus infeasible.
The other unit that one could try to deduplicate is packfiles. But packfiles are
not deterministically generated by Git and will furthermore be different once
repositories start to diverge from each other. So packfiles are not a natural
fit for filesystem-level deduplication either.
An alternative could be to use hard links of packfiles across repositories. This
would cause us to duplicate storage space whenever any repository decides to
perform a repack of objects and would thus be unpredictable and hard to manage.
### Custom object backend
In theory, it would be possible to implement a custom object backend that allows
us to store objects in such a way that we can deduplicate them across forks.
There are several technical hurdles though that keep us from doing so without
significant upstream investments:
- Git is not currently designed to have different backends for objects. Accesses
to files that are part of the object database are littered across the code base
with no abstraction layer. This is in contrast to the reference database, which
has at least some level of abstraction.
- Implementing a custom object backend would likely necessitate a fork of the
Git project. Even if we had the resources to do so, it would introduce a major
risk factor due to potential incompatibilities with upstream changes. It would
become impossible to use vanilla Git, which is often a requirement that exists
in the context of Linux distributions that package GitLab.
Both the initial and the operational risk of ongoing maintenance are too high to
really justify this approach for now. We might revisit this approach in the
future.