by [`Namespaces#with_statistics`](https://gitlab.com/gitlab-org/gitlab/blob/4ab54c2233e91f60a80e5b6fa2181e6899fdcc3e/app/models/namespace.rb#L70) scope. Analyzing this query we noticed that:
Because of all of the above, we can't apply the same pattern to store
and update the namespaces statistics, as the `namespaces` table is one
of the largest tables on GitLab.com. Therefore we needed to find a performant and
alternative method.
## Attempts
###Attempt A: PostgreSQL materialized view
Model can be updated through a refresh strategy based on a project routes SQL and a [materialized view](https://www.postgresql.org/docs/9.6/rules-materializedviews.html):
```sql
SELECT split_part("rs".path, '/', 1) as root_path,
COALESCE(SUM(ps.storage_size), 0) AS storage_size,
COALESCE(SUM(ps.repository_size), 0) AS repository_size,
COALESCE(SUM(ps.wiki_size), 0) AS wiki_size,
COALESCE(SUM(ps.lfs_objects_size), 0) AS lfs_objects_size,
COALESCE(SUM(ps.build_artifacts_size), 0) AS build_artifacts_size,
COALESCE(SUM(ps.packages_size), 0) AS packages_size
FROM "projects"
INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
INNER JOIN project_statistics ps ON ps.project_id = projects.id
While this implied a single query update (and probably a fast one), it has some downsides:
- Materialized views syntax varies from PostgreSQL and MySQL. While this feature was worked on, MySQL was still supported by GitLab.
- Rails does not have native support for materialized views. We'd need to use a specialized gem to take care of the management of the database views, which implies additional work.
###Attempt B: An update through a CTE
Similar to Attempt A: Model update done through a refresh strategy with a [Common Table Expression](https://www.postgresql.org/docs/9.1/queries-with.html)
```sql
WITH refresh AS (
SELECT split_part("rs".path, '/', 1) as root_path,
COALESCE(SUM(ps.storage_size), 0) AS storage_size,
COALESCE(SUM(ps.repository_size), 0) AS repository_size,
COALESCE(SUM(ps.wiki_size), 0) AS wiki_size,
COALESCE(SUM(ps.lfs_objects_size), 0) AS lfs_objects_size,
COALESCE(SUM(ps.build_artifacts_size), 0) AS build_artifacts_size,
COALESCE(SUM(ps.packages_size), 0) AS packages_size
FROM "projects"
INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
INNER JOIN project_statistics ps ON ps.project_id = projects.id
- We'd have to migrate **all namespaces** by adding and filling a new column. Because of the size of the table, dealing with time/cost will not be great. The background migration will take approximately `153h`, see <https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/29772>.
- Background migration has to be shipped one release before, delaying the functionality by another milestone.
###Attempt E (final): Update the namespace storage statistics in async way
This approach consists of keep using the incremental statistics updates we currently already have,
but we refresh them through Sidekiq jobs and in different transactions:
1. Create a second table (`namespace_aggregation_schedules`) with two columns `id` and `namespace_id`.
1. Whenever the statistics of a project changes, insert a row into `namespace_aggregation_schedules`
- We don't insert a new row if there's already one related to the root namespace.
- Keeping in mind the length of the transaction that involves updating `project_statistics`(<https://gitlab.com/gitlab-org/gitlab-foss/issues/62488>), the insertion should be done in a different transaction and through a Sidekiq Job.
1. After inserting the row, we schedule another worker to be executed async at two different moments:
- One enqueued for immediate execution and another one scheduled in `1.5h` hours.
- We only schedule the jobs, if we can obtain a `1.5h` lease on Redis on a key based on the root namespace ID.
- If we can't obtain the lease, it indicates there's another aggregation already in progress, or scheduled in no more than `1.5h`.
1. This worker will:
- Update the root namespace storage statistics by querying all the namespaces through a service.
- Delete the related `namespace_aggregation_schedules` after the update.
1. Another Sidekiq job is also included to traverse any remaining rows on the `namespace_aggregation_schedules` table and schedule jobs for every pending row.
- This job is scheduled with cron to run every night (UTC).
This implementation has the following benefits:
- All the updates are done async, so we're not increasing the length of the transactions for `project_statistics`.
- We're doing the update in a single SQL query.
- It is compatible with PostgreSQL and MySQL.
- No background migration required.
The only downside of this approach is that namespaces' statistics are updated up to `1.5` hours after the change is done,
which means there's a time window in which the statistics are inaccurate. Because we're still not
[enforcing storage limits](https://gitlab.com/gitlab-org/gitlab-foss/issues/30421), this is not a major problem.
##Conclusion
Updating the storage statistics asynchronously, was the less problematic and
performant approach of aggregating the root namespaces.
All the details regarding this use case can be found on: