info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
comments: false
description: 'Learn how to scale operating on read-mostly data at scale'
---
# Read-mostly data
[Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/326037) in GitLab 14.0.
This document describes the *read-mostly* pattern introduced in the
[Database Scalability Working Group](https://about.gitlab.com/company/team/structure/working-groups/database-scalability/#read-mostly-data).
We discuss the characteristics of *read-mostly* data and propose best practices for GitLab development
to consider in this context.
## Characteristics of read-mostly data
As the name already suggests, *read-mostly* data is about data that is much more often read than
updated. Writing this data through updates, inserts, or deletes is a very rare event compared to
reading this data.
In addition, *read-mostly* data in this context is typically a small dataset. We explicitly don't deal
with large datasets here, even though they often have a "write once, read often" characteristic, too.
### Example: license data
Let's introduce a canonical example: license data in GitLab. A GitLab instance may have a license
attached to use GitLab enterprise features. This license data is held instance-wide, that
is, there typically only exist a few relevant records. This information is kept in a table
`licenses` which is very small.
We consider this *read-mostly* data, because it follows above outlined characteristics:
- **Rare writes:** license data very rarely sees any writes after having inserted the license.
- **Frequent reads:** license data is read extremely often to check if enterprise features can be used.
- **Small size:** this dataset is very small. On GitLab.com we have 5 records at <50kBtotalrelationsize.
### Effects of *read-mostly* data at scale
Given this dataset is small and read very often, we can expect data to nearly always reside in
database caches and/or database disk caches. Thus, the concern with *read-mostly* data is typically
not around database I/O overhead, because we typically don't read data from disk anyway.
However, considering the high frequency reads, this has potential to incur overhead in terms of
database CPU load and database context switches. Additionally, those high frequency queries go
through the whole database stack. They also cause overhead on the database connection
multiplexing components and load balancers. Also, the application spends cycles in preparing and
sending queries to retrieve the data, deserialize the results and allocate new objects to represent
the information gathered - all in a high frequency fashion.
In the example of license data above, the query to read license data was
[identified](https://gitlab.com/gitlab-org/gitlab/-/issues/292900) to stand out in terms of query
frequency. In fact, we were seeing around 6,000 queries per second (QPS) on the cluster during peak
times. With the cluster size at that time, we were seeing about 1,000 QPS on each replica, and fewer
than 400 QPS on the primary at peak times. The difference is explained by our
[database load balancing for scaling reads](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/gitlab/database/load_balancing.rb),
which favors replicas for pure read-only transactions.
The overall transaction throughput on the database primary at the time varied between 50,000 and
70,000 transactions per second (TPS). In comparison, this query frequency only takes a small
portion of the overall query frequency. However, we do expect this to still have considerable
overhead in terms of context switches. It is worth removing this overhead, if we can.
## How to recognize read-mostly data
It can be difficult to recognize *read-mostly* data, even though there are clear cases like in our
example.
One approach is to look at the [read/write ratio and statistics from, for example, the primary](https://bit.ly/3frdtyz). Here, we look at the TOP20 tables by their read/write ratio over 60 minutes (taken in a peak traffic time):
This yields a good impression of which tables are much more often read than written (on the database
primary):
![Read Write Ratio TOP20](img/read_mostly_readwriteratio_v14_2.png)
From here, we can [zoom](https://bit.ly/2VmloX1) into for example `gitlab_subscriptions` and realize that index reads peak at above 10k tuples per second overall (there are no seq scans):