---
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
comments: false
description: 'Learn how to operate on large time-decay data'
---
# Time-decay data
[Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/326035) in GitLab 14.0.
This document describes the *time-decay pattern* introduced in the
[Database Scalability Working Group](https://about.gitlab.com/company/team/structure/working-groups/database-scalability/#time-decay-data).
We discuss the characteristics of time-decay data, and propose best practices for GitLab development
to consider in this context.
Some datasets are subject to strong time-decay effects, in which recent data is accessed far more
frequently than older data. Another aspect of time-decay: with time, some types of data become
less important. This means we can also move old data to less durable (less available) storage,
or, in extreme cases, even delete the data.
Those effects are usually tied to product or application semantics. They can vary in the degree to
which older data are accessed, and in how useful or required older data are to the users or the
application.
Let's first consider entities with no inherent time-related bias for their data.
A record for a user or a project may be equally important and frequently accessed, regardless of when
it was created. We cannot predict, based on a user's `id` or `created_at`, how often the related
record is accessed or updated.
On the other hand, a good example of datasets with extreme time-decay effects is logs and time
series data, such as events recording user actions.
Most of the time, that type of data has no business use after a couple of days or weeks, and it
quickly becomes less important even from a data analysis perspective. It represents a snapshot that
quickly becomes less and less relevant to the current state of the application, until at
some point it has no real value.
Between the two extremes, we find datasets that hold useful information we want to
keep around, but whose old records are seldom accessed after an initial (short) time period following
creation.
## Characteristics of time-decay data
We are interested in datasets that show the following characteristics:
- **Size of the dataset:** they are considerably large.
- **Access methods:** we can filter the vast majority of queries accessing the dataset
  by a time-related dimension or a categorical dimension with time-decay effects.
- **Immutability:** the time-decay status does not change.
- **Retention:** whether we want to keep the old data or not, and whether old
  data should remain accessible to users through the application.
### Size of the dataset
There can be datasets of variable sizes that show strong time-decay effects, but in the context of
this blueprint, we intend to focus on entities with a **considerably large dataset**.
Smaller datasets do not contribute significantly to database-related resource usage, nor do they
inflict a considerable performance penalty on queries.
In contrast, large datasets of over about 50 million records, or 100 GB in size, add significant
overhead when constantly accessing only a really small subset of the data. In those cases, we want to
use the time-decay effect to our advantage and reduce the actively accessed dataset.
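For a rough sense of scale, you can check a table's approximate row count and total on-disk size
directly in PostgreSQL. The sketch below uses a hypothetical `events` table and is only an
illustration of how to measure whether a dataset falls into this category:

```sql
-- Approximate row count (from planner statistics) and total on-disk size
-- (table + indexes + TOAST) for a hypothetical `events` table.
SELECT
  reltuples::bigint                                AS approx_row_count,
  pg_size_pretty(pg_total_relation_size('events')) AS total_size
FROM pg_class
WHERE relname = 'events';
```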
### Data access methods
The second and most important characteristic of time-decay data is that most of the time, we are
able to implicitly or explicitly access the data using a date filter,
**restricting our results based on a time-related dimension**.
There can be many such dimensions, but we focus on the creation date of a record, as it is both
the most commonly used, and the one that we can control and optimize against. It:
- Is immutable.
- Is set when the record is created.
- Can be tied to physically clustering the records, without having to move them around.
It's important to add that even if time-decay data are not accessed that way by the application by
default, you can make the vast majority of the queries explicitly filter the data in such
a way. **Time-decay data without such a time-decay-related access method are of no use from an optimization perspective, as there is no way to set and follow a scaling pattern.**
We are not restricting the definition to data that are always accessed using a time-decay related
access method, as there may be some outlier operations. These may be necessary and we can accept
them not scaling, if the rest of the access methods can scale. An example:
an administrator accessing all past events of a specific type, while all other operations only access
a maximum of a month of events, restricted to 6 months in the past.
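To make the access pattern concrete, here is a minimal sketch of what such a time-decay-restricted
query and a supporting index could look like. The `events` table, its columns, and the index name
are illustrative assumptions, not the actual GitLab schema:

```sql
-- Regular application queries restrict results to a recent window over the
-- immutable `created_at` column.
SELECT id, author_id, action
FROM events
WHERE author_id = 42
  AND created_at >= now() - interval '30 days'
ORDER BY created_at DESC
LIMIT 20;

-- A composite index over the filtering dimensions keeps this access method
-- efficient no matter how much older data accumulates.
CREATE INDEX index_events_on_author_id_and_created_at
  ON events (author_id, created_at);
```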
### Immutability
The third characteristic of time-decay data is that their **time-decay status does not change**.
Once they are considered "old", they cannot switch back to "new" or relevant again.
This definition may sound trivial, but we have to be able to make operations over "old" data **more**
expensive (for example, by archiving or moving them to less expensive storage) without having to worry about
the repercussions of switching back to being relevant and having important application operations
underperforming.
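As an illustration only, a one-way move of "old" rows to a cheaper archive table could look like
the sketch below. The `events` and `events_archive` tables and the one-year cutoff are assumptions
for the example, and in practice such a migration would run in batches:

```sql
-- Move events older than one year to an archive table. Because the
-- time-decay status never changes, these rows never have to move back.
WITH archived AS (
  DELETE FROM events
  WHERE created_at < now() - interval '1 year'
  RETURNING *
)
INSERT INTO events_archive
SELECT * FROM archived;
```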
As a counter-example to a time-decay data access pattern, consider an application view that presents
issues ordered by when they were last updated. We are still mostly interested in the most recent data
from an "update" perspective, but that definition is volatile and not actionable: a record can become
relevant again at any time simply by being updated.
### Retention
Finally, a characteristic that further differentiates time-decay data into sub-categories with
slightly different approaches available is **whether we want to keep the old data or not**