---
stage: Data Stores
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---
# Tiered Storages in ClickHouse

NOTE:
The MergeTree table engine in ClickHouse supports tiered storage.
See the documentation for [Using Multiple Block Devices for Data Storage](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes)
for details on setup and further explanation.

Quoting from the [MergeTree documentation](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes):

<!-- vale gitlab.Simplicity = NO -->

> MergeTree family table engines can store data on multiple block devices. For example,
> it can be useful when the data of a certain table are implicitly split into "hot" and "cold".
> The most recent data is regularly requested but requires only a small amount of space.
> On the contrary, the fat-tailed historical data is requested rarely.
<!-- vale gitlab.Simplicity = YES -->

When used with remote storage backends such as
[Amazon S3](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-s3),
this makes for a very efficient storage scheme. It allows for storage policies, where
data stays on local disks for a period of time and is then moved to object storage.

An [example configuration](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-multiple-volumes_configure) can look like this:

```xml
<storage_configuration>
    <disks>
        <fast_ssd>
            <path>/mnt/fast_ssd/clickhouse/</path>
        </fast_ssd>
        <gcs>
            <support_batch_delete>false</support_batch_delete>
            <type>s3</type>
            <endpoint>https://storage.googleapis.com/${BUCKET_NAME}/${ROOT_FOLDER}/</endpoint>
            <access_key_id>${SERVICE_ACCOUNT_HMAC_KEY}</access_key_id>
            <secret_access_key>${SERVICE_ACCOUNT_HMAC_SECRET}</secret_access_key>
            <metadata_path>/var/lib/clickhouse/disks/gcs/</metadata_path>
        </gcs>
        ...
    </disks>
    ...
    <policies>
        <move_from_local_disks_to_gcs> <!-- policy name -->
            <volumes>
                <hot> <!-- volume name -->
                    <disk>fast_ssd</disk> <!-- disk name -->
                </hot>
                <cold>
                    <disk>gcs</disk>
                </cold>
            </volumes>
            <move_factor>0.2</move_factor>
            <!-- The move factor determines when to move data from the hot volume
                 to the cold volume. See the ClickHouse docs for more details. -->
        </move_from_local_disks_to_gcs>
    </policies>
    ...
</storage_configuration>
```

In this storage policy, two volumes are defined: `hot` and `cold`. After the `hot`
volume fills up to an occupancy of `disk_size * move_factor`, data is moved to
Google Cloud Storage (GCS).
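
To confirm that ClickHouse loaded the policy, you can query the `system.storage_policies`
system table. A minimal check against the example configuration above:

```sql
-- List the volumes, disks, and move factor of the example policy.
SELECT policy_name, volume_name, disks, move_factor
FROM system.storage_policies
WHERE policy_name = 'move_from_local_disks_to_gcs';
```
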
If this storage policy is not the default, create tables by attaching the storage
policy. For example:

```sql
CREATE TABLE key_value_table (
    event_date Date,
    key String,
    value String
) ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
SETTINGS storage_policy = 'move_from_local_disks_to_gcs';
```
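
After the table receives data, you can see which disk each data part currently resides
on by querying the `system.parts` table. A quick sketch:

```sql
-- Show the storage location and size of each active part of the table.
SELECT name, disk_name, formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE table = 'key_value_table' AND active;
```
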
NOTE:
In this storage policy, the move happens implicitly. It is also possible to keep
_hot_ data on local disks for a fixed period of time and then move it to the _cold_
volume. This approach is possible with
[Table TTLs](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree-table-ttl),
which are also available with the MergeTree table engine.
The ClickHouse documentation shows this feature in detail in the example of
[implementing a hot - warm - cold architecture](https://clickhouse.com/docs/en/guides/developer/ttl/#implementing-a-hotwarmcold-architecture).

You can take a similar approach for the example shown above. First, adjust the storage policy:

```xml
<storage_configuration>
    ...
    <policies>
        <local_disk_and_gcs> <!-- policy name -->
            <volumes>
                <hot> <!-- volume name -->
                    <disk>fast_ssd</disk> <!-- disk name -->
                </hot>
                <cold>
                    <disk>gcs</disk>
                </cold>
            </volumes>
        </local_disk_and_gcs>
    </policies>
    ...
</storage_configuration>
```

Then create the table as:

```sql
CREATE TABLE another_key_value_table (
    event_date Date,
    key String,
    value String
) ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
TTL
    event_date TO VOLUME 'hot',
    event_date + INTERVAL 1 YEAR TO VOLUME 'cold'
SETTINGS storage_policy = 'local_disk_and_gcs';
```

This creates the table so that data older than 1 year (evaluated against the
`event_date` column) is moved to GCS. Such a storage policy can be helpful for
append-only tables (like audit events) where only the most recent data is accessed
frequently. You can also configure the TTL to drop the data altogether, which can
be a regulatory requirement.
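
For example, extending the TTL clause with a `DELETE` rule removes rows entirely once
they pass a retention period. A minimal sketch, where the table name and the 5-year
retention period are assumptions for illustration:

```sql
CREATE TABLE key_value_table_with_retention (
    event_date Date,
    key String,
    value String
) ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
TTL
    event_date + INTERVAL 1 YEAR TO VOLUME 'cold',
    -- Assumed retention period: delete rows entirely after 5 years.
    event_date + INTERVAL 5 YEAR DELETE
SETTINGS storage_policy = 'local_disk_and_gcs';
```
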
Modifying TTLs after table creation is possible as well, although we don't cover it
in detail in this guide. See the ClickHouse documentation for
[modifying TTL](https://clickhouse.com/docs/en/sql-reference/statements/alter/ttl/#modify-ttl)
for details.
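
As a quick illustration of the syntax, the following sketch replaces the TTL rules of
the table created above; the 2-year interval is an assumption for illustration:

```sql
-- Move data to the cold volume after 2 years instead of 1.
ALTER TABLE another_key_value_table
    MODIFY TTL
        event_date TO VOLUME 'hot',
        event_date + INTERVAL 2 YEAR TO VOLUME 'cold';
```

Existing parts are re-evaluated against the new rules when they are next merged;
`ALTER TABLE ... MATERIALIZE TTL` forces that re-evaluation.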