debian-mirror-gitlab/doc/development/database/clickhouse/tiered_storage.md
2023-05-27 22:25:52 +05:30

5.3 KiB

stage group info
Data Stores Database To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments

Tiered Storages in ClickHouse

NOTE: The MergeTree table engine in ClickHouse supports tiered storage. See the documentation for Using Multiple Block Devices for Data Storage for details on setup and further explanation.

Quoting from the MergeTree documentation:

MergeTree family table engines can store data on multiple block devices. For example, it can be useful when the data of a certain table are implicitly split into "hot" and "cold". The most recent data is regularly requested but requires only a small amount of space. On the contrary, the fat-tailed historical data is requested rarely.

When used with remote storage backends such as Amazon S3, this makes a very efficient storage scheme. It allows for storage policies, which allows data to be on local disks for a period of time and then move it to object storage.

An example configuration can look like this:

<storage_configuration>
    <disks>
        <fast_ssd>
            <path>/mnt/fast_ssd/clickhouse/</path>
        </fast_ssd>
        <gcs>
            <support_batch_delete>false</support_batch_delete>
            <type>s3</type>
            <endpoint>https://storage.googleapis.com/${BUCKET_NAME}/${ROOT_FOLDER}/</endpoint>
            <access_key_id>${SERVICE_ACCOUNT_HMAC_KEY}</access_key_id>
            <secret_access_key>${SERVICE_ACCOUNT_HMAC_SECRET}</secret_access_key>
            <metadata_path>/var/lib/clickhouse/disks/gcs/</metadata_path>
        </gcs>
     ...
    </disks>
    ...
    <policies>

        <move_from_local_disks_to_gcs> <!-- policy name -->
            <volumes>
                <hot> <!-- volume name -->
                    <disk>fast_ssd</disk>  <!-- disk name -->
                </hot>
                <cold>
                    <disk>gcs</disk>
                </cold>
            </volumes>
            <move_factor>0.2</move_factor>
            <!-- The move factor determines when to move data from hot volume to cold.
                 See ClickHouse docs for more details. -->
        </moving_from_ssd_to_hdd>
    ....
</storage_configuration>

In this storage policy, two volumes are defined hot and cold. After the hot volume is filled with occupancy of disk_size * move_factor, the data is being moved to Google Cloud Storage (GCS).

If this storage policy is not the default, create tables by attaching the storage policies. For example:

CREATE TABLE key_value_table (
    event_date Date,
    key String,
    value String,
) ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
SETTINGS storage_policy = 'move_from_local_disks_to_gcs'

NOTE: In this storage policy, the move happens implicitly. It is also possible to keep hot data on local disks for a fixed period of time and then move them as cold.

This approach is possible with Table TTLs, which are also available with MergeTree table engine.

The ClickHouse documentation shows this feature in detail, in the example of implementing a hot - warm - cold architecture.

You can take a similar approach for the example shown above. First, adjust the storage policy:

<storage_configuration>
    ...
    <policies>
        <local_disk_and_gcs> <!-- policy name -->
            <volumes>
                <hot> <!-- volume name -->
                    <disk>fast_ssd</disk>  <!-- disk name -->
                </hot>
                <cold>
                    <disk>gcs</disk>
                </cold>
            </volumes>
        </local_disk_and_gcs>
    ....
</storage_configuration>

Then create the table as:

CREATE TABLE another_key_value_table (
    event_date Date,
    key String,
    value String,
) ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
TTL
    event_date TO VOLUME 'hot',
    event_date + INTERVAL 1 YEAR TO VOLUME 'cold'
SETTINGS storage_policy = 'local_disk_and_gcs';

This creates the table so that data older than 1 year (evaluated against the event_date column) is moved to GCS. Such a storage policy can be helpful for append-only tables (like audit events) where only the most recent data is accessed frequently. You can drop the data altogether, which can be a regulatory requirement.

We don't mention modifying TTLs in this guide, but that is possible as well. See ClickHouse documentation for modifying TTL for details.