debian-mirror-gitlab/doc/development/database/migrations_for_multiple_databases.md
2022-07-17 14:43:12 +02:00

stage: Enablement
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments

Migrations for Multiple databases

Support for describing migration purposes was introduced in GitLab 14.8.

This document describes how to properly write database migrations for the decomposed GitLab application using multiple databases.

Learn more about general multiple databases support in a separate document.

WARNING: If you experience any issues using Gitlab::Database::Migration[2.0], you can temporarily revert to the previous behavior by changing the version to Gitlab::Database::Migration[1.0]. Please report any issues with Gitlab::Database::Migration[2.0] in this issue.

The design for multiple databases (except for the Geo database) assumes that all decomposed databases have the same structure (for example, schema), but the data is different in each database. This means that some tables do not contain data in every database.

Operations

Depending on the constructs used, a migration can be classified as one of the following:

  1. Modifying structure (DDL - Data Definition Language) (for example, ALTER TABLE).
  2. Modifying data (DML - Data Manipulation Language) (for example, UPDATE).
  3. Performing other queries (for example, SELECT) that are treated as DML for the purposes of our migrations.

The usage of Gitlab::Database::Migration[2.0] requires migrations to always be of a single purpose. Migrations cannot mix DDL and DML changes as the application requires the structure (as described by db/structure.sql) to be exactly the same across all decomposed databases.
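The classification and the single-purpose rule can be illustrated with a minimal sketch in plain Ruby. This is not how GitLab performs the check: the helper names classify_statement and single_purpose? and the keyword lists are assumptions made for illustration, and a real classifier needs a SQL parser rather than a leading-keyword match.

```ruby
# Hypothetical helpers, not part of GitLab: classify a SQL statement by its
# leading keyword only.
DDL_KEYWORDS = %w[CREATE ALTER DROP].freeze
DML_KEYWORDS = %w[SELECT INSERT UPDATE DELETE].freeze # SELECT is treated as DML

def classify_statement(sql)
  keyword = sql.strip.split(/\s+/, 2).first.to_s.upcase
  return :ddl if DDL_KEYWORDS.include?(keyword)
  return :dml if DML_KEYWORDS.include?(keyword)

  :unknown
end

# A migration is single-purpose when all of its statements fall in one class.
def single_purpose?(statements)
  statements.map { |sql| classify_statement(sql) }.uniq.size <= 1
end
```

Under these assumptions, a migration that issues both CREATE INDEX and UPDATE statements is not single-purpose and would have to be split into two migrations.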

Data Definition Language (DDL)

The DDL migrations are all migrations that:

  1. Create or drop a table (for example, create_table).
  2. Add or remove an index (for example, add_index, add_concurrent_index).
  3. Add or remove a foreign key (for example, add_foreign_key, add_concurrent_foreign_key).
  4. Add or remove a column with or without a default value (for example, add_column).
  5. Create or drop trigger functions (for example, create_trigger_function).
  6. Attach or detach triggers from tables (for example, track_record_deletions, untrack_record_deletions).
  7. Prepare or unprepare asynchronous indexes (for example, prepare_async_index, unprepare_async_index_by_name).

As such, DDL migrations CANNOT:

  1. Read or modify data in any form, via SQL statements or ActiveRecord models.
  2. Update column values (for example, update_column_in_batches).
  3. Schedule background migrations (for example, queue_background_migration_jobs_by_range_at_intervals).
  4. Read the state of feature flags, since they are stored in main: (in the features and feature_gates tables).
  5. Read application settings (as settings are stored in main:).

As the majority of migrations in the GitLab codebase are of the DDL type, this is also the default mode of operation and requires no further changes to the migration files.

Example: perform DDL on all databases

The following example migration adds a concurrent index. It is treated as a change of the structure (DDL) and is executed on all configured databases.

class AddUserIdAndStateIndexToMergeRequestReviewers < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  INDEX_NAME = 'index_on_merge_request_reviewers_user_id_and_state'

  def up
    add_concurrent_index :merge_request_reviewers, [:user_id, :state], where: 'state = 2', name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :merge_request_reviewers, INDEX_NAME
  end
end

Data Manipulation Language (DML)

The DML migrations are all migrations that:

  1. Read data via SQL statements (for example, SELECT * FROM projects WHERE id=1).
  2. Read data via ActiveRecord models (for example, User < MigrationRecord).
  3. Create, update or delete data via ActiveRecord models (for example, User.create!(...)).
  4. Create, update or delete data via SQL statements (for example, DELETE FROM projects WHERE id=1).
  5. Update columns in batches (for example, update_column_in_batches(:projects, :archived, true)).
  6. Schedule background migrations (for example, queue_background_migration_jobs_by_range_at_intervals).
  7. Access application settings (for example, ApplicationSetting.last if run for main: database).
  8. Read and modify feature flags if run for the main: database.

The DML migrations CANNOT:

  1. Make any changes to DDL since this breaks the rule of keeping structure.sql coherent across all decomposed databases.
  2. Read data from another database.

To indicate the DML migration type, a migration must use the restrict_gitlab_migration gitlab_schema: syntax in the migration class. This marks the given migration as DML and restricts which schemas it can access.

Example: perform DML only in context of the database containing the given gitlab_schema

The following example migration updates the archived column of projects. It is executed only for the database containing the gitlab_main schema.

class UpdateProjectsArchivedState < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    update_column_in_batches(:projects, :archived, true) do |table, query|
      query.where(table[:archived].eq(false)) # rubocop:disable CodeReuse/ActiveRecord
    end
  end

  def down
    # no-op
  end
end

Example: usage of ActiveRecord classes

A migration that uses an ActiveRecord class to manipulate data must use the MigrationRecord class. This class is guaranteed to provide the correct connection in the context of a given migration.

Underneath, MigrationRecord == ActiveRecord::Base: when db:migrate runs, it switches the active connection of ActiveRecord::Base (for example, ActiveRecord::Base.establish_connection :ci). MigrationRecord is required to avoid confusion with direct use of ActiveRecord::Base.

This implies that DML migrations cannot read data from other databases. For example, a migration running in the context of ci: cannot read feature flags from main:, because no connection to another database is established.

class UpdateProjectsArchivedState < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  class Project < MigrationRecord
    include EachBatch # provides each_batch
  end

  def up
    Project.where(archived: false).each_batch do |batch|
      batch.update_all(archived: true)
    end
  end

  def down
    # no-op
  end
end

The special purpose of gitlab_shared

As described in gitlab_schema, tables marked gitlab_shared are allowed to contain data in all databases. This implies that migrations touching these tables should run on all databases, whether they modify structure (DDL) or data (DML).

As such, migrations accessing gitlab_shared do not need to use restrict_gitlab_migration gitlab_schema:. Without the restriction, these migrations run across all databases and are allowed to modify data on each of them. If restrict_gitlab_migration gitlab_schema: is specified, the DML migration runs only in the context of the database containing the given gitlab_schema.

Example: run DML gitlab_shared migration on all databases

Example migration updating loose_foreign_keys_deleted_records table that is marked in lib/gitlab/database/gitlab_schemas.yml as gitlab_shared.

This migration is executed across all configured databases.

class DeleteAllLooseForeignKeyRecords < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  def up
    execute("DELETE FROM loose_foreign_keys_deleted_records")
  end

  def down
    # no-op
  end
end

Example: run DML gitlab_shared only on the database containing the given gitlab_schema

Example migration updating loose_foreign_keys_deleted_records table that is marked in lib/gitlab/database/gitlab_schemas.yml as gitlab_shared.

Because this migration is restricted to gitlab_ci, it is executed only in the context of the database containing the gitlab_ci schema.

class DeleteCiBuildsLooseForeignKeyRecords < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  restrict_gitlab_migration gitlab_schema: :gitlab_ci

  def up
    execute("DELETE FROM loose_foreign_keys_deleted_records WHERE fully_qualified_table_name='ci_builds'")
  end

  def down
    # no-op
  end
end

The behavior of skipping migrations

The only migrations that are skipped are the ones performing DML changes. The DDL migrations are always and unconditionally executed.

The implemented solution uses database_tasks: to indicate which additional database configurations (in config/database.yml) share the same primary database. db:migrate is not executed for database configurations marked with database_tasks: false.

If database configurations do not share databases (all have database_tasks: true), each migration runs for every database configuration:

  1. The DDL migration applies all structure changes on all databases.
  2. The DML migration runs only in the context of a database containing the given gitlab_schema:.
  3. If the DML migration is not eligible to run, it is skipped. It is still marked as executed in schema_migrations. While running db:migrate, the skipped migration outputs Current migration is skipped since it modifies 'gitlab_ci' which is outside of 'gitlab_main, gitlab_shared'.
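The three rules above can be sketched as a small decision function. This is a hypothetical model for illustration only, not GitLab's implementation: the migration_action name and its arguments are assumptions, and the schema lists stand in for whatever a given connection manages.

```ruby
# Hypothetical model of the run-or-skip decision.
# restricted_schema:  the gitlab_schema: passed to restrict_gitlab_migration,
#                     or nil when the migration declares no restriction.
# connection_schemas: schemas managed by the database currently being migrated.
def migration_action(restricted_schema, connection_schemas)
  # Unrestricted migrations (DDL, or DML on gitlab_shared) run everywhere.
  return :run if restricted_schema.nil?
  # Restricted DML runs only where the schema lives.
  return :run if connection_schemas.include?(restricted_schema)

  :skip # skipped, but still recorded in schema_migrations
end

main_schemas = %i[gitlab_main gitlab_shared]
ci_schemas   = %i[gitlab_ci gitlab_shared]

puts migration_action(nil, main_schemas)        # unrestricted migration
puts migration_action(:gitlab_ci, main_schemas) # restricted to gitlab_ci, main database
puts migration_action(:gitlab_ci, ci_schemas)   # restricted to gitlab_ci, ci database
```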

To prevent loss of migrations when database_tasks: false is configured, a dedicated Rake task, gitlab:db:validate_config, is used. It validates the correctness of database_tasks: by checking the database identifiers of each underlying database configuration. Configurations that share a database are required to have database_tasks: false set. gitlab:db:validate_config always runs before db:migrate.
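The idea behind this validation can be sketched in plain Ruby. This is not the Rake task's code: the invalid_shared_configs helper, the hash keys, and the rule that the first configuration listed for a database is its primary are all assumptions made for illustration.

```ruby
# Hypothetical check. Each configuration is a hash with:
#   :name           - configuration name from config/database.yml
#   :identifier     - stand-in for the identifier of the underlying database
#   :database_tasks - whether db:migrate runs for this configuration
def invalid_shared_configs(configs)
  configs.group_by { |config| config[:identifier] }.flat_map do |_identifier, group|
    # Assumption: the first configuration is the primary for that database.
    # Every other configuration sharing it must set database_tasks: false.
    group.drop(1).select { |config| config[:database_tasks] }.map { |config| config[:name] }
  end
end

shared = [
  { name: 'main', identifier: 'db-1', database_tasks: true },
  { name: 'ci',   identifier: 'db-1', database_tasks: true } # misconfigured
]

puts invalid_shared_configs(shared).inspect
```

In this sketch, a non-empty result means that migrations would run twice against the same database, which is the situation the validation is meant to catch before db:migrate starts.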

Validation

In a nutshell, validation uses pg_query to analyze each query and classifies tables with information from gitlab_schemas.yml. The migration is skipped if the specified gitlab_schema is outside of the list of schemas managed by a given database connection (Gitlab::Database::gitlab_schemas_for_connection).

Gitlab::Database::Migration[2.0] includes Gitlab::Database::MigrationHelpers::RestrictGitlabSchema, which extends the #migrate method. For the duration of a migration, a dedicated query analyzer, Gitlab::Database::QueryAnalyzers::RestrictAllowedSchemas, is installed. It accepts the list of allowed schemas as defined by restrict_gitlab_migration. If an executed query accesses a table outside of the allowed schemas, it raises an exception.
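The behavior of such an analyzer can be approximated with a short, self-contained sketch. It is not the RestrictAllowedSchemas implementation: it extracts table names with a naive regular expression instead of pg_query, and the TABLE_SCHEMAS constant, the check_query! method, and the DisallowedSchemaError class are hypothetical names. The three table-to-schema pairs are the ones used in the examples on this page.

```ruby
# Hypothetical stand-in for the table-to-schema lookup.
TABLE_SCHEMAS = {
  'projects' => :gitlab_main,
  'ci_builds' => :gitlab_ci,
  'loose_foreign_keys_deleted_records' => :gitlab_shared
}.freeze

class DisallowedSchemaError < StandardError; end

# Naive table extraction; a real analyzer parses the query instead.
def tables_in(sql)
  sql.scan(/\b(?:FROM|UPDATE|INTO|JOIN)\s+([a-z_]+)/i).flatten
end

# Raises when a query touches a table whose schema is not allowed.
# Tables missing from TABLE_SCHEMAS raise KeyError in this sketch.
def check_query!(sql, allowed_schemas)
  tables_in(sql).each do |table|
    schema = TABLE_SCHEMAS.fetch(table)
    next if allowed_schemas.include?(schema)

    raise DisallowedSchemaError,
          "'#{table}' (#{schema}) is outside of allowed schemas: #{allowed_schemas.join(', ')}"
  end
  true
end

check_query!('SELECT * FROM projects WHERE id=1', %i[gitlab_main gitlab_shared])
```

Calling check_query! with the same query and %i[gitlab_ci gitlab_shared] as the allowed list raises DisallowedSchemaError in this sketch, which mirrors the case where a migration restricted to gitlab_ci touches projects.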

Exceptions

Depending on the misuse or lack of restrict_gitlab_migration, various exceptions can be raised during the migration run and prevent the migration from completing.

Exception 1: migration running in DDL mode does DML select

class UpdateProjectsArchivedState < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  # Missing:
  # restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    update_column_in_batches(:projects, :archived, true) do |table, query|
      query.where(table[:archived].eq(false)) # rubocop:disable CodeReuse/ActiveRecord
    end
  end

  def down
    # no-op
  end
end

The migration fails with:

Select/DML queries (SELECT/UPDATE/DELETE) are disallowed in the DDL (structure) mode
Modifying of 'projects' (gitlab_main) with 'SELECT * FROM projects...

The current migration does not use restrict_gitlab_migration. Its absence indicates a migration running in DDL mode, but the executed payload appears to be reading data from projects.

The solution is to add restrict_gitlab_migration gitlab_schema: :gitlab_main.

Exception 2: migration running in DML mode changes the structure

class AddUserIdAndStateIndexToMergeRequestReviewers < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  # restrict_gitlab_migration if defined indicates DML, it should be removed
  restrict_gitlab_migration gitlab_schema: :gitlab_main

  INDEX_NAME = 'index_on_merge_request_reviewers_user_id_and_state'

  def up
    add_concurrent_index :merge_request_reviewers, [:user_id, :state], where: 'state = 2', name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :merge_request_reviewers, INDEX_NAME
  end
end

The migration fails with:

DDL queries (structure) are disallowed in the Select/DML (SELECT/UPDATE/DELETE) mode.
Modifying of 'merge_request_reviewers' with 'CREATE INDEX...

The current migration uses restrict_gitlab_migration. Its presence indicates DML mode, but the executed payload appears to be making structure changes (DDL).

The solution is to remove restrict_gitlab_migration gitlab_schema: :gitlab_main.

Exception 3: migration running in DML mode accesses data from a table in another schema

class UpdateProjectsArchivedState < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  # Since it modifies `projects` it should use `gitlab_main`
  restrict_gitlab_migration gitlab_schema: :gitlab_ci

  def up
    update_column_in_batches(:projects, :archived, true) do |table, query|
      query.where(table[:archived].eq(false)) # rubocop:disable CodeReuse/ActiveRecord
    end
  end

  def down
    # no-op
  end
end

The migration fails with:

Select/DML queries (SELECT/UPDATE/DELETE) do access 'projects' (gitlab_main)
which is outside of list of allowed schemas: 'gitlab_ci'

The current migration is restricted to gitlab_ci, but appears to modify data in gitlab_main.

The solution is to change the restriction to restrict_gitlab_migration gitlab_schema: :gitlab_main.

Exception 4: mixing DDL and DML mode

class UpdateProjectsArchivedState < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  # This migration is invalid regardless of specification
  # as it cannot modify structure and data at the same time
  restrict_gitlab_migration gitlab_schema: :gitlab_ci

  def up
    add_concurrent_index :merge_request_reviewers, [:user_id, :state], where: 'state = 2', name: 'index_on_merge_request_reviewers'
    update_column_in_batches(:projects, :archived, true) do |table, query|
      query.where(table[:archived].eq(false)) # rubocop:disable CodeReuse/ActiveRecord
    end
  end

  def down
    # no-op
  end
end

A migration that mixes DDL and DML raises one of the previous exceptions, depending on the order of its operations.

Upcoming changes on multiple database migrations

The restrict_gitlab_migration using gitlab_schema: is considered a first iteration of this feature for running migrations selectively depending on context. It is possible to add further restrictions to DML-only migrations (as the structure coherency is likely to stay as-is until further notice) to control when they run.

A potential extension is to limit running a DML migration to specific environments only:

restrict_gitlab_migration gitlab_schema: :gitlab_main, gitlab_env: :gitlab_com

Background migrations

When you use:

  • Background migrations with track_jobs set to true or
  • Batched background migrations

the migration has to write to a jobs table. All of the jobs tables used by background migrations are marked as gitlab_shared, so you can use these migrations when migrating tables in any database.

However, when queuing the batches, you must set restrict_gitlab_migration based on the table you are iterating over. If you are updating all projects, for example, then you would set restrict_gitlab_migration gitlab_schema: :gitlab_main. If, however, you are updating all ci_pipelines, you would set restrict_gitlab_migration gitlab_schema: :gitlab_ci.
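Choosing the restriction from the table being iterated over amounts to a table-to-schema lookup, sketched below. The schema_for_table helper is hypothetical, and the inline YAML is an illustrative excerpt whose layout (table name mapped to schema name) is an assumption rather than a copy of lib/gitlab/database/gitlab_schemas.yml.

```ruby
require 'yaml'

# Illustrative excerpt only; the layout is an assumption for this sketch.
SCHEMAS_YAML = <<~YAML
  projects: gitlab_main
  ci_builds: gitlab_ci
  loose_foreign_keys_deleted_records: gitlab_shared
YAML

# Hypothetical helper: returns the schema to pass as gitlab_schema:
# for a migration that iterates over the given table.
def schema_for_table(table, yaml = SCHEMAS_YAML)
  YAML.safe_load(yaml).fetch(table.to_s).to_sym
end

puts schema_for_table(:projects)
```

A migration that queues batches over the table looked up here would then declare restrict_gitlab_migration with the returned schema.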

As with all DML migrations, you cannot query tables outside of the schema given to restrict_gitlab_migration or gitlab_shared. If you need to query another database, you likely need to split the work into two migrations.

Because the actual migration logic (not the queueing step) for background migrations runs in a Sidekiq worker, the logic can perform DML queries on tables in any database, just like any ordinary Sidekiq worker can.

How to determine gitlab_schema for a given table

See GitLab Schema.