debian-mirror-gitlab/doc/development/migration_style_guide.md
2021-10-27 15:23:28 +05:30

43 KiB

stage group info
none unassigned To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments

Migration Style Guide

When writing migrations for GitLab, you have to take into account that these are run by hundreds of thousands of organizations of all sizes, some with many years of data in their database.

In addition, having to take a server offline for an upgrade small or big is a big burden for most organizations. For this reason, it is important that your migrations are written carefully, can be applied online, and adhere to the style guide below.

Migrations are not allowed to require GitLab installations to be taken offline ever. Migrations always must be written in such a way to avoid downtime. In the past we had a process for defining migrations that allowed for downtime by setting a DOWNTIME constant. You may see this when looking at older migrations. This process was in place for 4 years without ever being used and as such we've learned we can always figure out how to write a migration differently to avoid downtime.

When writing your migrations, also consider that databases might have stale data or inconsistencies and guard for that. Try to make as few assumptions as possible about the state of the database.

Please don't depend on GitLab-specific code since it can change in future versions. If needed copy-paste GitLab code into the migration to make it forward compatible.

Choose an appropriate migration type

The first step before adding a new migration should be to decide which type is most appropriate.

There are currently three kinds of migrations you can create, depending on the kind of work it needs to perform and how long it takes to complete:

  1. Regular schema migrations. These are traditional Rails migrations in db/migrate that run before new application code is deployed (for GitLab.com before Canary is deployed). This means that they should be relatively fast, no more than a few minutes, so as not to unnecessarily delay a deployment.

    One exception is a migration that takes longer but is absolutely critical for the application to operate correctly. For example, you might have indices that enforce unique tuples, or that are needed for query performance in critical parts of the application. In cases where the migration would be unacceptably slow, however, a better option might be to guard the feature with a feature flag and perform a post-deployment migration instead. The feature can then be turned on after the migration finishes.

  2. Post-deployment migrations. These are Rails migrations in db/post_migrate and run after new application code has been deployed (for GitLab.com after the production deployment has finished). They can be used for schema changes that aren't critical for the application to operate, or data migrations that take at most a few minutes. Common examples for schema changes that should run post-deploy include:

    • Clean-ups, like removing unused columns.
    • Adding non-critical indices on high-traffic tables.
    • Adding non-critical indices that take a long time to create.
  3. Background migrations. These aren't regular Rails migrations, but application code that is executed via Sidekiq jobs, although a post-deployment migration is used to schedule them. Use them only for data migrations that exceed the timing guidelines for post-deploy migrations. Background migrations should not change the schema.

Use the following diagram to guide your decision, but keep in mind that it is just a tool, and the final outcome will always be dependent on the specific changes being made:

graph LR
    A{Schema<br/>changed?}
    A -->|Yes| C{Critical to<br/>speed or<br/>behavior?}
    A -->|No| D{Is it fast?}

    C -->|Yes| H{Is it fast?}
    C -->|No| F[Post-deploy migration]

    H -->|Yes| E[Regular migration]
    H -->|No| I[Post-deploy migration<br/>+ feature flag]
 
    D -->|Yes| F[Post-deploy migration]
    D -->|No| G[Background migration]

How long a migration should take

In general, all migrations for a single deploy shouldn't take longer than 1 hour for GitLab.com. The following guidelines are not hard rules, they were estimated to keep migration duration to a minimum.

NOTE: Keep in mind that all durations should be measured against GitLab.com.

Migration Type Recommended Duration Notes
Regular migrations <= 3 minutes A valid exception are changes without which application functionality or performance would be severely degraded and which cannot be delayed.
Post-deployment migrations <= 10 minutes A valid exception are schema changes, since they must not happen in background migrations.
Background migrations > 10 minutes Since these are suitable for larger tables, it's not possible to set a precise timing guideline, however, any single query must stay below 1 second execution time with cold caches.

Create a regular schema migration

To create a migration you can use the following Rails generator:

bundle exec rails g migration migration_name_here

This generates the migration file in db/migrate.

Schema Changes

Changes to the schema should be committed to db/structure.sql. This file is automatically generated by Rails when you run bundle exec rails db:migrate, so you normally should not edit this file by hand. If your migration is adding a column to a table, that column is added at the bottom. Please do not reorder columns manually for existing tables as this causes confusion to other people using db/structure.sql generated by Rails.

When your local database in your GDK is diverging from the schema from main it might be hard to cleanly commit the schema changes to Git. In that case you can use the scripts/regenerate-schema script to regenerate a clean db/structure.sql for the migrations you're adding. This script applies all migrations found in db/migrate or db/post_migrate, so if there are any migrations you don't want to commit to the schema, rename or remove them. If your branch is not targeting main you can set the TARGET environment variable.

# Regenerate schema against `main`
scripts/regenerate-schema

# Regenerate schema against `12-9-stable-ee`
TARGET=12-9-stable-ee scripts/regenerate-schema

Avoiding downtime

The document "Avoiding downtime in migrations" specifies various database operations, such as:

and explains how to perform them without requiring downtime.

Reversibility

Your migration must be reversible. This is very important, as it should be possible to downgrade in case of a vulnerability or bugs.

In your migration, add a comment describing how the reversibility of the migration was tested.

Some migrations cannot be reversed. For example, some data migrations can't be reversed because we lose information about the state of the database before the migration. You should still create a down method with a comment, explaining why the changes performed by the up method can't be reversed, so that the migration itself can be reversed, even if the changes performed during the migration can't be reversed:

def down
  # no-op

  # comment explaining why changes performed by `up` cannot be reversed.
end

Atomicity

By default, migrations are single transaction. That is, a transaction is opened at the beginning of the migration, and committed after all steps are processed.

Running migrations in a single transaction makes sure that if one of the steps fails, none of the steps are executed, leaving the database in valid state. Therefore, either:

  • Put all migrations in one single-transaction migration.
  • If necessary, put most actions in one migration and create a separate migration for the steps that cannot be done in a single transaction.

For example, if you create an empty table and need to build an index for it, it is recommended to use a regular single-transaction migration and the default rails schema statement: add_index. This is a blocking operation, but it doesn't cause problems because the table is not yet used, and therefore it does not have any records yet.

Naming conventions

We keep column names consistent with ActiveRecord's schema conventions.

Custom index names should follow the pattern index_#{table_name}_on_#{column_1}_and_#{column_2}_#{condition}.

Examples:

  • index_services_on_type_and_id_and_template_when_active
  • index_projects_on_id_service_desk_enabled
  • index_clusters_on_enabled_cluster_type_id_and_created_at

Truncate long index names

PostgreSQL limits the length of identifiers, like column or index names. Column names are not usually a problem, but index names tend to be longer. Some methods for shortening a name that's too long:

  • Prefix it with i_ instead of index_.
  • Skip redundant prefixes. For example, index_vulnerability_findings_remediations_on_vulnerability_remediation_id becomes index_vulnerability_findings_remediations_on_remediation_id.
  • Instead of columns, specify the purpose of the index, such as index_users_for_unconfirmation_notification.

Heavy operations in a single transaction

When using a single-transaction migration, a transaction holds a database connection for the duration of the migration, so you must make sure the actions in the migration do not take too much time: GitLab.com's production database has a 15s timeout, so in general, the cumulative execution time in a migration should aim to fit comfortably in that limit. Singular query timings should fit within the standard limit

In case you need to insert, update, or delete a significant amount of data, you:

  • Must disable the single transaction with disable_ddl_transaction!.
  • Should consider doing it in a Background Migration.

Retry mechanism when acquiring database locks

When changing the database schema, we use helper methods to invoke DDL (Data Definition Language) statements. In some cases, these DDL statements require a specific database lock.

Example:

def change
  remove_column :users, :full_name, :string
end

Executing this migration requires an exclusive lock on the users table. When the table is concurrently accessed and modified by other processes, acquiring the lock may take a while. The lock request is waiting in a queue and it may also block other queries on the users table once it has been enqueued.

More information about PostgresSQL locks: Explicit Locking

For stability reasons, GitLab.com has a specific statement_timeout set. When the migration is invoked, any database query has a fixed time to execute. In a worst-case scenario, the request sits in the lock queue, blocking other queries for the duration of the configured statement timeout, then failing with canceling statement due to statement timeout error.

This problem could cause failed application upgrade processes and even application stability issues, since the table may be inaccessible for a short period of time.

To increase the reliability and stability of database migrations, the GitLab codebase offers a helper method to retry the operations with different lock_timeout settings and wait time between the attempts. Multiple smaller attempts to acquire the necessary lock allow the database to process other statements.

Examples

Removing a column:

include Gitlab::Database::MigrationHelpers

def up
  with_lock_retries do
    remove_column :users, :full_name
  end
end

def down
  with_lock_retries do
    add_column :users, :full_name, :string
  end
end

Multiple changes on the same table:

The helper with_lock_retries wraps all operations into a single transaction. When you have the lock, you should do as much as possible inside the transaction rather than trying to get another lock later. Be careful about running long database statements within the block. The acquired locks are kept until the transaction (block) finishes and depending on the lock type, it might block other database operations.

include Gitlab::Database::MigrationHelpers

def up
  with_lock_retries do
    add_column :users, :full_name, :string
    add_column :users, :bio, :string
  end
end

def down
  with_lock_retries do
    remove_column :users, :full_name
    remove_column :users, :bio
  end
end

Removing a foreign key:

include Gitlab::Database::MigrationHelpers

def up
  with_lock_retries do
    remove_foreign_key :issues, :projects
  end
end

def down
  with_lock_retries do
    add_foreign_key :issues, :projects
  end
end

Changing default value for a column:

include Gitlab::Database::MigrationHelpers

def up
  with_lock_retries do
    change_column_default :merge_requests, :lock_version, from: nil, to: 0
  end
end

def down
  with_lock_retries do
    change_column_default :merge_requests, :lock_version, from: 0, to: nil
  end
end

Creating a new table with a foreign key:

We can wrap the create_table method with with_lock_retries:

def up
  with_lock_retries do
    create_table :issues do |t|
      t.references :project, index: true, null: false, foreign_key: { on_delete: :cascade }
      t.string :title, limit: 255
    end
  end
end

def down
  with_lock_retries do
    drop_table :issues
  end
end

Creating a new table when we have two foreign keys:

Only one foreign key should be created per migration. This is because the addition of a foreign key constraint requires a SHARE ROW EXCLUSIVE lock on the referenced table, and locking multiple tables in the same transaction should be avoided.

For this, we need three migrations:

  1. Creating the table without foreign keys (with the indices).
  2. Add foreign key to the first table.
  3. Add foreign key to the second table.

Creating the table:

def up
  create_table :imports do |t|
    t.bigint :project_id, null: false
    t.bigint :user_id, null: false
    t.string :jid, limit: 255

    t.index :project_id
    t.index :user_id
  end
end

def down
  drop_table :imports
end

Adding foreign key to projects:

We can use the add_concurrent_foreign_key method in this case, as this helper method has the lock retries built into it.

include Gitlab::Database::MigrationHelpers

disable_ddl_transaction!

def up
  add_concurrent_foreign_key :imports, :projects, column: :project_id, on_delete: :cascade
end

def down
  with_lock_retries do
    remove_foreign_key :imports, column: :project_id
  end
end

Adding foreign key to users:

include Gitlab::Database::MigrationHelpers

disable_ddl_transaction!

def up
  add_concurrent_foreign_key :imports, :users, column: :user_id, on_delete: :cascade
end

def down
  with_lock_retries do
    remove_foreign_key :imports, column: :user_id
  end
end

Usage with disable_ddl_transaction!

Generally the with_lock_retries helper should work with disable_ddl_transaction!. A custom RuboCop rule ensures that only allowed methods can be placed within the lock retries block.

disable_ddl_transaction!

def up
  with_lock_retries do
    add_column :users, :name, :text
  end

  add_text_limit :users, :name, 255 # Includes constraint validation (full table scan)
end

The RuboCop rule generally allows standard Rails migration methods, listed below. This example causes a Rubocop offense:

disable_ddl_transaction!

def up
  with_lock_retries do
    add_concurrent_index :users, :name
  end
end

When to use the helper method

The with_lock_retries helper method can be used when you normally use standard Rails migration helper methods. Calling more than one migration helper is not a problem if they're executed on the same table.

Using the with_lock_retries helper method is advised when a database migration involves one of the high-traffic tables.

Example changes:

  • add_foreign_key / remove_foreign_key
  • add_column / remove_column
  • change_column_default
  • create_table / drop_table

The with_lock_retries method cannot be used within the change method, you must manually define the up and down methods to make the migration reversible.

How the helper method works

  1. Iterate 50 times.
  2. For each iteration, set a pre-configured lock_timeout.
  3. Try to execute the given block. (remove_column).
  4. If LockWaitTimeout error is raised, sleep for the pre-configured sleep_time and retry the block.
  5. If no error is raised, the current iteration has successfully executed the block.

For more information check the Gitlab::Database::WithLockRetries class. The with_lock_retries helper method is implemented in the Gitlab::Database::MigrationHelpers module.

In a worst-case scenario, the method:

  • Executes the block for a maximum of 50 times over 40 minutes.
    • Most of the time is spent in a pre-configured sleep period after each iteration.
  • After the 50th retry, the block is executed without lock_timeout, just like a standard migration invocation.
  • If a lock cannot be acquired, the migration fails with statement timeout error.

The migration might fail if there is a very long running transaction (40+ minutes) accessing the users table.

Removing indexes

If the table is not empty when removing an index, make sure to use the method remove_concurrent_index instead of the regular remove_index method. The remove_concurrent_index method drops indexes concurrently, so no locking is required, and there is no need for downtime. To use this method, you must disable single-transaction mode by calling the method disable_ddl_transaction! in the body of your migration class like so:

class MyMigration < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers
  disable_ddl_transaction!

  INDEX_NAME = 'index_name'
  def up
    remove_concurrent_index :table_name, :column_name, name: INDEX_NAME
  end
end

Note that it is not necessary to check if the index exists prior to removing it, however it is required to specify the name of the index that is being removed. This can be done either by passing the name as an option to the appropriate form of remove_index or remove_concurrent_index, or more simply by using the remove_concurrent_index_by_name method. Explicitly specifying the name is important to ensure the correct index is removed.

For a small table (such as an empty one or one with less than 1,000 records), it is recommended to use remove_index in a single-transaction migration, combining it with other operations that don't require disable_ddl_transaction!.

Disabling an index

There are certain situations in which you might want to disable an index before removing it. See the maintenance operations guide for more details.

Adding indexes

Before adding an index, consider if this one is necessary. There are situations in which an index might not be required, like:

  • The table is small (less than 1,000 records) and it's not expected to exponentially grow in size.
  • Any existing indexes filter out enough rows.
  • The reduction in query timings after the index is added is not significant.

Additionally, wide indexes are not required to match all filter criteria of queries, we just need to cover enough columns so that the index lookup has a small enough selectivity. Please review our Adding Database indexes guide for more details.

When adding an index to a non-empty table make sure to use the method add_concurrent_index instead of the regular add_index method. The add_concurrent_index method automatically creates concurrent indexes when using PostgreSQL, removing the need for downtime.

To use this method, you must disable single-transactions mode by calling the method disable_ddl_transaction! in the body of your migration class like so:

class MyMigration < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  disable_ddl_transaction!

  INDEX_NAME = 'index_name'

  def up
    add_concurrent_index :table, :column, name: INDEX_NAME
  end

  def down
    remove_concurrent_index :table, :column, name: INDEX_NAME
  end
end

You must explicitly name indexes that are created with more complex definitions beyond table name, column name(s) and uniqueness constraint. Consult the Adding Database Indexes guide for more details.

If you need to add a unique index, please keep in mind there is the possibility of existing duplicates being present in the database. This means that should always first add a migration that removes any duplicates, before adding the unique index.

For a small table (such as an empty one or one with less than 1,000 records), it is recommended to use add_index in a single-transaction migration, combining it with other operations that don't require disable_ddl_transaction!.

Testing for existence of indexes

If a migration requires conditional logic based on the absence or presence of an index, you must test for existence of that index using its name. This helps avoids problems with how Rails compares index definitions, which can lead to unexpected results. For more details, review the Adding Database Indexes guide.

The easiest way to test for existence of an index by name is to use the index_name_exists? method, but the index_exists? method can also be used with a name option. For example:

class MyMigration < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  INDEX_NAME = 'index_name'

  def up
    # an index must be conditionally created due to schema inconsistency
    unless index_exists?(:table_name, :column_name, name: INDEX_NAME)
      add_index :table_name, :column_name, name: INDEX_NAME
    end
  end

  def down
    # no op
  end
end

Keep in mind that concurrent index helpers like add_concurrent_index, remove_concurrent_index, and remove_concurrent_index_by_name already perform existence checks internally.

Adding foreign-key constraints

When adding a foreign-key constraint to either an existing or a new column also remember to add an index on the column.

This is required for all foreign-keys, for example, to support efficient cascading deleting: when a lot of rows in a table get deleted, the referenced records need to be deleted too. The database has to look for corresponding records in the referenced table. Without an index, this results in a sequential scan on the table, which can take a long time.

Here's an example where we add a new column with a foreign key constraint. Note it includes index: true to create an index for it.

class Migration < ActiveRecord::Migration[6.0]

  def change
    add_reference :model, :other_model, index: true, foreign_key: { on_delete: :cascade }
  end
end

When adding a foreign-key constraint to an existing column in a non-empty table, we have to employ add_concurrent_foreign_key and add_concurrent_index instead of add_reference.

If you have a new or empty table that doesn't reference a high-traffic table, we recommend that you use add_reference in a single-transaction migration. You can combine it with other operations that don't require disable_ddl_transaction!.

You can read more about adding foreign key constraints to an existing column.

NOT NULL constraints

Introduced in GitLab 13.0.

See the style guide on NOT NULL constraints for more information.

Adding Columns With Default Values

With PostgreSQL 11 being the minimum version in GitLab 13.0 and later, adding columns with default values has become much easier and the standard add_column helper should be used in all cases.

Before PostgreSQL 11, adding a column with a default was problematic as it would have caused a full table rewrite. The corresponding helper add_column_with_default has been deprecated and is scheduled to be removed in a later release.

If a backport adding a column with a default value is needed for %12.9 or earlier versions, it should use add_column_with_default helper. If a large table is involved, backporting to %12.9 is contraindicated.

Changing the column default

One might think that changing a default column with change_column_default is an expensive and disruptive operation for larger tables, but in reality it's not.

Take the following migration as an example:

class DefaultRequestAccessGroups < ActiveRecord::Migration[5.2]
  def change
    change_column_default(:namespaces, :request_access_enabled, from: false, to: true)
  end
end

Migration above changes the default column value of one of our largest tables: namespaces. This can be translated to:

ALTER TABLE namespaces
ALTER COLUMN request_access_enabled
SET DEFAULT false

In this particular case, the default value exists and we're just changing the metadata for request_access_enabled column, which does not imply a rewrite of all the existing records in the namespaces table. Only when creating a new column with a default, all the records are going be rewritten.

NOTE: A faster ALTER TABLE ADD COLUMN with a non-null default was introduced on PostgresSQL 11.0, removing the need of rewriting the table when a new column with a default value is added.

For the reasons mentioned above, it's safe to use change_column_default in a single-transaction migration without requiring disable_ddl_transaction!.

Updating an existing column

To update an existing column to a particular value, you can use update_column_in_batches. This splits the updates into batches, so we don't update too many rows at in a single statement.

This updates the column foo in the projects table to 10, where some_column is 'hello':

update_column_in_batches(:projects, :foo, 10) do |table, query|
  query.where(table[:some_column].eq('hello'))
end

If a computed update is needed, the value can be wrapped in Arel.sql, so Arel treats it as an SQL literal. It's also a required deprecation for Rails 6.

The below example is the same as the one above, but the value is set to the product of the bar and baz columns:

update_value = Arel.sql('bar * baz')

update_column_in_batches(:projects, :foo, update_value) do |table, query|
  query.where(table[:some_column].eq('hello'))
end

Like add_column_with_default, there is a RuboCop cop to detect usage of this on large tables. In the case of update_column_in_batches, it may be acceptable to run on a large table, as long as it is only updating a small subset of the rows in the table, but do not ignore that without validating on the GitLab.com staging environment - or asking someone else to do so for you - beforehand.

Dropping a database table

Dropping a database table is uncommon, and the drop_table method provided by Rails is generally considered safe. Before dropping the table, please consider the following:

If your table has foreign keys on a high-traffic table (like projects), then the DROP TABLE statement is likely to stall concurrent traffic until it fails with statement timeout error.

Table has no records (feature was never in use) and no foreign keys:

  • Use the drop_table method in your migration.
def change
  drop_table :my_table
end

Table has records but no foreign keys:

  • First release: Remove the application code related to the table, such as models, controllers and services.
  • Second release: Use the drop_table method in your migration.
def up
  drop_table :my_table
end

def down
  # create_table ...
end

Table has foreign keys:

  • First release: Remove the application code related to the table, such as models, controllers, and services.
  • Second release: Remove the foreign keys using the with_lock_retries helper method. Use drop_table in another migration file.

Migrations for the second release:

Removing the foreign key on the projects table:

# first migration file

def up
  with_lock_retries do
    remove_foreign_key :my_table, :projects
  end
end

def down
  with_lock_retries do
    add_foreign_key :my_table, :projects
  end
end

Dropping the table:

# second migration file

def up
  drop_table :my_table
end

def down
  # create_table ...
end

Integer column type

By default, an integer column can hold up to a 4-byte (32-bit) number. That is a max value of 2,147,483,647. Be aware of this when creating a column that holds file sizes in byte units. If you are tracking file size in bytes, this restricts the maximum file size to just over 2GB.

To allow an integer column to hold up to an 8-byte (64-bit) number, explicitly set the limit to 8-bytes. This allows the column to hold a value up to 9,223,372,036,854,775,807.

Rails migration example:

add_column(:projects, :foo, :integer, default: 10, limit: 8)

Strings and the Text data type

Introduced in GitLab 13.0.

See the text data type style guide for more information.

Timestamp column type

By default, Rails uses the timestamp data type that stores timestamp data without timezone information. The timestamp data type is used by calling either the add_timestamps or the timestamps method.

Also, Rails converts the :datetime data type to the timestamp one.

Example:

# timestamps
create_table :users do |t|
  t.timestamps
end

# add_timestamps
def up
  add_timestamps :users
end

# :datetime
def up
  add_column :users, :last_sign_in, :datetime
end

Instead of using these methods, one should use the following methods to store timestamps with timezones:

  • add_timestamps_with_timezone
  • timestamps_with_timezone
  • datetime_with_timezone

This ensures all timestamps have a time zone specified. This, in turn, means existing timestamps don't suddenly use a different timezone when the system's timezone changes. It also makes it very clear which timezone was used in the first place.

Storing JSON in database

The Rails 5 natively supports JSONB (binary JSON) column type. Example migration adding this column:

class AddOptionsToBuildMetadata < ActiveRecord::Migration[5.0]
  def change
    add_column :ci_builds_metadata, :config_options, :jsonb
  end
end

You have to use a serializer to provide a translation layer:

class BuildMetadata
  serialize :config_options, Serializers::Json # rubocop:disable Cop/ActiveRecordSerialize
end

When using a JSONB column, use the JsonSchemaValidator to keep control of the data being inserted over time.

class BuildMetadata
  validates :config_options, json_schema: { filename: 'build_metadata_config_option' }
end

Encrypted attributes

Introduced in GitLab 14.0.

Do not store attr_encrypted attributes as :text in the database; use :binary instead. This uses the bytea type in PostgreSQL and makes storage more efficient:

class AddSecretToSomething < ActiveRecord::Migration[5.0]
  def change
    add_column :something, :encrypted_secret, :binary
    add_column :something, :encrypted_secret_iv, :binary
  end
end

When storing encrypted attributes in a binary column, we need to provide the encode: false and encode_iv: false options to attr_encrypted:

class Something < ApplicationRecord
  attr_encrypted :secret,
    mode: :per_attribute_iv,
    key: Settings.attr_encrypted_db_key_base_32,
    algorithm: 'aes-256-gcm',
    encode: false,
    encode_iv: false
end

Testing

See the Testing Rails migrations style guide.

Data migration

Please prefer Arel and plain SQL over usual ActiveRecord syntax. In case of using plain SQL, you need to quote all input manually with quote_string helper.

Example with Arel:

users = Arel::Table.new(:users)
users.group(users[:user_id]).having(users[:id].count.gt(5))

#update other tables with these results

Example with plain SQL and quote_string helper:

select_all("SELECT name, COUNT(id) as cnt FROM tags GROUP BY name HAVING COUNT(id) > 1").each do |tag|
  tag_name = quote_string(tag["name"])
  duplicate_ids = select_all("SELECT id FROM tags WHERE name = '#{tag_name}'").map{|tag| tag["id"]}
  origin_tag_id = duplicate_ids.first
  duplicate_ids.delete origin_tag_id

  execute("UPDATE taggings SET tag_id = #{origin_tag_id} WHERE tag_id IN(#{duplicate_ids.join(",")})")
  execute("DELETE FROM tags WHERE id IN(#{duplicate_ids.join(",")})")
end

If you need more complex logic, you can define and use models local to a migration. For example:

class MyMigration < ActiveRecord::Migration[6.0]
  class Project < ActiveRecord::Base
    self.table_name = 'projects'
  end

  def up
    # Reset the column information of all the models that update the database
    # to ensure the Active Record's knowledge of the table structure is current
    Project.reset_column_information

    # ... ...
  end
end

When doing so be sure to explicitly set the model's table name, so it's not derived from the class name or namespace.

Be aware of the limitations when using models in migrations.

Modifying existing data

In most circumstances, prefer migrating data in batches when modifying data in the database.

We introduced a new helper each_batch_range which facilitates the process of iterating over a collection in a performant way. The default size of the batch is defined in the BATCH_SIZE constant.

See the following example to get an idea.

Purging data in batch:

include ::Gitlab::Database::DynamicModelHelpers

disable_ddl_transaction!

def up
  each_batch_range('ci_pending_builds', scope: ->(table) { table.ref_protected }, of: BATCH_SIZE) do |min, max|
    execute <<~SQL
      DELETE FROM ci_pending_builds
        USING ci_builds
        WHERE ci_builds.id = ci_pending_builds.build_id
          AND ci_builds.status != 'pending'
          AND ci_builds.type = 'Ci::Build'
          AND ci_pending_builds.id BETWEEN #{min} AND #{max}
    SQL
  end
end
  • The first argument is the table being modified: 'ci_pending_builds'.
  • The second argument calls a lambda which fetches the relevant dataset selected (the default is set to .all): scope: ->(table) { table.ref_protected }.
  • The third argument is the batch size (the default is set in the BATCH_SIZE constant): of: BATCH_SIZE.

Here is an example MR illustrating how to use our new helper.

Renaming reserved paths

When a new route for projects is introduced, it could conflict with any existing records. The path for these records should be renamed, and the related data should be moved on disk.

Since we had to do this a few times already, there are now some helpers to help with this.

To use this you can include Gitlab::Database::RenameReservedPathsMigration::V1 in your migration. This provides 3 methods which you can pass one or more paths that need to be rejected.

  • rename_root_paths: Renames the path of all namespaces with the given name that don't have a parent_id.
  • rename_child_paths: Renames the path of all namespaces with the given name that have a parent_id.
  • rename_wildcard_paths: Renames the path of all projects, and all namespaces that have a project_id.

The path column for these rows are renamed to their previous value followed by an integer. For example: users would turn into users0

Using models in migrations (discouraged)

The use of models in migrations is generally discouraged. As such models are contraindicated for background migrations, the model needs to be declared in the migration.

If using a model in the migrations, you should first clear the column cache using reset_column_information.

If using a model that leverages single table inheritance (STI), there are special considerations.

This avoids problems where a column that you are using was altered and cached in a previous migration.

Example: Add a column my_column to the users table

It is important not to leave out the User.reset_column_information command, in order to ensure that the old schema is dropped from the cache and ActiveRecord loads the updated schema information.

class AddAndSeedMyColumn < ActiveRecord::Migration[6.0]
  class User < ActiveRecord::Base
    self.table_name = 'users'
  end

  def up
    User.count # Any ActiveRecord calls on the model that caches the column information.

    add_column :users, :my_column, :integer, default: 1

    User.reset_column_information # The old schema is dropped from the cache.
    User.find_each do |user|
      user.my_column = 42 if some_condition # ActiveRecord sees the correct schema here.
      user.save!
    end
  end
end

The underlying table is modified and then accessed via ActiveRecord.

Note that this also needs to be used if the table is modified in a previous, different migration, if both migrations are run in the same db:migrate process.

This results in the following. Note the inclusion of my_column:

== 20200705232821 AddAndSeedMyColumn: migrating ==============================
D, [2020-07-06T00:37:12.483876 #130101] DEBUG -- :    (0.2ms)  BEGIN
D, [2020-07-06T00:37:12.521660 #130101] DEBUG -- :    (0.4ms)  SELECT COUNT(*) FROM "user"
-- add_column(:users, :my_column, :integer, {:default=>1})
D, [2020-07-06T00:37:12.523309 #130101] DEBUG -- :    (0.8ms)  ALTER TABLE "users" ADD "my_column" integer DEFAULT 1
   -> 0.0016s
D, [2020-07-06T00:37:12.650641 #130101] DEBUG -- :   AddAndSeedMyColumn::User Load (0.7ms)  SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT $1  [["LIMIT", 1000]]
D, [2020-07-18T00:41:26.851769 #459802] DEBUG -- :   AddAndSeedMyColumn::User Update (1.1ms)  UPDATE "users" SET "my_column" = $1, "updated_at" = $2 WHERE "users"."id" = $3  [["my_column", 42], ["updated_at", "2020-07-17 23:41:26.849044"], ["id", 1]]
D, [2020-07-06T00:37:12.653648 #130101] DEBUG -- :   ↳ config/initializers/config_initializers_active_record_locking.rb:13:in `_update_row'
== 20200705232821 AddAndSeedMyColumn: migrated (0.1706s) =====================

If you skip clearing the schema cache (User.reset_column_information), the column is not used by ActiveRecord and the intended changes are not made, leading to the result below, where my_column is missing from the query.

== 20200705232821 AddAndSeedMyColumn: migrating ==============================
D, [2020-07-06T00:37:12.483876 #130101] DEBUG -- :    (0.2ms)  BEGIN
D, [2020-07-06T00:37:12.521660 #130101] DEBUG -- :    (0.4ms)  SELECT COUNT(*) FROM "user"
-- add_column(:users, :my_column, :integer, {:default=>1})
D, [2020-07-06T00:37:12.523309 #130101] DEBUG -- :    (0.8ms)  ALTER TABLE "users" ADD "my_column" integer DEFAULT 1
   -> 0.0016s
D, [2020-07-06T00:37:12.650641 #130101] DEBUG -- :   AddAndSeedMyColumn::User Load (0.7ms)  SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT $1  [["LIMIT", 1000]]
D, [2020-07-06T00:37:12.653459 #130101] DEBUG -- :   AddAndSeedMyColumn::User Update (0.5ms)  UPDATE "users" SET "updated_at" = $1 WHERE "users"."id" = $2  [["updated_at", "2020-07-05 23:37:12.652297"], ["id", 1]]
D, [2020-07-06T00:37:12.653648 #130101] DEBUG -- :   ↳ config/initializers/config_initializers_active_record_locking.rb:13:in `_update_row'
== 20200705232821 AddAndSeedMyColumn: migrated (0.1706s) =====================

High traffic tables

Here's a list of current high-traffic tables.

Determining what tables are high-traffic can be difficult. Self-managed instances might use different features of GitLab with different usage patterns, thus making assumptions based on GitLab.com not enough.

To identify a high-traffic table for GitLab.com the following measures are considered. Note that the metrics linked here are GitLab-internal only:

Any table which has some high read operation compared to current high-traffic tables might be a good candidate.

As a general rule, we discourage adding columns to high-traffic tables that are purely for analytics or reporting of GitLab.com. This can have negative performance impacts for all self-managed instances without providing direct feature value to them.