debian-mirror-gitlab/doc/administration/pseudonymizer.md

---
stage: Enablement
group: Distribution
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Pseudonymizer **(ULTIMATE)**

Your GitLab database contains sensitive information. To protect sensitive information
when you run analytics on your database, you can use the Pseudonymizer service, which:

1. Uses `HMAC(SHA256)` to mutate fields containing sensitive information.
1. Preserves references (referential integrity) between fields.
1. Exports your GitLab data, scrubbed of sensitive material.

WARNING:
If the source data is available, users can compare and correlate the scrubbed data
with the original.
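
As a minimal sketch of the properties listed above, the following Ruby snippet applies a keyed HMAC-SHA256 digest to illustrative records. The `SECRET` value, the `pseudonymize` helper, and the sample tables are assumptions made for this example and are not taken from the Pseudonymizer source. The same input always maps to the same digest, so references between tables survive. This is also why someone who holds the original data can line records up against the scrubbed export.

```ruby
require 'openssl'

# Hypothetical secret for illustration only; the real service manages its own key.
SECRET = 'example-secret-key'

# Returns a stable, keyed SHA256 digest (64 hex characters) for a value.
def pseudonymize(value, key = SECRET)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), key, value.to_s)
end

# Two illustrative tables that reference the same user ID.
users  = [{ id: 42, email: 'alice@example.com' }]
issues = [{ id: 1, author_id: 42 }, { id: 2, author_id: 42 }]

scrubbed_users = users.map do |user|
  { id: pseudonymize(user[:id]), email: pseudonymize(user[:email]) }
end

scrubbed_issues = issues.map do |issue|
  { id: issue[:id], author_id: pseudonymize(issue[:author_id]) }
end

# Every scrubbed issue still points at the scrubbed user record.
intact = scrubbed_issues.all? { |issue| issue[:author_id] == scrubbed_users.first[:id] }
puts "References preserved: #{intact}"
```

Changing the key changes every digest, so two exports made with different keys cannot be joined to each other.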

To generate a pseudonymized data set:

1. [Configure Pseudonymizer](#configure-pseudonymizer) fields and output location.
1. [Enable Pseudonymizer data collection](#enable-pseudonymizer-data-collection).
1. Optional. [Generate a data set manually](#generate-data-set-manually).

## Configure Pseudonymizer

To use the Pseudonymizer, configure both the fields you want to pseudonymize and the location to
store the scrubbed data:

1. **Create a manifest file**: This file describes the fields to include or pseudonymize.
   - **Default manifest** - GitLab provides a default manifest in your GitLab installation
     ([example `manifest.yml` file](https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/pseudonymizer.yml)).
     To use the example manifest file, use the `config/pseudonymizer.yml` relative path
     when you configure connection parameters.
   - **Custom manifest** - To use a custom manifest file, use the absolute path to
     the file when you configure the connection parameters.
1. **Configure connection parameters**: In the configuration method appropriate for
   your version of GitLab, specify the [object storage](object_storage.md)
   connection parameters (`pseudonymizer.upload.connection`).

**For Omnibus installations:**

1. Edit `/etc/gitlab/gitlab.rb` and add the following lines, replacing the example
   values with your own:

   ```ruby
   gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
   gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
   gitlab_rails['pseudonymizer_upload_connection'] = {
     'provider' => 'AWS',
     'region' => 'eu-central-1',
     'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
     'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
   }
   ```

   If you are using AWS IAM profiles, omit the `aws_access_key_id` and
   `aws_secret_access_key` key/value pairs and set `use_iam_profile` instead:

   ```ruby
   gitlab_rails['pseudonymizer_upload_connection'] = {
     'provider' => 'AWS',
     'region' => 'eu-central-1',
     'use_iam_profile' => true
   }
   ```

1. Save the file and [reconfigure GitLab](restart_gitlab.md#omnibus-gitlab-reconfigure)
   for the changes to take effect.

---

**For installations from source:**

1. Edit `/home/git/gitlab/config/gitlab.yml` and add or amend the following
   lines:

   ```yaml
   pseudonymizer:
     manifest: config/pseudonymizer.yml
     upload:
       remote_directory: 'gitlab-elt' # bucket name
       connection:
         provider: AWS
         aws_access_key_id: AWS_ACCESS_KEY_ID
         aws_secret_access_key: AWS_SECRET_ACCESS_KEY
         region: eu-central-1
   ```

1. Save the file and [restart GitLab](restart_gitlab.md#installations-from-source)
   for the changes to take effect.

## Enable Pseudonymizer data collection

To enable data collection:

1. On the top bar, select **Menu > Admin**.
1. On the left sidebar, select **Settings > Metrics and Profiling**, then expand
   **Pseudonymizer data collection**.
1. Select **Enable Pseudonymizer data collection**.
1. Select **Save changes**.

## Generate data set manually

You can also run the Pseudonymizer manually:

1. Set these environment variables:
   - `PSEUDONYMIZER_OUTPUT_DIR` - Where to store the output CSV files. Defaults to `/tmp`.
     These commands produce CSV files that can be quite large. Make sure the directory
     can store a file at least 10% of the size of your database.
   - `PSEUDONYMIZER_BATCH` - The batch size when querying the database. Defaults to `100000`.
1. Run the command appropriate for your application:
   - **Omnibus GitLab**:
     `sudo gitlab-rake gitlab:db:pseudonymizer`
   - **Installations from source**:
     `sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production`

After you run the command, upload the output CSV files to your configured object
storage. After the upload completes, delete the output files from the local disk.
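
Before a manual run, it can help to confirm which values the two environment variables resolve to and how much space the 10% guideline implies. The snippet below is a rough sketch only. The `pseudonymizer_settings` and `minimum_output_bytes` helpers are hypothetical names invented for this example, and the Rake task may read its settings differently.

```ruby
# Resolves the documented variables, falling back to the documented defaults.
# Accepts any object that responds to #fetch, such as ENV or a Hash.
def pseudonymizer_settings(env = ENV)
  {
    output_dir: env.fetch('PSEUDONYMIZER_OUTPUT_DIR', '/tmp'),
    batch_size: Integer(env.fetch('PSEUDONYMIZER_BATCH', '100000'))
  }
end

# Hypothetical helper: 10% of the database size, rounded up to a whole byte.
def minimum_output_bytes(database_bytes)
  (database_bytes + 9) / 10
end

settings = pseudonymizer_settings
puts "Output directory: #{settings[:output_dir]}"
puts "Batch size: #{settings[:batch_size]}"

# Example figure only: a 50 GiB database.
puts "Suggested free space: #{minimum_output_bytes(50 * 1024**3)} bytes"
```

The 10% figure is a lower bound from the guidance above, so leave extra headroom when the output directory shares a disk with other data.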

## Related topics

- [Using object storage with GitLab](object_storage.md).