debian-mirror-gitlab/doc/administration/pseudonymizer.md

109 lines
3.9 KiB
Markdown
Raw Normal View History

2021-01-29 00:20:46 +05:30
---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
---
2019-09-30 21:07:59 +05:30
# Pseudonymizer **(ULTIMATE)**
2019-07-31 22:56:46 +05:30
2020-05-24 23:13:21 +05:30
> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/5532) in [GitLab Ultimate](https://about.gitlab.com/pricing/) 11.1.
2019-07-31 22:56:46 +05:30
As GitLab's database hosts sensitive information, using it unfiltered for analytics
implies high security requirements. To help alleviate this constraint, the Pseudonymizer
service is used to export GitLab's data in a pseudonymized way.
CAUTION: **Warning:**
This process is not impervious. If the source data is available, it's possible for
a user to correlate data to the pseudonymized version.
The Pseudonymizer currently uses `HMAC(SHA256)` to mutate fields that shouldn't
be textually exported. This ensures that:
- the end-user of the data source cannot infer/revert the pseudonymized fields
- the referential integrity is maintained
## Configuration
To configure the pseudonymizer, you need to:
- Provide a manifest file that describes which fields should be included or
2019-12-04 20:38:33 +05:30
pseudonymized ([example `manifest.yml` file](https://gitlab.com/gitlab-org/gitlab/tree/master/config/pseudonymizer.yml)).
2019-09-30 21:07:59 +05:30
A default manifest is provided with the GitLab installation. Using a relative file path will be resolved from the Rails root.
2019-07-31 22:56:46 +05:30
Alternatively, you can use an absolute file path.
2019-09-30 21:07:59 +05:30
- Use an object storage and specify the connection parameters in the `pseudonymizer.upload.connection` configuration option.
2019-07-31 22:56:46 +05:30
2020-04-22 19:07:51 +05:30
[Read more about using object storage with GitLab](object_storage.md).
2019-07-31 22:56:46 +05:30
**For Omnibus installations:**
1. Edit `/etc/gitlab/gitlab.rb` and add the following lines by replacing with
the values you want:
2019-09-30 21:07:59 +05:30
```ruby
gitlab_rails['pseudonymizer_manifest'] = 'config/pseudonymizer.yml'
gitlab_rails['pseudonymizer_upload_remote_directory'] = 'gitlab-elt' # bucket name
gitlab_rails['pseudonymizer_upload_connection'] = {
'provider' => 'AWS',
'region' => 'eu-central-1',
'aws_access_key_id' => 'AWS_ACCESS_KEY_ID',
'aws_secret_access_key' => 'AWS_SECRET_ACCESS_KEY'
}
```
If you are using AWS IAM profiles, be sure to omit the AWS access key and secret access key/value pairs.
```ruby
gitlab_rails['pseudonymizer_upload_connection'] = {
'provider' => 'AWS',
'region' => 'eu-central-1',
'use_iam_profile' => true
}
```
2019-07-31 22:56:46 +05:30
1. Save the file and [reconfigure GitLab](restart_gitlab.md#omnibus-gitlab-reconfigure)
for the changes to take effect.
---
**For installations from source:**
1. Edit `/home/git/gitlab/config/gitlab.yml` and add or amend the following
lines:
2019-09-30 21:07:59 +05:30
```yaml
pseudonymizer:
manifest: config/pseudonymizer.yml
upload:
remote_directory: 'gitlab-elt' # bucket name
connection:
provider: AWS
aws_access_key_id: AWS_ACCESS_KEY_ID
aws_secret_access_key: AWS_SECRET_ACCESS_KEY
region: eu-central-1
```
2019-07-31 22:56:46 +05:30
1. Save the file and [restart GitLab](restart_gitlab.md#installations-from-source)
for the changes to take effect.
## Usage
You can optionally run the pseudonymizer using the following environment variables:
- `PSEUDONYMIZER_OUTPUT_DIR` - where to store the output CSV files (defaults to `/tmp`)
- `PSEUDONYMIZER_BATCH` - the batch size when querying the DB (defaults to `100000`)
2020-03-13 15:44:24 +05:30
```shell
2019-07-31 22:56:46 +05:30
## Omnibus
sudo gitlab-rake gitlab:db:pseudonymizer
## Source
sudo -u git -H bundle exec rake gitlab:db:pseudonymizer RAILS_ENV=production
```
This will produce some CSV files that might be very large, so make sure the
`PSEUDONYMIZER_OUTPUT_DIR` has sufficient space. As a rule of thumb, at least
10% of the database size is recommended.
After the pseudonymizer has run, the output CSV files should be uploaded to the
configured object storage and deleted from the local disk.