debian-mirror-gitlab/doc/development/import_export.md

418 lines
12 KiB
Markdown
Raw Normal View History

2021-01-29 00:20:46 +05:30
---
stage: Manage
group: Import
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
---
2019-03-02 22:35:43 +05:30
# Import/Export development documentation
2020-04-22 19:07:51 +05:30
Troubleshooting and general development guidelines and tips for the [Import/Export feature](../user/project/settings/import_export.md).
2019-03-02 22:35:43 +05:30
<i class="fa fa-youtube-play youtube" aria-hidden="true"></i> This document is originally based on the [Import/Export 201 presentation available on YouTube](https://www.youtube.com/watch?v=V3i1OfExotE).
## Troubleshooting commands
Finds information about the status of the import and further logs using the JID:
```ruby
# Rails console
Project.find_by_full_path('group/project').import_state.slice(:jid, :status, :last_error)
> {"jid"=>"414dec93f941a593ea1a6894", "status"=>"finished", "last_error"=>nil}
```
2020-03-13 15:44:24 +05:30
```shell
2019-03-02 22:35:43 +05:30
# Logs
grep JID /var/log/gitlab/sidekiq/current
grep "Import/Export error" /var/log/gitlab/sidekiq/current
grep "Import/Export backtrace" /var/log/gitlab/sidekiq/current
2019-10-12 21:52:04 +05:30
tail /var/log/gitlab/gitlab-rails/importer.log
2019-03-02 22:35:43 +05:30
```
## Troubleshooting performance issues
Read through the current performance problems using the Import/Export below.
### OOM errors
2019-09-04 21:01:54 +05:30
Out of memory (OOM) errors are normally caused by the [Sidekiq Memory Killer](../administration/operations/sidekiq_memory_killer.md):
2019-03-02 22:35:43 +05:30
2020-03-13 15:44:24 +05:30
```shell
2019-12-21 20:55:43 +05:30
SIDEKIQ_MEMORY_KILLER_MAX_RSS = 2000000
SIDEKIQ_MEMORY_KILLER_HARD_LIMIT_RSS = 3000000
SIDEKIQ_MEMORY_KILLER_GRACE_TIME = 900
2019-03-02 22:35:43 +05:30
```
2019-12-21 20:55:43 +05:30
An import status `started`, and the following Sidekiq logs will signal a memory issue:
2019-03-02 22:35:43 +05:30
2020-03-13 15:44:24 +05:30
```shell
2019-03-02 22:35:43 +05:30
WARN: Work still in progress <struct with JID>
```
### Timeouts
2020-06-23 00:09:42 +05:30
Timeout errors occur due to the `Gitlab::Import::StuckProjectImportJobsWorker` marking the process as failed:
2019-03-02 22:35:43 +05:30
```ruby
2020-06-23 00:09:42 +05:30
module Gitlab
module Import
class StuckProjectImportJobsWorker
include Gitlab::Import::StuckImportJob
# ...
end
end
end
2019-03-02 22:35:43 +05:30
2020-06-23 00:09:42 +05:30
module Gitlab
module Import
module StuckImportJob
# ...
IMPORT_JOBS_EXPIRATION = 15.hours.to_i
# ...
def perform
stuck_imports_without_jid_count = mark_imports_without_jid_as_failed!
stuck_imports_with_jid_count = mark_imports_with_jid_as_failed!
track_metrics(stuck_imports_with_jid_count, stuck_imports_without_jid_count)
end
# ...
end
end
end
2019-03-02 22:35:43 +05:30
```
2020-03-13 15:44:24 +05:30
```shell
2019-03-02 22:35:43 +05:30
Marked stuck import jobs as failed. JIDs: xyz
```
2020-04-22 19:07:51 +05:30
```plaintext
2019-03-02 22:35:43 +05:30
+-----------+ +-----------------------------------+
|Export Job |--->| Calls ActiveRecord `as_json` and |
+-----------+ | `to_json` on all project models |
+-----------------------------------+
2019-07-07 11:18:12 +05:30
2019-03-02 22:35:43 +05:30
+-----------+ +-----------------------------------+
|Import Job |--->| Loads all JSON in memory, then |
+-----------+ | inserts into the DB in batches |
+-----------------------------------+
```
### Problems and solutions
| Problem | Possible solutions |
| -------- | -------- |
2020-04-08 14:13:33 +05:30
| [Slow JSON](https://gitlab.com/gitlab-org/gitlab/-/issues/25251) loading/dumping models from the database | [split the worker](https://gitlab.com/gitlab-org/gitlab/-/issues/25252) |
2019-03-02 22:35:43 +05:30
| | Batch export
| | Optimize SQL
| | Move away from `ActiveRecord` callbacks (difficult)
2020-04-08 14:13:33 +05:30
| High memory usage (see also some [analysis](https://gitlab.com/gitlab-org/gitlab/-/issues/18857) | DB Commit sweet spot that uses less memory |
2019-03-02 22:35:43 +05:30
| | [Netflix Fast JSON API](https://github.com/Netflix/fast_jsonapi) may help |
| | Batch reading/writing to disk and any SQL
### Temporary solutions
While the performance problems are not tackled, there is a process to workaround
importing big projects, using a foreground import:
2020-06-23 00:09:42 +05:30
[Foreground import](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/5384) of big projects for customers.
2019-03-02 22:35:43 +05:30
(Using the import template in the [infrastructure tracker](https://gitlab.com/gitlab-com/gl-infra/infrastructure/))
## Security
The Import/Export feature is constantly updated (adding new things to export), however
2019-12-21 20:55:43 +05:30
the code hasn't been refactored in a long time. We should perform a code audit (see
2020-06-23 00:09:42 +05:30
[confidential issue](../user/project/issues/confidential_issues.md) `https://gitlab.com/gitlab-org/gitlab/-/issues/20720`).
2019-03-02 22:35:43 +05:30
to make sure its dynamic nature does not increase the number of security concerns.
### Security in the code
Some of these classes provide a layer of security to the Import/Export.
The `AttributeCleaner` removes any prohibited keys:
```ruby
# AttributeCleaner
# Removes all `_ids` and other prohibited keys
class AttributeCleaner
ALLOWED_REFERENCES = RelationFactory::PROJECT_REFERENCES + RelationFactory::USER_REFERENCES + ['group_id']
2019-07-07 11:18:12 +05:30
2019-03-02 22:35:43 +05:30
def clean
@relation_hash.reject do |key, _value|
prohibited_key?(key) || !@relation_class.attribute_method?(key) || excluded_key?(key)
end.except('id')
end
2019-07-07 11:18:12 +05:30
2019-03-02 22:35:43 +05:30
...
```
The `AttributeConfigurationSpec` checks and confirms the addition of new columns:
```ruby
# AttributeConfigurationSpec
<<-MSG
It looks like #{relation_class}, which is exported using the project Import/Export, has new attributes:
Please add the attribute(s) to SAFE_MODEL_ATTRIBUTES if you consider this can be exported.
Otherwise, please blacklist the attribute(s) in IMPORT_EXPORT_CONFIG by adding it to its correspondent
model in the +excluded_attributes+ section.
SAFE_MODEL_ATTRIBUTES: #{File.expand_path(safe_attributes_file)}
IMPORT_EXPORT_CONFIG: #{Gitlab::ImportExport.config_file}
2019-07-07 11:18:12 +05:30
MSG
2019-03-02 22:35:43 +05:30
```
The `ModelConfigurationSpec` checks and confirms the addition of new models:
```ruby
# ModelConfigurationSpec
<<-MSG
New model(s) <#{new_models.join(',')}> have been added, related to #{parent_model_name}, which is exported by
the Import/Export feature.
If you think this model should be included in the export, please add it to `#{Gitlab::ImportExport.config_file}`.
Definitely add it to `#{File.expand_path(ce_models_yml)}`
to signal that you've handled this error and to prevent it from showing up in the future.
MSG
```
The `ExportFileSpec` detects encrypted or sensitive columns:
```ruby
# ExportFileSpec
<<-MSG
2019-07-07 11:18:12 +05:30
Found a new sensitive word <#{key_found}>, which is part of the hash #{parent.inspect}
2019-03-02 22:35:43 +05:30
If you think this information shouldn't get exported, please exclude the model or attribute in
IMPORT_EXPORT_CONFIG.
Otherwise, please add the exception to +safe_list+ in CURRENT_SPEC using #{sensitive_word} as the
key and the correspondent hash or model as the value.
Also, if the attribute is a generated unique token, please add it to RelationFactory::TOKEN_RESET_MODELS
if it needs to be reset (to prevent duplicate column problems while importing to the same instance).
IMPORT_EXPORT_CONFIG: #{Gitlab::ImportExport.config_file}
CURRENT_SPEC: #{__FILE__}
MSG
```
## Versioning
Import/Export does not use strict SemVer, since it has frequent constant changes
during a single GitLab release. It does require an update when there is a breaking change.
```ruby
# ImportExport
module Gitlab
module ImportExport
extend self
# For every version update, the version history in import_export.md has to be kept up to date.
VERSION = '0.2.4'
```
## Version history
2020-06-23 00:09:42 +05:30
Check the [version history](../user/project/settings/import_export.md#version-history)
for compatibility when importing and exporting projects.
2019-03-02 22:35:43 +05:30
### When to bump the version up
2020-01-01 13:55:28 +05:30
We will have to bump the version if we rename model/columns or perform any format
2019-03-02 22:35:43 +05:30
modifications in the JSON structure or the file structure of the archive file.
We do not need to bump the version up in any of the following cases:
- Add a new column or a model
- Remove a column or model (unless there is a DB constraint)
- Export new things (such as a new type of upload)
Every time we bump the version, the integration specs will fail and can be fixed with:
2020-03-13 15:44:24 +05:30
```shell
2019-03-02 22:35:43 +05:30
bundle exec rake gitlab:import_export:bump_version
```
## A quick dive into the code
### Import/Export configuration (`import_export.yml`)
The main configuration `import_export.yml` defines what models can be exported/imported.
Model relationships to be included in the project import/export:
```yaml
project_tree:
- labels:
2019-09-04 21:01:54 +05:30
- :priorities
2019-03-02 22:35:43 +05:30
- milestones:
- events:
- :push_event_payload
- issues:
- events:
2020-11-24 15:15:51 +05:30
# ...
2019-03-02 22:35:43 +05:30
```
Only include the following attributes for the models specified:
```yaml
included_attributes:
user:
- :id
- :email
2020-11-24 15:15:51 +05:30
# ...
2019-03-02 22:35:43 +05:30
```
Do not include the following attributes for the models specified:
```yaml
excluded_attributes:
project:
- :name
- :path
- ...
```
Extra methods to be called by the export:
```yaml
# Methods
methods:
labels:
- :type
label:
- :type
```
### Import
The import job status moves from `none` to `finished` or `failed` into different states:
_import\_status_: none -> scheduled -> started -> finished/failed
While the status is `started` the `Importer` code processes each step required for the import.
```ruby
# ImportExport::Importer
module Gitlab
module ImportExport
class Importer
def execute
if import_file && check_version! && restorers.all?(&:restore) && overwrite_project
2019-12-21 20:55:43 +05:30
project
2019-03-02 22:35:43 +05:30
else
raise Projects::ImportService::Error.new(@shared.errors.join(', '))
end
rescue => e
raise Projects::ImportService::Error.new(e.message)
ensure
remove_import_file
end
2019-07-07 11:18:12 +05:30
2019-03-02 22:35:43 +05:30
def restorers
[repo_restorer, wiki_restorer, project_tree, avatar_restorer,
uploads_restorer, lfs_restorer, statistics_restorer]
end
```
The export service, is similar to the `Importer`, restoring data instead of saving it.
### Export
```ruby
# ImportExport::ExportService
module Projects
module ImportExport
class ExportService < BaseService
def save_all!
if save_services
Gitlab::ImportExport::Saver.save(project: project, shared: @shared)
notify_success
else
cleanup_and_notify_error!
end
end
def save_services
2019-07-07 11:18:12 +05:30
[version_saver, avatar_saver, project_tree_saver, uploads_saver, repo_saver,
2019-03-02 22:35:43 +05:30
wiki_repo_saver, lfs_saver].all?(&:save)
end
```
2020-05-24 23:13:21 +05:30
## Test fixtures
Fixtures used in Import/Export specs live in `spec/fixtures/lib/gitlab/import_export`. There are both Project and Group fixtures.
There are two versions of each of these fixtures:
- A human readable single JSON file with all objects, called either `project.json` or `group.json`.
- A folder named `tree`, containing a tree of files in `ndjson` format. **Please do not edit files under this folder manually unless strictly necessary.**
The tools to generate the NDJSON tree from the human-readable JSON files live in the [`gitlab-org/memory-team/team-tools`](https://gitlab.com/gitlab-org/memory-team/team-tools/-/blob/master/import-export/) project.
### Project
**Please use `legacy-project-json-to-ndjson.sh` to generate the NDJSON tree.**
The NDJSON tree will look like this:
```shell
tree
├── project
2021-01-29 00:20:46 +05:30
│ ├── auto_devops.ndjson
│ ├── boards.ndjson
│ ├── ci_cd_settings.ndjson
│ ├── ci_pipelines.ndjson
│ ├── container_expiration_policy.ndjson
│ ├── custom_attributes.ndjson
│ ├── error_tracking_setting.ndjson
│ ├── external_pull_requests.ndjson
│ ├── issues.ndjson
│ ├── labels.ndjson
│ ├── merge_requests.ndjson
│ ├── milestones.ndjson
│ ├── pipeline_schedules.ndjson
│ ├── project_badges.ndjson
│ ├── project_feature.ndjson
│ ├── project_members.ndjson
│ ├── protected_branches.ndjson
│ ├── protected_tags.ndjson
│ ├── releases.ndjson
│ ├── services.ndjson
│ ├── snippets.ndjson
│ └── triggers.ndjson
2020-05-24 23:13:21 +05:30
└── project.json
```
### Group
**Please use `legacy-group-json-to-ndjson.rb` to generate the NDJSON tree.**
The NDJSON tree will look like this:
```shell
tree
└── groups
├── 4351
2021-01-29 00:20:46 +05:30
│ ├── badges.ndjson
│ ├── boards.ndjson
│ ├── epics.ndjson
│ ├── labels.ndjson
│ ├── members.ndjson
│ └── milestones.ndjson
2020-05-24 23:13:21 +05:30
├── 4352
2021-01-29 00:20:46 +05:30
│ ├── badges.ndjson
│ ├── boards.ndjson
│ ├── epics.ndjson
│ ├── labels.ndjson
│ ├── members.ndjson
│ └── milestones.ndjson
2020-05-24 23:13:21 +05:30
├── _all.ndjson
├── 4351.json
└── 4352.json
```
2020-07-28 23:09:34 +05:30
CAUTION: **Caution:**
When updating these fixtures, please ensure you update both `json` files and `tree` folder, as the tests apply to both.