---
stage: Enablement
group: Geo
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
type: howto
---
# Geo security review (Q&A) **(PREMIUM ONLY)**
The following security review of the Geo feature set focuses on security aspects of
the feature as they apply to customers running their own GitLab instances. The review
questions are based in part on the [OWASP Application Security Verification Standard Project](https://owasp.org/www-project-application-security-verification-standard/)
from [owasp.org](https://owasp.org/).
## Business Model
### What geographic areas does the application service?
- This varies by customer. Geo allows customers to deploy to multiple areas,
and they get to choose where they are.
- Region and node selection is entirely manual.
## Data Essentials
### What data does the application receive, produce, and process?
- Geo streams almost all data held by a GitLab instance between sites. This
includes full database replication, most files (user-uploaded attachments,
etc.), and repository and wiki data. In a typical configuration, this
happens across the public Internet and is TLS-encrypted.
- PostgreSQL replication is TLS-encrypted.
- See also: [only TLSv1.2 should be supported](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/2948)
### How can the data be classified into categories according to its sensitivity?
- The GitLab model of sensitivity is centered around public vs. internal vs.
private projects. Geo replicates them all indiscriminately. “Selective sync”
exists for files and repositories (but not database content), which would permit
only less-sensitive projects to be replicated to a **secondary** node if desired.
- See also: [GitLab data classification policy](https://about.gitlab.com/handbook/engineering/security/data-classification-standard.html).
### What data backup and retention requirements have been defined for the application?
- Geo is designed to provide replication of a certain subset of the application
data. It is part of the solution, rather than part of the problem.
## End-Users
### Who are the application's end users?
- **Secondary** nodes are created in regions that are distant (in terms of
Internet latency) from the main GitLab installation (the **primary** node). They are
intended to be used by anyone who would ordinarily use the **primary** node, who finds
that the **secondary** node is closer to them (in terms of Internet latency).
### How do the end users interact with the application?
- **Secondary** nodes provide all the interfaces a **primary** node does
(notably an HTTP/HTTPS web application, and HTTP/HTTPS or SSH Git repository
access), but are constrained to read-only activities. The principal use case is
envisioned to be cloning Git repositories from the **secondary** node instead of the
**primary** node, but end users may use the GitLab web interface to view projects,
issues, merge requests, snippets, etc.
### What security expectations do the end users have?
- The replication process must be secure. It would typically be unacceptable to
transmit the entire database contents or all files and repositories across the
public Internet in plaintext, for instance.
- **Secondary** nodes must have the same access controls over their content as the
**primary** node: unauthenticated users must not be able to gain access to privileged
information on the **primary** node by querying the **secondary** node.
- Attackers must not be able to impersonate the **secondary** node to the **primary** node, and
thus gain access to privileged information.
## Administrators
### Who has administrative capabilities in the application?
- Nothing Geo-specific. Any user where `admin: true` is set in the database is
considered an admin with super-user privileges.
- See also: [more granular access control](https://gitlab.com/gitlab-org/gitlab/-/issues/18242)
(not Geo-specific).
- Much of Geo's integration (database replication, for instance) must be
configured with the application, typically by system administrators.
### What administrative capabilities does the application offer?
- **Secondary** nodes may be added, modified, or removed by users with
administrative access.
- The replication process may be controlled (start/stop) via the Sidekiq
administrative controls.
## Network
### What details regarding routing, switching, firewalling, and load balancing have been defined?
- Geo requires the **primary** node and **secondary** node to be able to communicate with each
other across a TCP/IP network. In particular, the **secondary** nodes must be able to
access HTTP/HTTPS and PostgreSQL services on the **primary** node.
### What core network devices support the application?
- Varies from customer to customer.
### What network performance requirements exist?
- Maximum replication speed between the **primary** node and a **secondary** node is limited by the
available bandwidth between sites. No hard requirements exist: the time to complete
replication (and the ability to keep up with changes on the **primary** node) is a function
of the size of the data set, tolerance for latency, and available network
capacity.
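
The relationship between data set size, bandwidth, and replication time can be sketched as simple arithmetic. The function name `estimate_initial_sync_hours` and the `efficiency` factor (the fraction of link capacity actually usable for replication) are illustrative assumptions, not values defined by Geo:

```python
def estimate_initial_sync_hours(
    data_gib: float, bandwidth_mbps: float, efficiency: float = 0.8
) -> float:
    """Rough lower bound on the time to copy `data_gib` GiB over a link.

    `efficiency` is an assumed fraction of the nominal bandwidth that is
    usable for replication traffic; it is not a Geo setting.
    """
    if data_gib < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("sizes and bandwidth must be positive, efficiency in (0, 1]")
    total_bits = data_gib * 1024**3 * 8
    usable_bits_per_second = bandwidth_mbps * 1_000_000 * efficiency
    return total_bits / usable_bits_per_second / 3600
```

An estimate like this ignores ongoing changes on the **primary** node, so it is best read as a floor rather than a prediction.
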
### What private and public network links support the application?
- Customers choose their own networks. As sites are intended to be
geographically separated, it is envisioned that replication traffic will pass
over the public Internet in a typical deployment, but this is not a requirement.
## Systems
### What operating systems support the application?
- Geo imposes no additional restrictions on operating system (see the
[GitLab installation](https://about.gitlab.com/install/) page for more
details). However, we recommend using the operating systems listed in the [Geo documentation](../index.md#requirements-for-running-geo).
### What details regarding required OS components and lockdown needs have been defined?
- The supported installation method (Omnibus) packages most components itself.
- There are significant dependencies on the system-installed OpenSSH daemon (Geo
requires users to set up custom authentication methods) and the Omnibus or
system-provided PostgreSQL daemon (it must be configured to listen on TCP,
additional users and replication slots must be added, etc.).
- The process for dealing with security updates (for example, if there is a
significant vulnerability in OpenSSH or other services, and the customer
wants to patch those services on the OS) is identical to the non-Geo
situation: security updates to OpenSSH would be provided to the user via the
usual distribution channels. Geo introduces no delay there.
## Infrastructure Monitoring
### What network and system performance monitoring requirements have been defined?
- None specific to Geo.
### What mechanisms exist to detect malicious code or compromised application components?
- None specific to Geo.
### What network and system security monitoring requirements have been defined?
- None specific to Geo.
## Virtualization and Externalization
### What aspects of the application lend themselves to virtualization?
- All.
### What virtualization requirements have been defined for the application?
- Nothing Geo-specific, but everything in GitLab needs to have full
functionality in such an environment.
### What aspects of the product may or may not be hosted via the cloud computing model?
- GitLab is “cloud native” and this applies to Geo as much as to the rest of the
product. Deployment in clouds is a common and supported scenario.
### If applicable, what approach(es) to cloud computing will be taken (Managed Hosting versus "Pure" Cloud, a "full machine" approach such as AWS-EC2 versus a "hosted database" approach such as AWS-RDS and Azure, etc)?
- To be decided by our customers, according to their operational needs.
## Environment
### What frameworks and programming languages have been used to create the application?
- Ruby on Rails, Ruby.
### What process, code, or infrastructure dependencies have been defined for the application?
- Nothing specific to Geo.
### What databases and application servers support the application?
- PostgreSQL >= 11, Redis, Sidekiq, Puma.
### How will database connection strings, encryption keys, and other sensitive components be stored, accessed, and protected from unauthorized detection?
- There are some Geo-specific values. Some are shared secrets which must be
securely transmitted from the **primary** node to the **secondary** node at setup time. Our
documentation recommends transmitting them from the **primary** node to the system
administrator via SSH, and then back out to the **secondary** node in the same manner.
In particular, this includes the PostgreSQL replication credentials and a secret
key (`db_key_base`) which is used to decrypt certain columns in the database.
The `db_key_base` secret is stored unencrypted on the filesystem, in
`/etc/gitlab/gitlab-secrets.json`, along with a number of other secrets. There is
no at-rest protection for them.
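
Because the secrets file is stored unencrypted, filesystem permissions are the main barrier against other local users reading it. As a minimal sketch (the helper name `is_private_to_owner` is hypothetical, and this check is an illustration rather than something Geo performs), an administrator could verify that a file grants no access to group or other users:

```python
import os
import stat


def is_private_to_owner(path: str) -> bool:
    """Return True if `path` is a regular file with no group/other permissions."""
    mode = os.stat(path).st_mode
    group_or_other_bits = stat.S_IRWXG | stat.S_IRWXO
    return stat.S_ISREG(mode) and (mode & group_or_other_bits) == 0
```

For example, `is_private_to_owner("/etc/gitlab/gitlab-secrets.json")` could be run on each node after the secrets have been copied across.
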
## Data Processing
### What data entry paths does the application support?
- Data is entered via the web application exposed by GitLab itself. Some data is
also entered using system administration commands on the GitLab servers (e.g.,
`gitlab-ctl set-primary-node`).
- **Secondary** nodes also receive inputs via PostgreSQL streaming replication from the **primary** node.
### What data output paths does the application support?
- **Primary** nodes output via PostgreSQL streaming replication to the **secondary** node.
Otherwise, principally via the web application exposed by GitLab itself, and via
SSH `git clone` operations initiated by the end-user.
### How does data flow across the application's internal components?
- **Secondary** nodes and **primary** nodes interact via HTTP/HTTPS (secured with JSON web
tokens) and via PostgreSQL streaming replication.
- Within a **primary** node or **secondary** node, the single source of truth (SSOT) is the filesystem and the database
(including the Geo tracking database on a **secondary** node). The various internal components
are orchestrated to make alterations to these stores.
### What data input validation requirements have been defined?
- **Secondary** nodes must have a faithful replication of the **primary** node's data.
### What data does the application store and how?
- Git repositories and files, tracking information related to them, and the GitLab database contents.
### What data is or may need to be encrypted and what key management requirements have been defined?
- Neither **primary** nodes nor **secondary** nodes encrypt Git repository or filesystem data at
rest. A subset of database columns is encrypted at rest using the `db_otp_key`.
- The key is a static secret shared across all hosts in a GitLab deployment.
- In transit, data should be encrypted, although the application does permit
communication to proceed unencrypted. The two main transits are the **secondary** node's
replication process for PostgreSQL, and for Git repositories/files. Both should
be protected using TLS, with the keys for that managed via Omnibus per existing
configuration for end-user access to GitLab.
### What capabilities exist to detect the leakage of sensitive data?
- Comprehensive system logs exist, tracking every connection to GitLab and PostgreSQL.
### What encryption requirements have been defined for data in transit - including transmission over WAN, LAN, SecureFTP, or publicly accessible protocols such as http: and https:?
- Data must have the option to be encrypted in transit, and be secure against
both passive and active attack (e.g., MITM attacks should not be possible).
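
On the client side, resistance to active attack depends on verifying the server certificate and hostname, and on refusing old protocol versions. A minimal sketch using Python's standard `ssl` module follows; the function name `build_client_tls_context` is hypothetical, and this illustrates the requirement rather than showing how GitLab configures TLS:

```python
import ssl


def build_client_tls_context() -> ssl.SSLContext:
    """Build a client context that verifies peers and requires TLS 1.2+."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A connection made with such a context fails rather than silently falling back when the peer presents an untrusted certificate.
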
## Access
### What user privilege levels does the application support?
- Geo adds one type of privilege: **secondary** nodes can access a special Geo API to
download files over HTTP/HTTPS, and to clone repositories using HTTP/HTTPS.
### What user identification and authentication requirements have been defined?
- **Secondary** nodes identify to Geo **primary** nodes via OAuth or JWT authentication
based on the shared database (HTTP access) or a PostgreSQL replication user (for
database replication). The database replication also requires IP-based access
controls to be defined.
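
The IP-based control amounts to checking a connecting address against a list of permitted networks. PostgreSQL performs this check itself; the sketch below only illustrates the logic, with a hypothetical helper name and example addresses taken from the documentation ranges:

```python
import ipaddress
from typing import Iterable


def is_replication_client_allowed(client_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if `client_ip` falls inside any of the allowed CIDR ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
```

For instance, with `allowed_cidrs = ["203.0.113.10/32"]`, only a **secondary** node connecting from exactly that address would pass.
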
### What user authorization requirements have been defined?
- **Secondary** nodes must only be able to *read* data. They are not currently able to mutate data on the **primary** node.
### What session management requirements have been defined?
- Geo JWTs are defined to last for only two minutes before needing to be regenerated.
- Geo JWTs are generated for one of the following specific scopes:
  - Geo API access.
  - Git access.
  - LFS and File ID.
  - Upload and File ID.
  - Job Artifact and File ID.
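
The combination of a short lifetime and a single scope per token can be sketched with an HMAC-signed token. This is not Geo's actual token format: the claim names, the scope strings, and the helper functions are assumptions made for illustration, and only the two-minute lifetime comes from the text above.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

TOKEN_LIFETIME_SECONDS = 120  # two minutes, per the requirement above


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(secret: bytes, scope: str, now: Optional[float] = None) -> str:
    """Create an HS256-signed token carrying one scope and an expiry time."""
    now = time.time() if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"scope": scope, "exp": int(now) + TOKEN_LIFETIME_SECONDS}
    signing_input = (
        _b64url(json.dumps(header).encode()) + "." + _b64url(json.dumps(claims).encode())
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(
    secret: bytes, token: str, required_scope: str, now: Optional[float] = None
) -> bool:
    """Accept only tokens with a valid signature, matching scope, and future expiry."""
    now = time.time() if now is None else now
    try:
        header_b64, claims_b64, signature_b64 = token.split(".")
        claims = json.loads(_b64url_decode(claims_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError:
        return False  # malformed token, bad base64, or invalid JSON
    expected = hmac.new(
        secret, f"{header_b64}.{claims_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, signature):
        return False
    if not isinstance(claims, dict):
        return False
    exp = claims.get("exp")
    return (
        claims.get("scope") == required_scope
        and isinstance(exp, (int, float))
        and now < exp
    )
```

A token issued for one scope is rejected when presented for another, and any token is rejected once its two minutes have elapsed, which limits the value of a captured token.
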
### What access requirements have been defined for URI and Service calls?
- **Secondary** nodes make many calls to the **primary** node's API. This is how file
replication proceeds, for instance. This endpoint is only accessible with a JWT.
- The **primary** node also makes calls to the **secondary** node to get status information.
## Application Monitoring
### What application auditing requirements have been defined? How are audit and debug logs accessed, stored, and secured?
- A structured JSON log is written to the filesystem, and can also be ingested
into a Kibana installation for further analysis.
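
Because each log entry is a JSON object on its own line, entries can also be filtered with a few lines of code before or instead of ingestion. In this sketch the helper name and any field names passed to it (such as `severity`) are hypothetical; the actual fields depend on the log being read:

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_matching_entries(
    lines: Iterable[str], field: str, value: Any
) -> Iterator[Dict[str, Any]]:
    """Yield parsed log entries whose `field` equals `value`, skipping bad lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or non-JSON lines
        if isinstance(entry, dict) and entry.get(field) == value:
            yield entry
```

Passing an open file object as `lines` processes the log one line at a time without loading it all into memory.
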