debian-mirror-gitlab/doc/development/uploads/background.md
2022-05-07 20:08:51 +05:30


Uploads guide: Why GitLab uses custom upload logic

This page is for developers trying to better understand the history behind GitLab uploads and the technical challenges associated with uploads.

Problem description

GitLab and GitLab Workhorse use special rules for handling file uploads, because in an ordinary Rails application file uploads become expensive as files grow in size. Rails often trades performance for a better developer experience, and its handling of multipart/form-data uploads is one example. In any Rack server, Rails applications included, when such a request arrives at the application server, several things happen:

  1. A Rack middleware intercepts the request and parses the request body.
  2. The middleware writes each file in the multipart request to a temporary directory on disk.
  3. A params hash is constructed with entries pointing to the respective files on disk.
  4. A Rails controller acts on the file contents.
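
The first three steps above can be sketched as follows. This is a minimal illustration written in Python rather than Ruby, and it is not how Rack itself is implemented: `buffer_multipart` and its arguments are hypothetical names, and the standard-library `email` parser stands in for a real multipart parser.

```python
import os
import tempfile
from email import message_from_bytes


def buffer_multipart(body: bytes, boundary: str, tmp_dir: str) -> dict:
    """Parse a multipart/form-data body and buffer each file part to disk.

    Hypothetical helper: it mimics what a Rack-style middleware does,
    not how Rack is actually implemented.
    """
    # Step 1: parse the request body. The email parser needs a
    # Content-Type header to learn the boundary, so we prepend one.
    header = f"Content-Type: multipart/form-data; boundary={boundary}\r\n\r\n"
    message = message_from_bytes(header.encode("ascii") + body)

    params = {}
    for part in message.get_payload():
        name = part.get_param("name", header="content-disposition")
        filename = part.get_filename()
        content = part.get_payload(decode=True)
        if filename is None:
            # Ordinary form field: keep the value in memory.
            params[name] = content.decode("utf-8")
            continue
        # Step 2: write the file to a temporary directory on disk.
        fd, path = tempfile.mkstemp(dir=tmp_dir)
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(content)
        # Step 3: the params entry points at the file on disk.
        params[name] = {"filename": filename, "tempfile": path}
    return params
```

A controller (step 4) would then open `params["file"]["tempfile"]` to read the upload. Note that the whole file passes through the application process before any application code runs, which is the cost discussed next.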

While this is convenient for developers, it is costly for the Ruby server process to buffer large files to disk. Because of Ruby's global interpreter lock, only one thread of a given Ruby process can execute Ruby code at a time. CPU time spent on buffering is therefore not available to other worker threads serving user requests. Buffering files to disk also means spending more time in I/O routines and mode switches, which are expensive operations.

The following diagram shows how GitLab handled such a request prior to putting optimizations in place.

graph TB
    subgraph "load balancers"
      LB(Proxy)
    end

    subgraph "Shared storage"
       nfs(NFS)
    end

    subgraph "redis cluster"
       r(persisted redis)
    end
    LB-- 1 -->Workhorse

    subgraph "web or API fleet"
      Workhorse-- 2 -->rails
    end
    rails-- "3 (write files)" -->nfs
    rails-- "4 (schedule a job)" -->r

    subgraph sidekiq
      s(sidekiq)
    end
    s-- "5 (fetch a job)" -->r
    s-- "6 (read files)" -->nfs

We went through two major iterations of our uploads architecture to improve on these problems:

  1. Moving disk buffering to Workhorse.
  2. Uploading to Object Storage from Workhorse.

Moving disk buffering to Workhorse

To address the performance issues resulting from buffering files in Ruby, we moved this logic to Workhorse, the reverse proxy that fronts the GitLab Rails application. Workhorse is written in Go, and is much better at dealing with stream processing and I/O than Rails.

There are two parts to this implementation:

  1. In Workhorse, a request handler detects multipart/form-data content in an incoming user request. If such a request is detected, Workhorse hijacks the request body before forwarding it to Rails. Workhorse writes all files to disk, rewrites the multipart form fields to point to the new locations, signs the request, then forwards it to Rails.
  2. In Rails, a custom multipart Rack middleware identifies any signed multipart requests coming from Workhorse and prepares the params hash Rails would expect, now pointing to the files cached by Workhorse. This makes it a drop-in replacement for Rack::Multipart.
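
The division of labor between the two parts can be sketched as follows. This is a simplified illustration in Python, not the Workhorse or Rails code. The text above only states that Workhorse rewrites the fields and signs the request. The HMAC-SHA256 scheme, the `file.path` field naming, and all function names below are assumptions made for the example.

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"example-secret"  # assumption: a secret known to both sides


def sign(fields: dict) -> str:
    """Return an HMAC-SHA256 signature over the rewritten form fields."""
    payload = json.dumps(fields, sort_keys=True).encode("utf-8")
    return hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()


def proxy_rewrite(fields: dict, files: dict, save_file) -> dict:
    """Proxy side: replace file contents with paths, then sign the result.

    `save_file` is a hypothetical callback that stores the bytes
    and returns the path they were written to.
    """
    rewritten = dict(fields)
    for name, content in files.items():
        rewritten[f"{name}.path"] = save_file(name, content)
    return {"fields": rewritten, "signature": sign(rewritten)}


def app_params(request: dict) -> dict:
    """Application side: accept rewritten fields only if the signature matches."""
    expected = sign(request["fields"])
    if not hmac.compare_digest(expected, request["signature"]):
        raise PermissionError("multipart fields were not signed by the proxy")
    return request["fields"]
```

The signature matters because the application now trusts file paths found in the request. Without it, a client could submit a form field that points at an arbitrary file on the server.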

The diagram below shows how GitLab handles such a request today:

graph TB
    subgraph "load balancers"
      LB(HA Proxy)
    end

    subgraph "Shared storage"
       nfs(NFS)
    end

    subgraph "redis cluster"
       r(persisted redis)
    end
    LB-- 1 -->Workhorse

    subgraph "web or API fleet"
      Workhorse-- "3 (without files)" -->rails
    end
    Workhorse -- "2 (write files)" -->nfs
    rails-- "4 (schedule a job)" -->r

    subgraph sidekiq
      s(sidekiq)
    end
    s-- "5 (fetch a job)" -->r
    s-- "6 (read files)" -->nfs

While this "one-size-fits-all" solution greatly improves performance for multipart uploads without compromising developer ergonomics, it severely limits GitLab availability and scalability.

Availability challenges

Moving file buffering to Workhorse addresses the immediate performance problems stemming from Ruby not being good at handling large file uploads. However, this solution still relies on attached storage, whether ordinary hard drives or network-attached storage like NFS. NFS is a single point of failure, and is unsuitable for deploying GitLab in highly available, cloud native environments.

Scalability challenges

NFS is not a part of cloud native installations, such as those running in Kubernetes. In Kubernetes, machine boundaries translate to pods, and without network-attached storage, disk-buffered uploads must be written directly to the pod's file system.

Using disk buffering presents us with a scalability challenge here. If Workhorse can only write files to a pod's private file system, then these files are inaccessible outside of this particular pod. With disk buffering, a Rails controller will accept a file upload and enqueue it for upload in a Sidekiq background job. Therefore, Sidekiq requires access to these files. However, in a cloud native environment all Sidekiq instances run on separate pods, so they are not able to access files buffered to disk on a web server pod.

Therefore, all features that involve Sidekiq uploading disk-buffered files severely limit the scalability of GitLab.
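
The failure mode can be reproduced in miniature. In the sketch below, two temporary directories stand in for the private file systems of a web pod and a Sidekiq pod. The function names and the `uploads/` layout are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def buffer_upload(pod_root: str, name: str, content: bytes) -> str:
    """Write an upload below a pod's private root; return the pod-relative path."""
    relative_path = os.path.join("uploads", name)
    full_path = os.path.join(pod_root, relative_path)
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, "wb") as f:
        f.write(content)
    return relative_path


def run_job(pod_root: str, relative_path: str) -> bytes:
    """Background job: read the buffered file from the local file system."""
    with open(os.path.join(pod_root, relative_path), "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as web_pod, \
            tempfile.TemporaryDirectory() as sidekiq_pod:
        path = buffer_upload(web_pod, "avatar.png", b"image bytes")
        try:
            run_job(sidekiq_pod, path)  # the job runs on a different pod
        except FileNotFoundError:
            print("file is not visible from the Sidekiq pod")
```

The job only succeeds when it runs against the same root the file was written to, which is exactly the coupling that shared storage such as NFS used to hide.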

Moving to object storage and direct uploads

To address these availability and scalability problems, instead of buffering files to disk, we added support for uploading files directly from Workhorse to a given destination. While it remains possible to upload to local or network-attached storage this way, you should use a highly available object store, such as AWS S3, Google Cloud Storage, or Azure Blob Storage, for scalability reasons.

With direct uploads, Workhorse does not buffer files to disk. Instead, it first authorizes the request with the Rails application to find out where to upload it, then streams the file directly to its ultimate destination.
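
The authorize-then-stream flow can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Workhorse implementation: `FakeObjectStore` is an in-memory stand-in for a real object store, and `authorize`, `read_chunks`, and `direct_upload` are hypothetical names.

```python
import io


class FakeObjectStore:
    """In-memory stand-in for an object store (hypothetical, for illustration)."""

    def __init__(self):
        self.objects = {}

    def put_stream(self, key: str, chunks) -> int:
        """Store an object from an iterable of byte chunks; return its size."""
        buffer = io.BytesIO()
        for chunk in chunks:
            buffer.write(chunk)
        self.objects[key] = buffer.getvalue()
        return len(self.objects[key])


def authorize(user: str, filename: str) -> dict:
    """Hypothetical authorization step answered by the application."""
    if user != "alice":
        raise PermissionError("upload not authorized")
    return {"key": f"uploads/{user}/{filename}"}


def read_chunks(stream, chunk_size: int = 4):
    """Yield the request body piece by piece instead of loading it whole."""
    while chunk := stream.read(chunk_size):
        yield chunk


def direct_upload(user: str, filename: str, body_stream, store) -> str:
    # First ask the application where the file should go.
    destination = authorize(user, filename)
    # Then stream the body to that destination; no temporary file is written.
    store.put_stream(destination["key"], read_chunks(body_stream))
    return destination["key"]
```

Because authorization happens before any bytes are transferred, a rejected request costs the proxy almost nothing, and an accepted one never touches the local disk.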

To learn more about how disk buffering and direct uploads are implemented, see: