---
stage: Create
group: Source Code
info: "To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments"
type: reference, howto
---
# Partial clone **(FREE)**

As Git repositories grow in size, they can become cumbersome to work with
because of:

- The large amount of history that must be downloaded.
- The large amount of disk space they require.

[Partial clone](https://github.com/git/git/blob/master/Documentation/technical/partial-clone.txt)
is a performance optimization that "allows Git to function without having a
complete copy of the repository. The goal of this work is to allow Git better
handle extremely large repositories."

Git 2.22.0 or later is required.

## Filter by file size
> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/2553) in GitLab 12.10.

Storing large binary files in Git is normally discouraged, because every large
file added is downloaded by everyone who clones or fetches changes
thereafter. These downloads are slow and problematic, especially when working from a slow
or unreliable internet connection.

Using partial clone with a file size filter solves this problem by excluding
troublesome large files from clones and fetches. When Git encounters a missing
file, it's downloaded on demand.

When cloning a repository, use the `--filter=blob:limit=<size>` argument. For example,
to clone the repository excluding files larger than 1 megabyte:
```shell
git clone --filter=blob:limit=1m git@gitlab.com:gitlab-com/www-gitlab-com.git
```
This would produce the following output:
```plaintext
Cloning into 'www-gitlab-com'...
remote: Enumerating objects: 832467, done.
remote: Counting objects: 100% (832467/832467), done.
remote: Compressing objects: 100% (207226/207226), done.
remote: Total 832467 (delta 585563), reused 826624 (delta 580099), pack-reused 0
Receiving objects: 100% (832467/832467), 2.34 GiB | 5.05 MiB/s, done.
Resolving deltas: 100% (585563/585563), done.
remote: Enumerating objects: 146, done.
remote: Counting objects: 100% (146/146), done.
remote: Compressing objects: 100% (138/138), done.
remote: Total 146 (delta 8), reused 144 (delta 8), pack-reused 0
Receiving objects: 100% (146/146), 471.45 MiB | 4.60 MiB/s, done.
Resolving deltas: 100% (8/8), done.
Updating files: 100% (13008/13008), done.
Filtering content: 100% (3/3), 131.24 MiB | 4.65 MiB/s, done.
```
The output is longer because Git:
1. Clones the repository excluding files larger than 1 megabyte.
1. Downloads any missing large files needed to check out the default branch.

When changing branches, Git may download more missing files.
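To try a size filter without cloning a large project, the following sketch builds a throwaway repository on local disk and makes a partial clone of it. It is only an illustration under stated assumptions: the scratch directory, the file names, and the 1 KiB limit are made up for the example, and `uploadpack.allowFilter` is enabled here because a plain local repository might not serve filtered clones otherwise. The exact value stored in `remote.origin.partialclonefilter` can differ between Git versions.

```shell
set -eu

# Hypothetical scratch locations for this example.
work="$(mktemp -d)"
src="$work/source"
dst="$work/partial"

# Build a small source repository with one small and one large file.
git init --quiet "$src"
printf 'hello\n' > "$src/small.txt"
head -c 4096 /dev/zero > "$src/large.bin"
git -C "$src" add .
git -C "$src" -c user.name=Example -c user.email=example@example.com \
  commit --quiet -m "Add sample files"

# Let the local source repository serve filtered clones.
git -C "$src" config uploadpack.allowFilter true

# Use a file:// URL so the clone goes through the normal transport.
# --no-checkout keeps Git from downloading large.bin on demand.
git clone --quiet --no-checkout --filter=blob:limit=1k "file://$src" "$dst"

# Git records the partial clone settings in the clone's configuration.
git -C "$dst" config --get remote.origin.promisor
git -C "$dst" config --get remote.origin.partialclonefilter

# Count the objects that were left out (rev-list prefixes them with '?').
missing="$(git -C "$dst" rev-list --objects --all --missing=print | grep -c '^?' || true)"
echo "missing objects: $missing"
```

Running `git checkout` in the clone afterwards should trigger the on-demand download described above.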
## Filter by object type

> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/2553) in GitLab 12.10.

For repositories with millions of files and a long history, you can exclude all files and use
[`git sparse-checkout`](https://git-scm.com/docs/git-sparse-checkout) to reduce the size of
your working copy.
```plaintext
# Clone the repo excluding all files
$ git clone --filter=blob:none --sparse git@gitlab.com:gitlab-com/www-gitlab-com.git
Cloning into 'www-gitlab-com'...
remote: Enumerating objects: 678296, done.
remote: Counting objects: 100% (678296/678296), done.
remote: Compressing objects: 100% (165915/165915), done.
remote: Total 678296 (delta 472342), reused 673292 (delta 467476), pack-reused 0
Receiving objects: 100% (678296/678296), 81.06 MiB | 5.74 MiB/s, done.
Resolving deltas: 100% (472342/472342), done.
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 28 (delta 0), reused 12 (delta 0), pack-reused 0
Receiving objects: 100% (28/28), 140.29 KiB | 341.00 KiB/s, done.
Updating files: 100% (28/28), done.
$ cd www-gitlab-com
$ git sparse-checkout init --cone
$ git sparse-checkout add data
remote: Enumerating objects: 301, done.
remote: Counting objects: 100% (301/301), done.
remote: Compressing objects: 100% (292/292), done.
remote: Total 301 (delta 16), reused 102 (delta 9), pack-reused 0
Receiving objects: 100% (301/301), 1.15 MiB | 608.00 KiB/s, done.
Resolving deltas: 100% (16/16), done.
Updating files: 100% (302/302), done.
```
For more details, see the Git documentation for
[`sparse-checkout`](https://git-scm.com/docs/git-sparse-checkout).
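The same sequence can be rehearsed against a small local repository before trying it on a large project. This is a minimal sketch under assumptions: the directory and file names are invented for the example, and the two `uploadpack.*` settings are applied to the local source repository so that it serves filtered clones and on-demand object requests. It also assumes a Git version that provides `git sparse-checkout`.

```shell
set -eu

# Hypothetical scratch locations for this example.
work="$(mktemp -d)"
src="$work/source"
dst="$work/sparse"

# Build a source repository with a root file and two top-level directories.
git init --quiet "$src"
mkdir -p "$src/data" "$src/docs"
printf 'readme\n' > "$src/README.md"
printf 'a\n' > "$src/data/a.yml"
printf 'b\n' > "$src/docs/b.md"
git -C "$src" add .
git -C "$src" -c user.name=Example -c user.email=example@example.com \
  commit --quiet -m "Add sample files"

# Let the local source repository serve filtered clones and on-demand fetches.
git -C "$src" config uploadpack.allowFilter true
git -C "$src" config uploadpack.allowAnySHA1InWant true

# Clone without any blobs; --sparse checks out only files in the root directory.
git clone --quiet --filter=blob:none --sparse "file://$src" "$dst"

# Add the data directory; Git fetches the blobs it needs on demand.
git -C "$dst" sparse-checkout init --cone
git -C "$dst" sparse-checkout add data

# Show the sparse-checkout directories and the resulting working copy.
git -C "$dst" sparse-checkout list
ls "$dst"
```

In this sketch the `docs` directory stays out of the working copy until it is added with `git sparse-checkout add docs`.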
## Filter by file path
Deeper integration between partial clone and sparse checkout is possible through the
`--filter=sparse:oid=<blob-ish>` filter spec. This mode of filtering uses a format similar to a
`.gitignore` file to specify which files to include when cloning and fetching.

WARNING:
Partial clone using `sparse` filters is still experimental. It might be slow and significantly increase
[Gitaly](../../administration/gitaly/index.md) resource utilization when cloning and fetching.
[Filter all blobs and use sparse-checkout](#filter-by-object-type) instead, because
[`git-sparse-checkout`](https://git-scm.com/docs/git-sparse-checkout) simplifies
this type of partial clone use and overcomes its limitations.

For more details, see the Git documentation for
[`rev-list-options`](https://git-scm.com/docs/git-rev-list#Documentation/git-rev-list.txt---filterltfilter-specgt).
1. Create a filter spec. For example, consider a monolithic repository with many applications,
   each in a different subdirectory in the root. Create a file `shiny-app/.gitfilterspec`:

   ```plaintext
   # Only the paths listed in the file will be downloaded when performing a
   # partial clone using `--filter=sparse:oid=master:shiny-app/.gitfilterspec`

   # Explicitly include filterspec needed to configure sparse checkout with
   # git config --local core.sparsecheckout true
   # git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout
   shiny-app/.gitfilterspec

   # Shiny App
   shiny-app/

   # Dependencies
   shimmery-app/
   shared-component-a/
   shared-component-b/
   ```
1. Clone and filter by path. Support for `--filter=sparse:oid` using the
   clone command is not fully integrated with sparse checkout.

   ```shell
   # Clone the filtered set of objects using the filterspec stored on the
   # server. WARNING: this step may be very slow!
   git clone --sparse --filter=sparse:oid=master:shiny-app/.gitfilterspec <url>

   # Optional: observe there are missing objects that we have not fetched
   git rev-list --all --quiet --objects --missing=print | wc -l
   ```
   WARNING:
   Git integrations for shells such as `bash` and `zsh`, and for editors that automatically
   show Git status information, often run `git fetch`, which fetches the
   entire repository. Disabling or reconfiguring these integrations might be required.
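The comments in the filter spec mention configuring sparse checkout with `core.sparsecheckout` and `.git/info/sparse-checkout`. The sketch below illustrates only that working-copy step on a throwaway local repository. It deliberately uses an ordinary full clone in place of the experimental `sparse:oid` filter, so it does not reduce what is downloaded. The `other-app` directory, the `README.md` files, and the scratch paths are made up for the example, and `HEAD` is used instead of `master` so the sketch does not depend on the default branch name.

```shell
set -eu

# Hypothetical scratch locations for this example.
work="$(mktemp -d)"
src="$work/monorepo"
dst="$work/shiny-only"

# Build a stand-in monorepo: the applications named in the filter spec,
# plus one unrelated application (other-app is a made-up name).
git init --quiet "$src"
for dir in shiny-app shimmery-app shared-component-a shared-component-b other-app; do
  mkdir -p "$src/$dir"
  printf '%s\n' "$dir" > "$src/$dir/README.md"
done
cat > "$src/shiny-app/.gitfilterspec" <<'EOF'
# Shiny App
shiny-app/
# Dependencies
shimmery-app/
shared-component-a/
shared-component-b/
EOF
git -C "$src" add .
git -C "$src" -c user.name=Example -c user.email=example@example.com \
  commit --quiet -m "Add sample applications"

# Clone without populating the working copy yet.
git clone --quiet --no-checkout "file://$src" "$dst"

# Enable sparse checkout and seed its patterns from the committed filter spec.
git -C "$dst" config --local core.sparseCheckout true
mkdir -p "$dst/.git/info"
git -C "$dst" show HEAD:shiny-app/.gitfilterspec >> "$dst/.git/info/sparse-checkout"

# Populate the working copy; only paths matching the patterns are written.
git -C "$dst" read-tree -mu HEAD
ls "$dst"
```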
## Remove partial clone filtering
Git repositories with partial clone filtering can have the filtering removed. To
remove filtering:
1. Fetch everything that has been excluded by the filters, to make sure that the
   repository is complete. If `git sparse-checkout` was used, use
   `git sparse-checkout disable` to disable it. See the
   [`disable` documentation](https://git-scm.com/docs/git-sparse-checkout#Documentation/git-sparse-checkout.txt-emdisableem)
   for more information.

   Then do a regular `fetch` to ensure that the repository is complete. To check if
   there are missing objects to fetch, and then fetch them, especially when not using
   `git sparse-checkout`, the following commands can be used:

   ```shell
   # Show missing objects
   git rev-list --objects --all --missing=print | grep -e '^\?'

   # Show missing objects without a '?' character before them (needs GNU grep)
   git rev-list --objects --all --missing=print | grep -oP '^\?\K\w+'

   # Fetch missing objects
   git fetch origin $(git rev-list --objects --all --missing=print | grep -oP '^\?\K\w+')

   # Show number of missing objects
   git rev-list --objects --all --missing=print | grep -e '^\?' | wc -l
   ```
1. Repack everything. This can be done using `git repack -a -d`, for example. This
   should leave only three files in `.git/objects/pack/`:
   - A `pack-<SHA1>.pack` file.
   - Its corresponding `pack-<SHA1>.idx` file.
   - A `pack-<SHA1>.promisor` file.
1. Delete the `.promisor` file. The above step should have left only one
   `pack-<SHA1>.promisor` file, which should be empty and should be deleted.
1. Remove partial clone configuration. The partial clone-related configuration
   variables should be removed from Git configuration files. Usually only the following
   configuration must be removed:
   - `remote.origin.promisor`.
   - `remote.origin.partialclonefilter`.
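The steps above can be rehearsed end to end on a throwaway local repository. This is a sketch under assumptions rather than a definitive procedure: the scratch paths and file name are invented, the `uploadpack.*` settings are applied so the local source repository serves filtered clones and on-demand requests, and the example is small enough that checking out the default branch already fetches every object.

```shell
set -eu

# Hypothetical scratch locations for this example.
work="$(mktemp -d)"
src="$work/source"
dst="$work/clone"

# Build a tiny source repository that can serve partial clones.
git init --quiet "$src"
printf 'hello\n' > "$src/file.txt"
git -C "$src" add .
git -C "$src" -c user.name=Example -c user.email=example@example.com \
  commit --quiet -m "Add sample file"
git -C "$src" config uploadpack.allowFilter true
git -C "$src" config uploadpack.allowAnySHA1InWant true

# Partial clone; checking out the default branch fetches file.txt on demand.
git clone --quiet --filter=blob:none "file://$src" "$dst"

# Step 1: count the objects that are still missing.
missing="$(git -C "$dst" rev-list --objects --all --missing=print | grep -c '^?' || true)"
echo "missing objects: $missing"

# Step 2: repack everything.
git -C "$dst" repack -a -d -q

# Step 3: delete the leftover .promisor file.
rm -f "$dst"/.git/objects/pack/pack-*.promisor

# Step 4: remove the partial clone configuration.
git -C "$dst" config --unset remote.origin.promisor
git -C "$dst" config --unset remote.origin.partialclonefilter

# Inspect what is left in the pack directory.
ls "$dst/.git/objects/pack"
```

On a real repository, continue past step 1 only when the count of missing objects is zero.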