Why Git LFS

Data Packages contain research data that is often large and may be sensitive in nature. Storing this data securely and tracking changes to it reliably over time is essential for good data governance and reproducible research. This post explains why we chose Git LFS over the alternatives for managing data files.

Published: January 28, 2026

Context and problem statement

Data Packages often contain large data files that may be sensitive in nature, for example because they include personally identifiable information. This data can evolve over time through the addition of new files, as well as through the correction or redaction of existing data items. When researchers start an analysis project, they are typically provided with a subset of the data as it exists at a specific point in time. Being able to track what changed between updates to a Data Package—and to specify which version of the data is used by each iteration of an analysis project—is important both for reproducibility and data governance.

Together, these considerations mean that we need to version control data files contained in Data Packages, while also ensuring that sensitive data never leaves the secure server on which it is stored.

While Git is well suited to tracking changes in source code, it is not optimised for tracking data files. Working-tree operations on large files are slow, and delta compression offers limited benefits for binary data. Moreover, because sensitive data must remain on a secure server for legal reasons (e.g. the EU’s GDPR), it cannot be stored on GitHub as part of the central repository.

Given that there are multiple tools available for the version control of large files, the question we need to answer is:

Which file version control system should we adopt to store and version control our data files while complying with data privacy policies and laws?

Decision drivers

The file version control system should:

  • Support version control of large files.
  • Preserve a sequential history of changes.
  • Allow the data to be restored to a specific point in its history.
  • Work well in a secure server context (such as GenomeDK).
  • Allow files to be stored locally on a secure server without pushing data to an external remote (e.g., GitHub).
  • Be easy to install on a secure server without requiring administrative privileges.
  • Enable working on personal machines outside the secure server, where the data is not available.
  • Integrate well with Git, which we will continue to use to version control code and metadata.
  • Be easy to configure and use.
  • Be actively maintained, widely adopted, and well documented.

Considered options

The most popular data version control systems are:

Git LFS

Git LFS is a widely used system for the version control of large files within Git-based workflows. It is an open-source Git extension developed and maintained by GitHub and other contributors. Git LFS works by storing large data files in a separate LFS object store and replacing them in the Git repository with small pointer files that reference the stored objects.
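
To illustrate the mechanism, here is a minimal sketch of tracking a file with Git LFS (the file path and pattern are hypothetical):

```bash
# Track all CSV files under data/ with Git LFS; this writes a rule
# to .gitattributes, which is committed alongside the data.
git lfs track "data/*.csv"
git add .gitattributes data/patients.csv
git commit -m "Add patient data via Git LFS"

# The Git repository itself now stores only a small pointer file:
git show HEAD:data/patients.csv
# version https://git-lfs.github.com/spec/v1
# oid sha256:<64-character hash of the file contents>
# size <file size in bytes>
```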

Benefits

  • Maintains a history of file versions over time.
  • Allows restoring data to previous versions.
  • While git diff shows the difference between the pointer files by default, Git can be configured to show the difference between the data files themselves (see the configuration sketch after this list).
  • Possible to declare rules for which folders or file types should be tracked automatically.
  • Can be installed without administrative privileges, for example, via conda on GenomeDK.
  • Supports local or self-hosted storage: a shared folder on GenomeDK can be used as the LFS object store instead of a GitHub-managed LFS server.
  • Integrates seamlessly with Git; the same Git commands are used to manage data and code files.
  • Supports working without data access on personal machines by configuring Git not to run LFS-related logic outside the secure server.
  • Easy to set up and use, requiring relatively little configuration.
  • Very popular, well established, and widely used.
  • Lots of guides and resources (though the official documentation could be better).
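
As a concrete sketch of the configuration behind several of the points above (all paths and patterns are hypothetical; the file:// object store requires a reasonably recent Git LFS release, and the diff setting is a commonly suggested configuration rather than a default):

```bash
# 1. Use a shared folder on the secure server as the LFS object store
#    instead of a GitHub-managed LFS server:
git config lfs.url "file:///shared/data-package/lfs-store"

# 2. Declare tracking rules; matching files are handled by LFS
#    automatically from then on:
git lfs track "data/**"

# 3. Make git diff show the contents of data files rather than the
#    pointer files (the working tree holds the real, smudged contents):
git config diff.lfs.textconv cat

# 4. On personal machines without data access, tell Git LFS to keep the
#    pointer files and skip downloading the data:
git lfs install --skip-smudge
# or, for a one-off clone:
GIT_LFS_SKIP_SMUDGE=1 git clone <repository-url>
```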

Drawbacks

  • No delta compression: each new version of a data file is stored in full.
  • Risk of accidental data uploads: the default configuration assumes a GitHub-managed LFS server, so safeguards may be required to prevent data from being uploaded to GitHub by a missing or misconfigured environment (see the safeguard sketch under Consequences).

git-annex

git-annex is an open-source, distributed file management system. Its most common use case is synchronising (large) files across multiple storage locations and media, such as personal computers, external hard drives, and cloud storage. It works by extending a Git repository with an annex that has a dual function: it tracks which files are available in which storage locations, and it acts as a local object store for file contents that are present in the working tree. Files in the working tree are symbolic links to this local object store.
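
For comparison, a minimal git-annex sketch (the repository description, size rule, and paths are hypothetical):

```bash
cd my-data-package
git annex init "secure-server"

# Automatically annex files over 1 MB instead of storing them in Git:
git annex config --set annex.largefiles "largerthan=1mb"

# Add a data file: the content moves into .git/annex/objects and the
# working-tree path becomes a symbolic link to it:
git annex add data/measurements.csv
git commit -m "Add measurements"

# Retrieve or drop the content on machines that do or do not need it:
git annex get data/measurements.csv
git annex drop data/measurements.csv
```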

Benefits

  • Maintains a history of file versions over time.
  • Allows restoring data to a previous version.
  • Storage options are very flexible; declaring a local folder on a secure server as the only storage location is easy.
  • Possible to declare rules based on path or file size for which files to annex automatically.
  • Allows working without data access on personal machines.
  • Does not assume any specific hosting provider by default, so there is a lower risk of accidentally pushing data to a remote location such as GitHub.
  • Can be installed without administrative privileges, for example, via conda on GenomeDK.
  • Possible to manage the annex via a GUI by running a locally served web app.
  • Actively maintained and long established.

Drawbacks

  • More complex to set up and use than Git LFS.
  • The main use case is not a perfect fit for us, because we will only have one storage location and won’t need to synchronise data across several remotes.
  • The basic workflow of adding and editing data files is more complex, because tracking the files with git-annex and storing the files in a particular storage location are two distinct tasks.
  • Less widely adopted than Git LFS, mainly used by specialist communities.

DVC

DVC is a large file management system similar to Git LFS, developed to address data versioning challenges specific to machine learning workflows. It works by storing data files outside the Git repository and replacing them with Git-tracked metadata files that reference the actual data. In addition to basic data versioning, DVC provides features that support the broader machine learning lifecycle.
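
A minimal DVC sketch under the same assumptions (hypothetical paths, with a local folder on the secure server as the default remote):

```bash
git init && dvc init

# Use a folder on the secure server as the default DVC remote:
dvc remote add -d secure-store /shared/data-package/dvc-store

# Track a data file: DVC moves the content into its cache and creates a
# small metadata file (data/raw.csv.dvc) that Git tracks instead:
dvc add data/raw.csv
git add data/raw.csv.dvc data/.gitignore
git commit -m "Track raw data with DVC"

# Upload the data to the remote and, elsewhere, fetch it back:
dvc push
dvc pull
```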

Benefits

  • Maintains a history of file versions over time.
  • Allows restoring data to previous versions.
  • Storage options are very flexible.
  • Allows working without data access on personal machines.
  • Does not assume any specific hosting provider by default.
  • Can be installed via uv.
  • Makes it easy to import data from other DVC repositories.
  • Good documentation.

Drawbacks

  • More complex to set up and use than Git LFS.
  • Many of DVC’s machine learning–specific features are not relevant to our use case.
  • The basic workflow of adding and editing data files is more complex than with Git LFS, because separate commands have to be issued for managing data files and code files.

Decision outcome

We decided to use Git LFS because, while all of the options considered fulfil our requirements, Git LFS is the simplest to set up and use. It integrates fully with Git, meaning that, after initial configuration, there is no need to learn an additional version control system, and that it is possible to inspect differences between data versions when needed. The ability to declare rules for automatically tracking specific files or directories is also particularly useful for our workflow. As we are not using a GitHub-managed server to store our data, we do not need to weigh these benefits against GitHub-specific storage limits or costs.

Our goals could also be achieved using either git-annex or DVC, both of which are more powerful tools that offer finer-grained control over data management operations. However, our use case is relatively simple, and ease of use is a key factor in our decision. DVC may nevertheless be of interest for analysis projects, as it can record rich metadata about datasets and can explicitly track, through its metadata files, which scripts produced which data. But since Data Packages aren't analysis projects, the simplicity and full Git integration of Git LFS outweigh the benefits of DVC.

Consequences

  • Because Git LFS is tightly integrated with Git and GitHub, its default configuration assumes a GitHub-managed LFS server. As a result, additional safeguards (e.g., checks via pre-commit or pre-push hooks; see the sketch after this list) may be required to prevent sensitive data from being uploaded to GitHub by accident.
  • Even with such safeguards in place, as with any other tool, it is not possible to completely eliminate the risk of accidental data upload due to human error. Clear documentation and guidance are therefore required to ensure that contributors understand the correct workflow when working with the version control of sensitive data.
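
As an illustration of what such a safeguard could look like, the following hypothetical pre-push hook refuses to push whenever the LFS object store is not the expected local folder (the expected URL is an assumption; a pre-commit hook could perform a similar check):

```bash
#!/bin/sh
# .git/hooks/pre-push: refuse to push unless Git LFS is configured to
# use the expected local object store, so a misconfigured clone cannot
# upload LFS data to GitHub.
expected="file:///shared/data-package/lfs-store"
actual=$(git config --get lfs.url)

if [ "$actual" != "$expected" ]; then
    echo "pre-push: lfs.url is '$actual' (expected '$expected')." >&2
    echo "Refusing to push to avoid uploading data to the wrong server." >&2
    exit 1
fi
```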

Resources used for this post