Large-Scale Genomics (LSG) Work Stream

Develops products to describe, compress, store, encrypt, and transfer genomic data in a scalable way.

Genomic sequencing generates data at an increasingly large scale. As public and national health systems continue to adopt genomic testing and more private companies launch genomic projects, the vast quantity of raw sequencing data will only balloon further. The Large-Scale Genomics (LSG) Work Stream produces robust, standardised ways to describe, store, and access this vital genomic information.

Image summary: The LSG Work Stream develops efficient formats to store, access, and analyse sequencing reads, genetic variation, and gene expression information.

Technical description

Produces standardised file formats and remote access protocols for storing, compressing, encrypting, querying, and sharing genomic data at scale.

Work Stream leads

Geraldine Van der Auwera
Oliver Hofmann

Staff contact

Reggan Thomas

TOOLS & PLATFORMS

GitHub

Products

Browser Extensible Data (BED)

Stores information on the location of genomic features

Genomic/phenotypic/clinical data representation

Technical specification, File format

Large-Scale Genomics (LSG) Work Stream

Approved
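
As a rough illustration of what a BED record holds, the sketch below parses one tab-separated line into its three required columns (chrom, chromStart, chromEnd) plus the optional name column. BED coordinates are zero-based and half-open, so a feature's length is simply end minus start. The `BedInterval` class and `parse_bed_line` helper are hypothetical names chosen for this example; they are not part of any GA4GH library, and the sketch assumes tab-delimited input.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BedInterval:
    """One genomic feature, using BED's zero-based, half-open coordinates."""
    chrom: str
    start: int  # zero-based, inclusive
    end: int    # exclusive
    name: Optional[str] = None

    @property
    def length(self) -> int:
        # Half-open intervals make the length a plain subtraction.
        return self.end - self.start


def parse_bed_line(line: str) -> BedInterval:
    """Parse a single tab-separated BED line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs chrom, chromStart, and chromEnd")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    name = fields[3] if len(fields) > 3 else None
    return BedInterval(chrom, start, end, name)


if __name__ == "__main__":
    print(parse_bed_line("chr1\t100\t200\tfeatureA"))
```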

CRAM

Uses data compression strategies to efficiently store genomic data

Genomic/phenotypic/clinical data representation

Technical specification, File format

Large-Scale Genomics (LSG) Work Stream

Approved

Genetic Data Encryption (Crypt4GH)

Provides an encrypted file format to keep data secure throughout its lifetime while allowing random access

Genomic/phenotypic/clinical data representation

Technical specification, API

Large-Scale Genomics (LSG) Work Stream

Approved

Genetic Variation Formats (VCF)

Provides a text file format for storing genetic variation data

Genomic/phenotypic/clinical data representation

Technical specification, File format

Large-Scale Genomics (LSG) Work Stream

Approved
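
To make the "text file format" description concrete, here is a minimal sketch that splits one VCF data line into its eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). Unlike BED, the POS column is one-based. The `VcfRecord` type and `parse_vcf_line` function are hypothetical names for this example only; the sketch ignores header lines and per-sample genotype columns, and keeps INFO values as raw strings rather than typing them from the header.

```python
from typing import Dict, List, NamedTuple, Union


class VcfRecord(NamedTuple):
    chrom: str
    pos: int                 # one-based position
    var_id: str
    ref: str
    alts: List[str]          # empty when ALT is "."
    qual: str
    filt: str
    info: Dict[str, Union[str, bool]]


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the eight fixed columns of one VCF data line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]

    info_map: Dict[str, Union[str, bool]] = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            # Keys without "=value" are flags; record them as True.
            info_map[key] = value if sep else True

    alts = [] if alt == "." else alt.split(",")
    return VcfRecord(chrom, int(pos), var_id, ref, alts, qual, filt, info_map)


if __name__ == "__main__":
    print(parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB"))
```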

htsget

Allows users to download read and variation data for subsections of the genome

Cloud genomics

Technical specification, API

Large-Scale Genomics (LSG) Work Stream

Approved
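
The request-and-ticket pattern behind htsget can be sketched as follows. A client asks a `reads` endpoint for a genomic region and receives a JSON "ticket" listing the URLs of the data blocks to fetch and concatenate. The sketch only builds the request URL and reads the block URLs out of a ticket; it makes no network calls. The host names, the `sample1` identifier, and the helper names `build_reads_url` and `block_urls` are assumptions for illustration, and a real client would also handle per-block headers, `data:` URIs, and error responses.

```python
import json
from typing import List, Optional
from urllib.parse import quote, urlencode


def build_reads_url(base_url: str, read_id: str,
                    reference_name: Optional[str] = None,
                    start: Optional[int] = None,
                    end: Optional[int] = None,
                    fmt: str = "BAM") -> str:
    """Build an htsget-style reads request URL (start is 0-based, end exclusive)."""
    params = {"format": fmt}
    if reference_name is not None:
        params["referenceName"] = reference_name
    if start is not None:
        params["start"] = str(start)
    if end is not None:
        params["end"] = str(end)
    return f"{base_url.rstrip('/')}/reads/{quote(read_id, safe='')}?{urlencode(params)}"


def block_urls(ticket_json: str) -> List[str]:
    """Return the data-block URLs listed in a ticket, in download order."""
    ticket = json.loads(ticket_json)
    return [block["url"] for block in ticket["htsget"]["urls"]]


if __name__ == "__main__":
    # Hypothetical server and identifier, for illustration only.
    print(build_reads_url("https://htsget.example.org", "sample1", "chr1", 0, 1000))
    example_ticket = json.dumps({
        "htsget": {
            "format": "BAM",
            "urls": [
                {"url": "https://data.example.org/block1"},
                {"url": "https://data.example.org/block2"},
            ],
        }
    })
    print(block_urls(example_ticket))
```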

refget Sequence Collections

Solves naming chaos for genomes by generating unique identifiers for collections of sequences

Data discovery

Technical specification, API

Large-Scale Genomics (LSG) Work Stream

Approved

refget Sequences

Employs a computer algorithm to unambiguously identify reference sequences for genomic analysis

Data discovery

Technical specification, API

Large-Scale Genomics (LSG) Work Stream

Approved
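
One way to picture the identification algorithm is as a content digest: the same sequence always yields the same identifier, whatever name a given assembly gives it. The sketch below assumes the sha512t24u scheme used for GA4GH digests (SHA-512, truncated to the first 24 bytes, then base64url-encoded). The `SQ.` prefix and the upper-casing step are assumptions about how refget normalises and labels sequences, and `sequence_digest` is a hypothetical helper name; consult the refget specification for the authoritative rules.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest truncated to 24 bytes and base64url-encoded (32 characters)."""
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_digest(sequence: str) -> str:
    """Return a content-based identifier for a sequence (hypothetical helper).

    Assumption: the sequence is upper-cased before hashing and the digest
    is labelled with an "SQ." prefix.
    """
    normalised = sequence.upper().encode("ascii")
    return "SQ." + sha512t24u(normalised)


if __name__ == "__main__":
    # Case differences do not change the identifier under this normalisation.
    print(sequence_digest("ACGT"))
    print(sequence_digest("acgt") == sequence_digest("ACGT"))
```

Because the identifier depends only on the sequence content, two resources that hold the same reference sequence under different names would arrive at the same digest.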

RNAget

Provides a common set of communication channels to efficiently retrieve RNA data of interest

Genomic/phenotypic/clinical data representation

Technical specification, API

Large-Scale Genomics (LSG) Work Stream

Approved

SAM/BAM

Provides a format for storing next-generation sequencing read data

Genomic/phenotypic/clinical data representation

Technical specification, File format

Large-Scale Genomics (LSG) Work Stream

Approved

Variant Benchmarking Tools

Offers methods for robustly checking variant call accuracy

Genomic/phenotypic/clinical data representation

Technical specification, Technical implementation guidance

Large-Scale Genomics (LSG) Work Stream

Approved

WGS Quality Control Standards

Describes a set of quality control metrics and their detailed definitions to facilitate exchange of results across initiatives

Genomic/phenotypic/clinical data representation

Technical specification, Data model / ontology

Large-Scale Genomics (LSG) Work Stream, Genomic Knowledge Standards (GKS) Work Stream

In development

Publications

Community Resources

Dive deeper into our Work Stream! LSG produces standardised methods for storing, accessing, and analysing genomic data (reads, variants, and expression data) on a large scale. For remote queries, the Work Stream also develops standards for file-based, API-based, cloud-based, and distributed access.

Announcements

Date | Title | Info
27 Mar 2025 | The Global Alliance for Genomics and Health (GA4GH) Product Steering Committee has approved the release of two new GA4GH products: refget Sequence Collections and Variation Representation Specification (VRS) v2.0. |
13 Aug 2024 | Large Scale Genomics (LSG) Work Stream Crypt4GH survey is now available |
6 Aug 2024 | VCF v4.5 is now live! |
29 Feb 2024 | Refget: Sequence Collections v1.0 Open for Public Comment | Please review the document by Wednesday, 30 March 2024.
17 Jul 2023 | refget v2.0 approved |
12 May 2023 | Welcome to the new GA4GH website! | Tell us what you think!
15 Feb 2023 | GA4GH products open for comment: CRAM v3.1, refget v2.0 | Please review and provide your feedback for CRAM v3.1 and refget v2.0 by 14 March 2023.
27 Oct 2022 | Open for comment: Variant Summary Statistics Format v1.0 | Please submit feedback by Thursday, 1 December 2022, at 17:00 UTC.

Meetings

Documents

Date | Title
27 Aug 2024 | GA4GH Updates Report - August 2024
24 May 2024 | Meeting report: April Connect 2024
15 Dec 2023 | Driver Project Engagement Matrix
7 Nov 2023 | Meeting Report: September Connect 2023
19 Sep 2023 | GA4GH Product Development and Approval Process
14 Sep 2023 | Code of Ethics and Community Conduct (CECC)
14 Sep 2023 | Guidelines for respectful engagement
14 Sep 2023 | Procedures for reviewing reported violations of the Code of Ethics and Community Conduct
5 Apr 2023 | Informational packet: getting involved with GA4GH
11 Mar 2023 | Meeting minutes: crypt4gh (2023)
11 Mar 2023 | Meeting minutes: file formats (2023)
11 Mar 2023 | Meeting minutes: Future of VCF (2023)
11 Mar 2023 | Meeting minutes: LSG - htsget (2023)
11 Mar 2023 | Meeting minutes: refget (2023)
11 Mar 2023 | Meeting minutes: RNAget (2023)
10 Mar 2023 | Start-up guide: refget
6 Mar 2023 | Meeting minutes: Quality Control of Whole Genome Sequencing
6 Mar 2023 | Road map: Quality Control of Whole Genome Sequencing
25 Sep 2022 | Meeting minutes: Large-Scale Genomics
4 Jan 2022 | Meeting minutes: crypt4gh (2022)
4 Jan 2022 | Meeting minutes: RNASeq (2022)
12 Dec 2020 | Meeting minutes: LSG - htsget (2021 to 2022)
8 Dec 2020 | Meeting minutes: crypt4gh (2021)
8 Dec 2020 | Meeting minutes: File Formats (2021)
8 Dec 2020 | Meeting minutes: Future of VCF (2021)
8 Dec 2020 | Meeting minutes: refget (2021)
8 Dec 2020 | Meeting minutes: RNASeq (2021)
26 Apr 2019 | CRAM: the genomics compression standard

Contributors

Jeremy Adams: DNAstack
Shakuntala Baichoo: University of Mauritius
Dixie Baker: Martin, Blanck and Associates
Michael Baudis: University of Zurich
Edmon Begoli: Oak Ridge National Laboratory (ORNL)
Nicolas Bertin: Genome Institute of Singapore
James Bonfield: Wellcome Sanger Institute (WSI)
Guillaume Bourque: McGill University / Université McGill
David Bujold: McGill University / Université McGill
Daniel Cameron: Walter and Eliza Hall Institute of Medical Research
Timothe Cezard: EMBL's European Bioinformatics Institute (EBI)
Shu Hui Chen: NIH National Heart, Lung, and Blood Institute (NHLBI)
Guy Cochrane: Independent Contributor
Robert Davies: Wellcome Sanger Institute (WSI)
Richard Durbin: University of Cambridge
Yossi Farjoun: Lady Davis Institute
Mallory Freeberg: EMBL's European Bioinformatics Institute (EBI)
Kais Ghedira: Institut Pasteur de Tunis
Romain Gregoire: Canadian Centre for Computational Genomics
Roderic Guigo: Centre for Genomic Regulation
Sveinung Gundersen: Centre for Bioinformatics, University of Oslo
Yosr Hamdi: Institut Pasteur de Tunis
Reece Hart: MyOme
Muhammad Haseeb: EMBL's European Bioinformatics Institute (EBI)
Frédéric Haziza: Centre for Genomic Regulation
Michael Hoffman: Princess Margaret Cancer Centre
Oliver Hofmann: University of Melbourne Centre for Cancer Research
David Jackson: Wellcome Sanger Institute (WSI)
Thomas Keane: EMBL's European Bioinformatics Institute (EBI)
Jerome Kelleher: University of Oxford
Rasko Leinonen: EMBL's European Bioinformatics Institute (EBI)
Anders Leung: Independent Contributor
Mike Lin: Independent Contributor
John Marshall: University of Glasgow
Emilio Palumbo: Centre for Genomic Regulation
Martin Pollard: Wellcome Sanger Institute (WSI)
Shaikh Farhan Rashid: University Health Network, Canadian Distributed Infrastructure for Genomics (CanDIG)
Emilio Righi: Centre for Genomic Regulation
Nathan Sheffield: University of Virginia
Albert Smith: University of Michigan
Jing Su: Wellcome Sanger Institute (WSI)
Sean Upchurch: California Institute of Technology
Roman Valls Guimera: University of Melbourne Centre for Cancer Research
Zhenyu Zhang: University of Chicago

News, events, and more

Catch up with all news and articles associated with the Large-Scale Genomics (LSG) Work Stream.

27 Mar 2025

refget Sequence Collections is an approved GA4GH product

21 Aug 2024

Advancing GA4GH products with the updated Product Development and Approval Process

8 Sep 2023

Scaling VCF for a genomic revolution
