Shrinking Genomes, Expanding Insights: Biobank WGS Made Easy

Abstract: Whole-genome sequencing (WGS) of hundreds of thousands of people is opening new doors in understanding human health, but handling datasets as large as the UK Biobank’s 500k genomes is no small feat. Enter the annotated Genomic Data Structure (aGDS): a format that compresses enormous VCF files into far smaller, manageable files without losing important genetic or functional information. Using vcf2agds, the original 1,473 TiB of WGS data shrank to just 1.10 TiB, ready for analysis.
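
To put those two figures in perspective, here is a minimal Python sketch that works out the size reduction they imply. Only the 1,473 TiB and 1.10 TiB values come from the article; the function name compression_summary and its output fields are made up for this illustration and are not part of vcf2agds or any other tool.

```python
def compression_summary(original_tib: float, compressed_tib: float) -> dict:
    """Summarize a size reduction (hypothetical helper, for illustration only)."""
    if original_tib <= 0 or compressed_tib <= 0:
        raise ValueError("sizes must be positive")
    fraction_remaining = compressed_tib / original_tib
    return {
        # How many times smaller the compressed data is than the original.
        "fold_reduction": original_tib / compressed_tib,
        # Share of the original size that remains after conversion.
        "percent_of_original": 100 * fraction_remaining,
        # Share of the original size that was saved.
        "percent_saved": 100 * (1 - fraction_remaining),
    }


# Sizes reported in the article: VCF input vs. aGDS output, in TiB.
summary = compression_summary(1473.0, 1.10)
print(f"about {summary['fold_reduction']:.0f}-fold smaller")
print(f"about {summary['percent_saved']:.2f}% of the original size saved")
```

By this arithmetic the aGDS files come to roughly one 1,300th of the original VCF footprint.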

Once the data are in aGDS format, researchers can use the STAAR pipeline to run fast, functionally informed genome analyses. This approach revealed hundreds of significant genetic associations for traits like cholesterol, covering both coding and noncoding regions. By combining efficiency, scalability, and rich annotations, aGDS is transforming how biobank-scale WGS studies are done, making it easier than ever to uncover the hidden stories in our DNA.

Get the full article here: "Streamlining large-scale genomic data management: Insights from the UK Biobank whole-genome sequencing data" (ScienceDirect)