Human Genome Informatics
Australian BioCommons is committed to ensuring that Australia implements infrastructure for human genome data warehousing, sharing and analysis that adheres to global best practice standards.
This work will benefit human health and medicine, and is in alignment with our mission to actively support life science research communities with community scale digital infrastructure that is developed and maintained in concert with international peer infrastructures.
Context
Affordable DNA sequencing at scale has enabled the genomes of hundreds of thousands of people to be determined across the world and has led to a better understanding of the causes of complex diseases, better diagnosis and earlier disease detection, and more options for tailored treatment.
To achieve these outcomes, genomic information from one individual needs to be compared with many other genomes from similar cases, forming cohorts large enough to produce statistically meaningful results. This is often done across multiple efforts and jurisdictions, at a national or global scale, and requires the genomic data to be findable, searchable, shareable, and linkable to analytical capabilities.
Due to the sensitive nature of genomic information, the privacy of individuals must always be protected, and any data processing must always be done ethically, securely and safely.
How can sharing human genome data help accelerate research into understanding disease causes, treatments and prevention?
Human genome analysis across Australia: scale and challenges
Large-scale human genome sequencing and analysis efforts in Australia include those undertaken by ZERO Childhood Cancer, Australian Genomics, the University of Melbourne Centre for Cancer Research (UMCCR), the Garvan Institute of Medical Research and QIMR-Berghofer Medical Research Institute.
As of Q3 2020, these and other groups across Australia have sequenced and analysed the genomes of tens of thousands of people. Thanks to the Federal Government's recent investment of $500M over 10 years for a Genomics Health Futures Mission to support new and expanded studies in rare disease, cancer, and complex conditions, this number is predicted to increase more than 10-fold by 2025.
To date, human genome sequencing and analysis efforts across Australia have developed in-house solutions, based on different technologies, for storing and warehousing genome data and describing the content of these collections, and they rely on largely manual, laborious systems for managing and providing access to data for bona fide researchers. The content of each collection is largely not transparent to outside users. Although there is a desire to share data wherever possible for research use, most efforts have no efficient way to expose the collection content to researchers or to distribute the data, so sharing currently carries a substantial burden. All need to operate scalable infrastructure that is easily administered and that allows for the efficient management of data files and metadata. This management needs to include storage, security, access control, findability and shareability with relevant authorised parties.
World’s best practice infrastructure enables faster and easier human genome research
Much work is being done globally to build a future in which responsible genomic data sharing for the benefit of human health is routine.
This includes the groundbreaking efforts of the Global Alliance for Genomics and Health (GA4GH) to create frameworks, policies and standards that can be deployed by genome efforts around the world to enable the responsible, voluntary, and secure sharing of genomic and health-related data at scale. The animation from GA4GH shown to the right succinctly explains GA4GH’s goals and mission (credit: SciAni).
Other significant global efforts to build human genome data sharing infrastructure include that of the US National Institutes of Health (NIH) to develop Gen3, a cloud-based software platform for managing, analysing, harmonising, and sharing large human genomic datasets. Gen3 has been used to underpin several very large NIH-funded genomic datasets that collectively house and describe data derived from hundreds of thousands of human samples (e.g. NCI Genomic Data Commons, BioData Catalyst, BloodPAC, BrainCommons).
Additionally, the European Genome-phenome Archive (EGA), the global repository of human genomes for research purposes, continues to develop into a federated and globally distributed resource (Federated EGA). This is building towards a future where data assets remain securely in one jurisdictional location but are findable, and ultimately analysable in situ, by researchers in other jurisdictions (i.e. by moving compute to the data).
Establishing appropriate infrastructure for human genome sharing and analysis in Australia
The Australian BioCommons Human Genome Informatics initiative is working towards establishing infrastructure for human genome data warehousing, sharing and analysis in Australia that adheres to various global best practice standards, and building the necessary foundations so that Australia can participate fully in the global ecosystem of responsible human genomics data analysis.
The aims of the initiative are:
To build for a future where Australian human genomics and health data is stored in a global federated network of public clouds, and to remove friction and artificial barriers between researchers and the insights they can glean from the data.
To support Australian human genomics sequencing and analysis efforts to deploy and operate scalable and globally compatible infrastructure that is easily administered and allows for the efficient management of data files and metadata (storage, security, access control, findability and shareability).
To support the ZERO and AGHA Flagships, through enabling infrastructure to: (a) share data within their consortia, (b) share data to build virtual cohorts internationally, and (c) enable collaborative analysis of these data, nationally and internationally.
These aims will be achieved by working with a range of research partners (including ZERO, AGHA, UMCCR, Garvan, QIMR-Berghofer and others), multiple infrastructure partners (including AAF, NCI, AARNet and others), as well as expert international groups establishing relevant systems (including GA4GH, the developers of Gen3, the ELIXIR Federated Human Data Community, Children’s Hospital of Philadelphia D3b, Seven Bridges Genomics and others).
The impact of this work will be that genomic data from thousands of Australians can be shared securely and responsibly on national and global scales, enabling comparison with very large numbers of other genomes so that their full research value can be realised.
Watch the excellent animation shown on the right from GA4GH, which explains much of this global human data sharing vision and how adoption of various global standards (such as those for data sharing and security developed by GA4GH) can be employed to make this happen (credit: SciAni).
Activity areas
Systems to support virtual cohort assembly (underpinned by Gen3 technology)
User facing (public) interfaces to enable querying data held in participating genome repositories
Common data dictionaries and agreed minimum information standards applied across participating genome repositories
Systems to enable identification of virtual cohorts across multiple participating genome repositories
Interfacing with secure sequence file data storage at each genomics data repository
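The virtual cohort idea above can be illustrated with a minimal sketch. It assumes each participating repository exposes participant records that follow a shared data dictionary; the field names, repository names and records below are hypothetical and are not taken from Gen3 or any Australian repository. The sketch counts matching participants per repository and sums them, without moving any underlying data.

```python
from dataclasses import dataclass

# Hypothetical minimum-information fields agreed across repositories.
REQUIRED_FIELDS = {"participant_id", "diagnosis", "sex", "age_at_collection"}


@dataclass
class Repository:
    name: str
    records: list  # list of dicts, one per participant


def conforms(record):
    """True if the record carries every field in the shared dictionary."""
    return REQUIRED_FIELDS.issubset(record)


def count_matches(repo, criteria):
    """Count conforming records in one repository that match all criteria."""
    return sum(
        1
        for record in repo.records
        if conforms(record)
        and all(record.get(field) == value for field, value in criteria.items())
    )


def virtual_cohort_size(repos, criteria):
    """Return per-repository counts and the combined virtual cohort size."""
    per_repo = {repo.name: count_matches(repo, criteria) for repo in repos}
    return per_repo, sum(per_repo.values())


# Illustrative data only.
repo_a = Repository("repo_a", [
    {"participant_id": "A1", "diagnosis": "neuroblastoma", "sex": "F", "age_at_collection": 4},
    {"participant_id": "A2", "diagnosis": "osteosarcoma", "sex": "M", "age_at_collection": 12},
])
repo_b = Repository("repo_b", [
    {"participant_id": "B1", "diagnosis": "neuroblastoma", "sex": "M", "age_at_collection": 3},
    {"participant_id": "B2", "diagnosis": "neuroblastoma"},  # incomplete record: skipped
])

per_repo, total = virtual_cohort_size([repo_a, repo_b], {"diagnosis": "neuroblastoma"})
print(per_repo, total)
```

Returning only counts is a deliberate choice in this sketch: a researcher can learn whether a cohort of useful size exists before any access request is made.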
Providing/expediting safe and secure access to genomics data
Deploying systems to semi-automate User Approvals by Data Access Committees (e.g. DUOS, REMS)
User Authentication (AuthN) and Authorisation (AuthZ) systems, with assurance levels appropriate for human genome data (e.g. GA4GH Passports and GA4GH Authentication and Authorization Infrastructure [AAI])
Systems to:
Send approved data to approved users
Provide access to approved data through association with approved user’s cloud-based storage
Move compute to the approved dataset(s)
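As a rough illustration of the access model above, the following sketch records a Data Access Committee approval as a time-limited grant and checks it before data is released. It is a simplified stand-in and not an implementation of GA4GH Passports, DUOS or REMS: the `AccessGrant` class, its fields, the one-year validity period and all identifiers are assumptions made for illustration. A real deployment would carry approvals in signed tokens issued by a trusted broker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessGrant:
    """Hypothetical record of one approval for one user and one dataset."""
    user_id: str
    dataset_id: str
    approved_by: str   # the Data Access Committee that approved the request
    expires: datetime


def issue_grant(user_id, dataset_id, approved_by, now, valid_days=365):
    """Create a grant that expires valid_days after approval (assumed policy)."""
    return AccessGrant(user_id, dataset_id, approved_by, now + timedelta(days=valid_days))


def is_authorised(grants, user_id, dataset_id, now):
    """True if the user holds an unexpired grant for the dataset."""
    return any(
        g.user_id == user_id and g.dataset_id == dataset_id and g.expires > now
        for g in grants
    )


# Illustrative identifiers only.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
grants = [issue_grant("researcher-1", "dataset-42", "example-dac", now)]

print(is_authorised(grants, "researcher-1", "dataset-42", now))
print(is_authorised(grants, "researcher-1", "dataset-99", now))
print(is_authorised(grants, "researcher-1", "dataset-42", now + timedelta(days=400)))
```

The same check could sit in front of each of the three delivery options listed above, whether data is sent to the user, linked to their cloud storage, or analysed in place.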
Providing access to connected Cloud analysis platform(s) by:
Linking cloud-based data analysis platform(s) (e.g. Cavatica, Illumina Analytics Platform, Terra) to approved data
Deployment of globally harmonised analysis pipelines
Globally federated compute
File and Metadata submission to International EGA Human Genome Data Repository
Ensuring participating repositories can easily produce structured phenotype information that meets the metadata requirements of EGA
Systems to automate / semi-automate a production feed of metadata in the format required by EGA
Systems in place for streamlined encryption and uploading of genome files to the EGA (be it Central or Local) repository
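A semi-automated metadata feed of the kind described above could start with a validation step such as the one sketched here. The required field names are hypothetical placeholders and do not reflect the actual EGA submission schema, and the JSON output is only an example of a machine-readable feed, not the format EGA requires.

```python
import json

# Hypothetical required fields; the real list would come from the EGA schema.
REQUIRED = ("sample_alias", "subject_id", "sex", "phenotype")


def validate(record):
    """Return a list of problems; an empty list means the record is ready."""
    return [f"missing field: {field}" for field in REQUIRED if not record.get(field)]


def build_feed(records):
    """Split records into a JSON feed of valid entries and a map of rejections."""
    ready, rejected = [], {}
    for index, record in enumerate(records):
        problems = validate(record)
        if problems:
            rejected[index] = problems
        else:
            # Keep only the agreed fields so local extras are not submitted.
            ready.append({field: record[field] for field in REQUIRED})
    return json.dumps(ready, indent=2, sort_keys=True), rejected


# Illustrative records only.
records = [
    {"sample_alias": "S1", "subject_id": "P1", "sex": "female",
     "phenotype": "coronary artery disease", "local_note": "internal"},
    {"sample_alias": "S2", "subject_id": "P2", "sex": "male"},
]

feed, rejected = build_feed(records)
print(feed)
print(rejected)
```

Rejected records are reported by position rather than dropped silently, so a curator can fix them at the source repository before the next run.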
Exploring Local EGA Node(s) in Australia
Study to assess the Local EGA and the feasibility of Local EGA node deployment(s) in Australia from a technical, policy and funding perspective
Community Engagement and Workforce Transition
Resources (including Documentation, Training Materials and Events) to enable:
Researchers and Clinicians to use the systems
IT infrastructure providers elsewhere to deploy the systems
Current Projects
The Australian Cardiovascular disease Data Commons (ACDC)
Project partners:
Past Projects
Global Technologies and Standards for Sharing Human Genome Research Data
Jan 2021 - Nov 2023
Project partners:
Establishing a harmonised data environment for Australian Coronary Artery Disease (CAD) cohorts
Oct 2021 - Dec 2022
Project partners:
Establishing the Gen3 Data Sharing platform technology to enable better Human Genome Data sharing in Australia
Apr 2021 - Sep 2021
Project partners:
Delivering impact to Australian Researchers by participating in a Global Data Commons
2020 - 2021
Project partners: