Supercomputing access powers development of a new multi-omics resource
A new bioinformatics platform to support multi-omics research has been released by Australian members of the International Cannabis Genomics Research Consortium (ICGRC). The web-accessible platform is designed for data sharing, hosting and analysis, and is freely available to the global cannabis research community.
Southern Cross University researchers Locedie Mansueto and Dr Ramil Mauleon wanted to build an authoritative, open-science-focused web portal that would support multi-omics research on Cannabis sativa. Leveraging their ongoing connection to Australian BioCommons via the multi-omics community, Loc and Mau sought support from the Australian BioCommons Leadership Share (ABLeS) to help build a key feature of the web portal: CannSeek. The CannSeek database contains approximately 100 million single-nucleotide polymorphisms (SNPs, pronounced ‘snips’). Loc produced CannSeek as part of his PhD thesis:
We needed access to a large supercomputing allocation to allow us to analyse the entire collection of multi-sample Cannabis sativa next-generation sequencing data available in NCBI (over 2,500 samples in Dec 2022). Our allocation on NCI’s Gadi supercomputer via ABLeS was instrumental in analysing such a large quantity of sequence data.
Loc continued to work on Gadi to optimise a variant calling pipeline that combines both GATK and Parabricks software.
We deployed the optimised variant calling pipeline to compare three reference genomes with the 2,500 next-generation sequencing samples. We identified 90-100 million SNPs that form the CannSeek database. Compare this to the roughly 30 million SNPs from 3,000 rice samples and you can see why we needed the supercomputing resources!
Mau notes that the computational challenges didn’t end there:
The Gadi supercomputer was equally instrumental in solving our next challenge of finding a small (~1,500) subset of SNPs from the CannSeek database that would allow fingerprinting and differentiation of samples from the Cannabis sativa population. The small subset was critical to prepare, as it’s far too expensive to use ~30M SNPs for routine sample fingerprinting!
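To give a feel for the kind of problem Mau describes, here is a minimal sketch of one way a small fingerprinting panel could be picked from a much larger SNP set. It is an illustration only, not the method the ICGRC team used, which the article does not describe. The function name `select_fingerprint_snps`, the toy genotype table and the greedy strategy are all assumptions made for this example.

```python
from itertools import combinations


def select_fingerprint_snps(genotypes, max_snps):
    """Greedily pick SNPs until every pair of samples can be told apart.

    genotypes: dict mapping a (hypothetical) SNP id to a list of genotype
               calls, one call per sample, in the same sample order.
    max_snps:  upper limit on the size of the panel.
    Returns the chosen SNP ids and any sample pairs still indistinguishable.
    """
    n_samples = len(next(iter(genotypes.values())))
    # Every pair of samples starts out unresolved.
    unresolved = set(combinations(range(n_samples), 2))
    chosen = []

    while unresolved and len(chosen) < max_snps:
        best_snp, best_split = None, set()
        for snp, calls in genotypes.items():
            if snp in chosen:
                continue
            # Pairs this SNP separates: the two samples have different calls.
            split = {(i, j) for (i, j) in unresolved if calls[i] != calls[j]}
            if len(split) > len(best_split):
                best_snp, best_split = snp, split
        if best_snp is None:
            # No remaining SNP separates any unresolved pair.
            break
        chosen.append(best_snp)
        unresolved -= best_split

    return chosen, unresolved


# Toy data (invented for illustration): four samples, four SNPs.
toy = {
    "snp_a": [0, 0, 1, 1],
    "snp_b": [0, 1, 0, 1],
    "snp_c": [0, 0, 0, 1],
    "snp_d": [2, 2, 2, 2],  # identical in every sample, so uninformative
}
panel, unresolved = select_fingerprint_snps(toy, max_snps=3)
print(panel, unresolved)
```

In this toy case two SNPs are enough to distinguish all four samples. At the scale of the CannSeek database, each pass would have to scan tens of millions of candidate SNPs across thousands of samples, which hints at why a search of this kind calls for supercomputing resources.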
The ICGRC web portal also contains several other omics tools, including a JBrowse genome browser, a gene function database search, and an expression heatmap.
Learn more about the development of the ICGRC portal and the CannSeek database:
Learn more about ABLeS on our website, or watch a 10-minute overview of the service.