The promise made for the Ramalingaswamy (RLS) fellowship from DBT, offered to Subha, included applying NGS to both plant and medical genomics. In 2011, grain amaranth was strategically selected to fulfill one of those promises because it addresses protein malnutrition in India, an unfortunate side effect of the green revolution. Amaranth was consumed in some parts of India as an exotic alternative to cereals during religious fasting, and in areas such as Nepal where other cereal crops are hard to grow, but it remained unknown to the general population, who would benefit most from consuming it.
This is the longest-running project in genomics using NGS technologies, active from early 2011 until now. The figure shows the timeline below and the accomplishments above, with publications in green.
Initially, finding the grains was a challenge. It was a revelation that grains bought from the market could be sown to grow plants. However, the lack of literature on the taxonomic classification of grain amaranths in India required us to procure seeds for all three species, hypochondriacus, cruentus and caudatus, from a known vendor using their ornamental names: princess’s feather, autumn touch and love lies bleeding, respectively. All were grown to maturity on the campus grounds (see embedded figure, bottom-left). Comparing the taxonomy of the three grain amaranth species with that of the plants grown from market grain made it clear that the grain procured from the market was Amaranthus hypochondriacus (princess’s feather).
Meeta, one of our PhD students and a botanist by training, harvested seeds from all three species and maintained a herbarium. By then, enthusiasm for assembling draft genomes from short reads was spreading through the NGS community worldwide. Tools such as SOAPdenovo had matured enough to assemble genomes from short reads using both paired-end and mate-pair libraries. Raw paired-end and mate-pair reads with varying insert sizes were generated from the chromosomal DNA of Amaranthus hypochondriacus and assembled by Meeta in 2013.
In 2013, long reads from PacBio were becoming popular for obtaining better-quality genome assemblies from eukaryotes. PacBio sequencing services were unavailable in India. Subha and Vibha managed to send leaves of A. hypochondriacus to Pullman, Washington to obtain PacBio reads at 25X coverage. However, IBAB did not have the high-RAM computer required for error correction of PacBio reads, a prerequisite to assembly. Subha recalls using AWS instances late in 2013 to run ECtools, which corrects errors in PacBio reads using contigs assembled from short reads. Believe it or not, the AWS bill that month went as high as $2000. This incident and other server failures caused enough frustration and tears to convince the Director to acquire a computer with 1TB RAM early in 2014.
It should be mentioned that tools for assembling error-prone PacBio reads were still in their infancy, and the learning curve was steep. Several interns, including Nivedita, Sowmya and Savita, learnt to assemble and annotate genomes by helping Meeta. Short-read assembly using SOAPdenovo, however, ran much faster on this computer. Soon a manuscript describing the draft genome and developmental transcriptome of the first C4 dicot and second member of the order Caryophyllales was published (PMID: 25071079). A very proud moment for IBAB!
The next undertaking was to assemble the transcriptomes using reads from 16 samples representing 4 developmental stages. The 1TB RAM server made this achievable. Using chimeric transcripts from the assembly, we were able to show that DHDPS, one of the key genes in the lysine biosynthetic pathway, lies within 4000 bases of a glycosidase gene, with the two genes transcribed in opposite directions such that their 3’UTRs overlap. This proximity was unique to the genome of Amaranthus hypochondriacus, suggesting a potential role for the glycosidase gene in regulating lysine in seeds. This work was published in 2016 (PMID: 28786999).
By 2015, the PIs of the amaranth project had secured a DBT grant funding both sequencing and manpower. This helped in sequencing many other landraces of grain amaranth from India with unknown taxonomy, including Suvarna, marketed by GKVK.
By 2016, improved tools to correct errors in PacBio reads were available. Meeta was able to obtain 12.5% corrected reads using a tool called CANU. The corrected reads were assembled using two tools, CANU and FLYE, and the two assemblies were then merged using the QuickMerge tool, improving the L50 from 1885 to 624. By then, a chromosome-level assembly of another strain grown in the US, Plainsman, was publicly available. Using simulated mate-pair reads of increasing insert sizes from the Plainsman genome, the L50 was reduced to 56. We also used publicly available HiC data from Plainsman to bring the L50 down to 20. This assembly was published in 2020, along with an extensive genomic classification of the IBAB variety (Ah-white) against other accessions of grain amaranths from India, which were sequenced at the BioIT center (PMID: 33262776). For the first time, many landraces of grain amaranths from India could be classified along with the known accessions from a collection of amaranth varieties maintained at the Amaranth Institute. Most importantly, our work showed that Suvarna, marketed by GKVK, is Amaranthus cruentus.
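For readers unfamiliar with the metric, L50 is the smallest number of contigs (or scaffolds) whose combined length covers at least half of the assembly, so a lower L50 means a more contiguous assembly. The following is a minimal sketch in Python of how L50 and the related N50 can be computed. The function name `n50_l50` and the contig lengths are hypothetical illustrations and are not taken from the amaranth assemblies.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    # Walk through contigs from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running sum past half of the
        # assembly gives N50 (its length) and L50 (how many were needed).
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be a non-empty list of positive integers")


# Hypothetical contig lengths in bases, chosen only for illustration.
fragmented = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]
merged = [300, 150, 60, 40]  # same total length, fewer and longer pieces

print(n50_l50(fragmented))  # (70, 4)
print(n50_l50(merged))      # (300, 1)
```

In this toy example the total length is unchanged, but merging reduces L50 from 4 to 1, which is the same kind of improvement reported for the assembly above.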
In 2018, the PIs of the amaranth project at IBAB secured a grant in collaboration with IISER, Tirupati and the University of Hyderabad to identify genes implicated in oil content, seed size and other desirable phenotypes using a technology called TILLING. This work is ongoing. Some landraces that showed differential phenotypes have now been sequenced.
In September 2010, NGS technologies were still in their infancy, with barely a few datasets from cancer samples in the public repository that were useful for training. IBAB found the only RNA-seq dataset from prostate cancer, from 3 affected individuals with matched normal samples for use as controls. This dataset was used to train the PGDB-2011 and 2012 batches in NGS data analysis. Tasks included extracting various differential genetic elements, including cancer-specific gene, non-coding and splice-variant expression, and SNPs. Since tools and pipelines for analysis were still unavailable, students had to develop methods and write programs to extract biologically meaningful entities from the RNA-seq dataset. We dared to publish our results, which gave us much confidence in our training process for producing national and international level capacity in NGS data analysis.
In 2013, as the public repository grew, results from the previous dataset were validated against other prostate cancer datasets deposited by other investigators. Our first effort was to identify pairs of differentially coregulated genes and non-coding genes from the same locus in prostate cancer. Embedded figure C in the timeline above shows such pairs, which were both up- and down-regulated (PMID: 25933431). Since Subha was a PI at NIH, she had also secured access to a large control dataset from prostate cancer, which was used to validate these findings.
One of the pairs mentioned above included a gene that plays a role in androgen transport (ABCC4), and the corresponding non-coding gene from the same locus (PCAT92) was already known to be associated with prostate cancer. The hypothesis was that PCAT92 may play a role in prostate cancer by regulating ABCC4 expression. Deciphering the mechanism by which PCAT92 regulates ABCC4 expression became part of a PhD thesis. Figure D in the timeline above shows that PCAT92 recruits ZIC2, a transcription factor, to the site by simultaneously binding to both the chromosomal DNA near the ABCC4 promoter site and PCAT92 to aid ABCC4 expression (PMID: 3019775).
Vector Genomics (malaria)
Malaria remains a global threat despite extensive past control efforts using insecticides, bednets and other measures. The species An. stephensi, from India, has adapted to urban settings and is now reported in the emerging urbanizing world, including Africa. Climate change is also forcing malaria vectors to move to new geographical areas, with unknown consequences. The goal of the project is to use genomics in conjunction with state-of-the-art technologies that offer alternative approaches for controlling malaria.
This project was undertaken by IBAB in full collaboration with, and with financial support from, TIGS (Tata Institute for Genetics and Society). The aim was to decipher high-quality genomes of multiple strains of An. stephensi displaying varying vectorial capacity. The collaboration also included sequencing a large number of individuals from diverse geographical locations to assess the impact of state-of-the-art gene drive technologies on malaria management in India.
The research infrastructure, including BioIT, and the capacity built by IBAB until 2018 enabled IBAB to participate actively in and contribute to this effort. The project started in October 2018 with generous funding from TIGS both to perform sequencing and to build capacity in bioinformatics. In the four years since 2018, despite pandemic-related lockdowns, high-quality sequences and genome assemblies of 5 distinct strains of An. stephensi (STE2, SDA500, UCI strain, IndCh and IndINT) were obtained. Comparative analysis of these genomes resulted in methods to establish genotypes associated with some important phenotypes, which are reported in a number of publications. The timeline of the various activities and accomplishments is given below.
Findings from this effort that call for downstream work include:
- Deciphering all olfactory receptors coded by An. stephensi and identifying the one with a potential role in host recognition (PMID: 33312190).
- Genes within an inversion region (2Rb) that has been implicated in insecticide resistance and/or circadian rhythm (PMID: 33568145).
- A gold-standard genome assembly of the UCI strain to advance malaria management in India and elsewhere (PMID: 35246568). Interestingly, the genomes of IndCh and UCI are homozygous for opposite genotypes of 2Rb.
- Discovery of the Eiger gene, a TNF homolog, for the first time in malaria vectors, within an inversion region (3Li) implicated in Plasmodium resistance and desiccation (PMID: 36351999). Eiger and its two receptors, wengen and grnd, were modeled using RoseTTAFold to identify amino acids near interaction sites.
- Method development for obtaining chromosome-level assemblies from multiple draft assemblies of STE2 and SDA500.
- A method to genotype the 2Rb inversion directly from transcriptome data, which helps correlate the 2Rb genotype with gene expression.