ThalInd, a β-Thalassemia and Hemoglobinopathies Database for India: Defining a Model Country-Specific and Disease-Centric Bioinformatics Resource
Sujata Sinha,1,2 Michael L. Black,1 Sarita Agarwal,3 Reena Das,4 Alan H. Bittles,1,5 and Matthew Bellgard1
1Centre for Comparative Genomics, Murdoch University, Perth, Australia; 2Thalassemia Working Group, Varanasi, India; 3Sanjay Gandhi Post
Graduate Institute of Medical Sciences, Lucknow, India; 4Postgraduate Institute of Medical Education and Research, Chandigarh, India; 5Edith
Cowan University, Perth, Australia
Communicated by Richard G.H. Cotton
Received 19 November 2010; accepted revised manuscript 29 March 2011.
Published online 21 April 2011 in Wiley Online Library (www.wiley.com/humanmutation). DOI 10.1002/humu.21510
ABSTRACT: Web-based informatics resources for genetic disorders have evolved from genome-wide databases like OMIM and HGMD to Locus-Specific Databases (LSDBs) and National and Ethnic Mutation Databases (NEMDBs). However, with the increasing amenability of genetic disorders to diagnosis and better management, many previously underreported conditions are emerging as disorders of public health significance. In turn, the greater emphasis on noncommunicable disorders has generated a demand for comprehensive and relevant disease-based information from end-users, including clinicians, patients, genetic epidemiologists, health administrators, and policymakers. To accommodate these demands, country-specific and disease-centric resources are required to complement the existing LSDBs and NEMDBs. Currently available preconfigured Web-based software applications can be customized for this purpose. The present article describes the formulation and construction of a Web-based informatics resource for β-thalassemia and other hemoglobinopathies, initially for use in India, a multiethnic, multireligious country with a population approaching 1,200 million. The resource ThalInd (http://ccg.murdoch.edu.au/thalind) has been created using the LOVD system, an open source platform-independent database system. The system has been customized to incorporate and accommodate data pertinent to molecular genetics, population genetics, genotype-phenotype correlations, disease burden, and infrastructural assessment. Importantly, the resource also has been aligned with the administrative health system and demographic resources of the country.
Hum Mutat 32:887-893, 2011. © 2011 Wiley-Liss, Inc.
KEY WORDS: bioinformatics resource; database; β-thalassemia; hemoglobinopathies; India; ThalInd
The evolution of Web-based informatics resources for genetic disorders can be traced through three distinct phases: (1) genome-wide mutation databases, (2) locus-specific databases (LSDBs), and (3) national and ethnic mutation databases (NEMDBs). Genome-wide databases, such as OMIM, HGMD, and Ensembl, contain pooled information on all genes and incorporate advanced tools for gene analysis with a user interface. LSDBs were designed so that researchers dealing with a specific disease can retrieve current data from a single source and thus need to search no further than an LSDB [Cotton, 2009]. The majority of LSDBs incorporate tools for the analysis of gene expression and the phenotype in normal and disease conditions [Patrinos, 2006]. NEMDBs represent the third phase and were devised to provide information on disease-causing mutations and their frequencies in different population groups within a country. They can help in the optimization of molecular diagnostic services and the creation of appropriate awareness among clinicians, scientists, and the general public about genetic disorders that may be prevalent in different populations and communities [Patrinos, 2006]. The development of disease-specific national resources could therefore be considered the next critical phase in the evolution of Web-based informatics resources on genetic disorders.
With the increasing amenability of many genetic disorders to prevention, early diagnosis, and better management through early intervention and increasing curative options, informatics resources also need to be extended and up-scaled to provide relevant information to a wide range of potential users, including clinicians, patients, genetic epidemiologists, health administrators, and policymakers. It also is desirable that data and issues of this nature can be accommodated within the purview of a national health system. Disease-specific national resources that accommodate information on disease burden, treatment options, and existing facilities for diagnosis, and that track the utilization of these services, would be of much greater relevance to society.
In this article we introduce and discuss the design of a country-specific bioinformatics resource for a genetic disorder of widespread, major public health significance. The highlight of the conceptual model is alignment with national demographic, administrative, and health systems, illustrated by its application to β-thalassemia and other hemoglobinopathies, autosomal recessive disorders adversely affecting the health of large numbers of people worldwide. The resource has been specifically designed for adoption in India, a large and demographically complex country, but it has the additional potential to serve as a prototype for other genetic disorders that impact on health in all low- and middle-income countries.
Additional Supporting Information may be found in the online version of this article.
Correspondence to: Matthew Bellgard, Centre for Comparative Genomics, Murdoch University, South Street, Perth, WA 6150, Australia.
Bioinformatics Requirements of β-Thalassemia and Hemoglobinopathies in the Indian Context
It has been suggested that in the near future β-thalassemia and related disorders are likely to emerge as the category of genetic disease that will have the most widespread impact on public health and health resources in India [Agarwal, 2005; Petrou, 2010; Weatherall, 2010]. The autosomal recessive disease β-thalassemia is the most complex disease among the larger group of inherited hemoglobin disorders. The β-globin gene itself is located on chromosome 11p15.5, with 242 mutations reported [Giardine et al., 2007]. However, expression of the β-globin gene is influenced by secondary and tertiary genetic modifiers of the disease phenotype, resulting in extensive phenotypic diversity [Weatherall, 2001]. The prevalence of symptomatic or clinically silent hemoglobin variants such as HbE, HbS, and HbD within the same population subsets further contributes to diverse phenotypes, resulting in thalassemic hemoglobinopathies and homozygous and compound hemoglobinopathies. Stem cell transplantation is the only curative option currently available to patients, but in low-income countries its adoption has been restricted because of limited donor availability, the high costs involved, and the small number of specialist centers. As a result, a large majority of patients remain reliant on chronic management regimens involving regular blood transfusions and iron chelation therapy, which places a huge burden on the national health resources and on the resources of patients, their families, and communities. Given these circumstances, and the large numbers of patients involved [Sinha et al., 2009], it is envisaged that a comprehensive national information resource on thalassemia could greatly aid healthcare delivery and control strategies [IUSSTF, 2007].
The occurrence and prevalence of recessive mutations in a particular population may be dependent on marriage and reproductive practices [Sinha et al., 2009]. The complex, highly stratified structure of the Indian population, characterized by the unique, long-established caste system, has been further complicated by multiple waves of immigration and subdivisions based on six major religions and 22 major spoken languages [Black et al., 2010]. With a multifaceted population history of this nature, a thorough knowledge and understanding of local community structure is required in order to devise an effective and relevant informatics resource for β-thalassemia.
Health policies in India are formulated at national level and implemented by individual states according to the directives of the national government, with each state organizing its own health infrastructure via a Department for Health and Family Welfare. The decennial Census of India (www.censusindia.gov.in) is the major national demographic resource, and serves as a reference point for most national planning and policy decisions. The National Rural Health Mission (NRHM) (http://mohfw.nic.in/nrhm.htm), a major initiative of the national government, focuses on improving healthcare delivery to the rural areas where >70% of the population reside.
A national bioinformatics resource aligned with the health administrative system should therefore facilitate: (1) the implementation of prevention and control programs by public health authorities; (2) an improvement in the availability and accessibility of thalassemia care services in all areas, rural and urban; (3) clinical research to improve chronic management regimens; and (4) research in related disciplines, including population genetics, public health, and clinical medicine.
Structuring the Web-Based National Resource for β-Thalassemia and Hemoglobinopathies
Given the immense size of the population, the very significant level of regional diversity, and the current health administrative system, the creation of a pan-Indian Web-based resource by merging information generated at State level is a logical progression. The key components required of an effective β-thalassemia Web-based resource are outlined in Figure 1.
The core component of this resource, called ThalInd, is a central Web-based curatable database of unique β-globin gene variants, derived from the entry records of patient(s) carrying a pathogenic mutation on one or both alleles. This curatable resource captures data from published studies and direct submissions from clinical and research laboratories, identifying them as submitters and submission centers, respectively. Data mined from published studies and those received through direct submission would be analyzed, verified, and curated to transform them into a State level repository of information capable of ensuring data output in a user-friendly manner to meet the needs of a wide and heterogeneous user base.
In creating country-specific bioinformatics resources, it is important to recognize that, for inherited disorders of public health significance, the impact of planning and strategies for health outcomes may be evident only over quite lengthy time spans. It therefore is imperative that the design of such resources is modular and scalable to accommodate future issues and challenges.
LOVD (Leiden Open Variation Database system), a popular Web application for LSDBs, was chosen for the current resource. It is freely available, platform-independent, and is installed on an HTTP server [Fokkema et al., 2005].
The major advantage offered by LOVD for the present defined purposes is its flexibility for customization and extension, thus enabling the identification and addition of data items required for the extraction of a wide range of information at the user end. The provision of five different access levels, that is, Administrator, Manager, Curators, Submitters, and General User interface, makes it readily applicable to a multicenter system that can be managed by a consortium of expert advisors or curators facilitating data collection, verification, and submission from various centers and for dissemination of information to the general public. Further, as and when deemed necessary, the entire dataset can be retrieved and transferred to another suitable Web-based application.
A major limitation in adapting LOVD for an autosomal recessive disorder like β-thalassemia is in representing both alleles from affected individual(s) to retrieve information on genotype. For variant data, LOVD offers a complex system of assigning two-letter codes to each unique variant, which when suffixed to variant allele, denotes the other allele. Due to the lack of a ready alternative we have adopted this system in the current resource. For the patient data we have devised a simpler system, which is elaborated below.
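To make the representation problem concrete, the following Python sketch groups allele-level rows by individual and labels the resulting genotype. It is an illustration only: it reproduces neither the LOVD two-letter code scheme nor the authors' patient-level system, and the field names (`patient_id`, `variant`), the placeholder variant labels, and the sample records are all hypothetical.

```python
from collections import defaultdict

def genotype_by_patient(allele_rows):
    """Group allele-level rows by individual and label each genotype."""
    variants = defaultdict(list)
    for row in allele_rows:
        variants[row["patient_id"]].append(row["variant"])

    genotypes = {}
    for patient, found in variants.items():
        if len(found) == 1:
            label = "heterozygous"            # one variant allele recorded
        elif len(found) == 2 and found[0] == found[1]:
            label = "homozygous"              # same variant on both alleles
        elif len(found) == 2:
            label = "compound heterozygous"   # two different variant alleles
        else:
            label = "needs review"            # more than two alleles recorded
        genotypes[patient] = (tuple(sorted(found)), label)
    return genotypes

# Hypothetical records: one row per variant allele, as in an allele-based table.
rows = [
    {"patient_id": "P1", "variant": "var_A"},
    {"patient_id": "P1", "variant": "var_A"},
    {"patient_id": "P2", "variant": "var_A"},
    {"patient_id": "P2", "variant": "var_B"},
    {"patient_id": "P3", "variant": "var_B"},
]
result = genotype_by_patient(rows)
```

The sketch shows why an allele-per-row layout needs some linking key or code before genotype-level questions can be answered: without the shared identifier, the two rows of a compound heterozygote are indistinguishable from two unrelated carriers.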
Sources of Start-Up Data
The publication of molecular data on thalassemia and hemoglobinopathies in Indian populations commenced in the 1980s, and during the last three decades descriptive publications from various centers and groups located in different parts of the country have followed. A systematic meta-analysis of published data on β-thalassemia undertaken by the authors [Sinha et al., 2009] is the major source of validated and authenticated data for the current initiative. Unpublished anonymized data from individual patients and carriers submitted by one of the submission centers (Centre for Genetic Disorders, Banaras Hindu University, Varanasi, India) have been included to demonstrate the capacity of the database to accommodate both published and unpublished data from different sources. Data on the structural hemoglobin variants HbS, HbE, and HbD from these published studies have also been included as variant alleles. The start-up data therefore comprise 61 unique variants collated from 9,401 alleles derived from published studies, together with 142 alleles from 105 individuals (Supp. Fig. S1).
Customizing the LOVD System
Customizing and formatting a database requires the establishment of methods and protocols for data capture and data curation, and for enabling data output in user-friendly formats.
Data capture is facilitated by evolving a format for merging information from different sources, to enable both the high-throughput transfer of data generated in research and referral laboratories and individual entries from clinical laboratories. A spreadsheet consisting of required data items as columns and individual entries corresponding to alleles as rows provided an appropriate template for the current LOVD installation. The prior assessment of bioinformatics requirements provided a guide for filtering available information on disease and incorporating relevant information as data items by the addition of specific custom columns to the variant and patient tables in the LOVD format.
State is the only mandatory data item that has to be specified for every entry to be uploaded in the database, other than those required by the LOVD system. Variant molecular data have been limited to basic information on specific variation, including the DNA change, its type, location, pathogenicity, and frequency. Provision has been made to record the national and regional frequencies of variant alleles, in addition to those present at State level.
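The capture rules described here can be illustrated with a minimal Python sketch. It assumes a hypothetical spreadsheet export with one row per allele and columns named `state`, `region`, and `dna_change`; these column names, the placeholder values, and the helper functions are illustrative assumptions, not the actual ThalInd or LOVD column definitions. Rows lacking the mandatory State value are set aside, and variant allele frequencies are then tallied at State level and for the pooled national dataset.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical column names; the real ThalInd/LOVD custom columns may differ.
REQUIRED = ("state", "dna_change")

# Made-up sample rows (one per allele), for illustration only.
SAMPLE = """state,region,dna_change
State X,North,var_A
State X,North,var_A
State X,North,var_B
State Y,North,var_A
,North,var_C
"""

def load_alleles(text):
    """Split rows into those with all required items and those without."""
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(text)):
        if all((row.get(col) or "").strip() for col in REQUIRED):
            accepted.append(row)
        else:
            rejected.append(row)  # e.g. missing the mandatory State
    return accepted, rejected

def frequencies(rows, level=None):
    """Return {group: {variant: fraction of alleles}}; level=None pools all rows."""
    counts = defaultdict(Counter)
    for row in rows:
        group = row[level] if level else "India"
        counts[group][row["dna_change"]] += 1
    return {
        group: {variant: n / sum(c.values()) for variant, n in c.items()}
        for group, c in counts.items()
    }

accepted, rejected = load_alleles(SAMPLE)
by_state = frequencies(accepted, "state")
national = frequencies(accepted)
```

Passing `"region"` as the level would give the regional tally in the same way, so a single allele-level table can serve all three reporting levels.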
To facilitate queries from a wide set of users, the important data items in the patient table include all of the terms commonly used to refer...