Approximately 0.1% (3 million) of the 3 billion DNA bases composing a human genome differ between two individuals. Conventional genetic studies have identified many important mutations that cause genetic diseases inherited directly from the parents. Diseases such as cystic fibrosis and hemophilia are caused by mutations in the coding regions of a single gene. Unfortunately, a vast number of genetic diseases, such as diabetes and cancer, result from the combination of multiple mutations.
Considering that only ~3% of the genome corresponds to genes coding for proteins, the vast majority of genetic variations are located in the non-coding portions of the genome. The switches that help turn genes on and off are located in these non-coding regions, but the precise locations of all of these switches are still unclear. Complex genetic diseases tend to be over-represented in affected families, but they usually do not have a clear pattern of inheritance. This could be due to the accumulation of subtle deregulations of many genes caused by mutations in these switches. Efficient bioinformatics methods already exist to interpret mutations in the ~3% of the genome that codes for proteins, but interpreting variants in the non-coding portions remains a challenge.
Given the recent advances in sequencing technologies, the complete genome sequences of patients are becoming increasingly accessible. This emphasizes the need for novel and innovative tools to interpret variations in the non-coding portions of the genome. The goal of the present proposal is to contribute to the next revolution in the understanding of human diseases with genetic components by designing a bioinformatics tool, VariantIP, that will leverage thousands of public data sets to facilitate the interpretation of all genetic variants. We also want to demonstrate the advanced capabilities of VariantIP by characterizing the variants involved in a complex genetic disease such as kidney cancer.