Genome

My Research

My research interests revolve around developing computational methods to help scientists understand our genome. Currently, I am interested in uncovering the role repetitive DNA sequences, known as tandem repeats, play in cancer.

In most cases, repetitive DNA is harmless. But when a cell copies its DNA, mutations can occur that expand or contract a tandem repeat, and a repeat that has mutated in this way can become pathogenic.

There are multiple challenges with studying repetitive DNA. Right now, I'm focused on building the infrastructure that life science researchers need to probe tandem repeats computationally. Some of the issues I aim to address include:

It is difficult to accurately identify repeat expansions in cancer

Problem. Current tools are not well suited to estimating the sizes of tandem repeats in diseases such as cancer, where chromosomal amplifications distort the estimates.

Solution. I developed a computational tool that normalizes the estimated size of a repeat using the sequencing depth of the regions surrounding it. The tool ran efficiently on hundreds of candidate repeat expansions associated with cancer across more than 3,000 samples, which amounts to more than 1,200,000 gigabytes of data.

Impact. The tool helps identify repeat expansions that are not real but merely artifacts of the disease. I applied the tool to a list of candidate repeat expansions as part of a research project, and independent in vitro validation showed that the accuracy of the list increased as a result.
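To make the idea of depth-based normalization concrete, here is a minimal sketch. It is not the tool described above. It assumes that the raw size estimate scales linearly with read depth, so dividing by the ratio of local flanking depth to genome-wide depth corrects for an amplification. The function name, its parameters, and the example numbers are all hypothetical.

```python
from statistics import median


def normalize_repeat_size(estimated_repeat_units, flank_depths, genome_median_depth):
    """Scale a raw repeat-size estimate by the local amplification factor.

    Assumption (illustrative only): the raw estimate grows linearly with
    read depth, so an amplified region inflates it proportionally.
    """
    if not flank_depths:
        raise ValueError("need at least one flanking depth value")
    if genome_median_depth <= 0:
        raise ValueError("genome-wide depth must be positive")

    # Depth in the regions flanking the repeat, summarized robustly.
    local_depth = median(flank_depths)
    if local_depth <= 0:
        raise ValueError("flanking depth must be positive")

    # Values above 1 suggest the region is amplified relative to the genome.
    amplification = local_depth / genome_median_depth
    return estimated_repeat_units / amplification


# Hypothetical example: the flanks have twice the genome-wide depth,
# so the raw estimate of 100 repeat units is halved.
normalized = normalize_repeat_size(100, [60, 60, 58, 62], 30)
```

Using the median of the flanking depths, rather than the mean, keeps a single outlier window from dominating the correction.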

Read the preprint on bioRxiv

Half our genome is repetitive, so it's hard to identify what's relevant

Problem. Many of the short tandem repeats (STRs) in the state-of-the-art catalog are not relevant to disease or changes in phenotype.

Solution. My collaborators at Illumina developed a tool that analyzes the genomes of many individuals in the general population and identifies STRs that vary in length across individuals. By comparing the resulting catalog with previous catalogs, I demonstrated that the STRs in the new catalog are more likely to be associated with changes in gene expression.

Impact. A catalog that is demonstrably more relevant to changes in gene expression than current ones gives the scientific community a valuable new resource. The new catalog helps researchers identify novel pathogenic and functional STRs and investigate their biological and clinical relevance.
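As a rough illustration of what "varying in length across individuals" can mean in practice, the sketch below flags a locus as polymorphic when enough alleles differ from the most common repeat length. This is not the Illumina tool. The input layout, the locus names, and the 5% threshold are assumptions chosen only for the example.

```python
from collections import Counter


def polymorphic_loci(genotypes, min_variant_fraction=0.05):
    """Return loci whose alleles vary in length across individuals.

    genotypes maps a locus name to a list of repeat-unit counts, one per
    observed allele (hypothetical layout). A locus is kept when the share
    of alleles differing from the most common length reaches the threshold.
    """
    selected = []
    for locus, alleles in genotypes.items():
        if not alleles:
            continue  # no calls at this locus, nothing to assess
        modal_count = Counter(alleles).most_common(1)[0][1]
        variant_fraction = 1 - modal_count / len(alleles)
        if variant_fraction >= min_variant_fraction:
            selected.append(locus)
    return selected


# Hypothetical data: locusA never changes length, locusB does in 25% of alleles.
example = {
    "locusA": [10] * 20,
    "locusB": [10] * 15 + [12] * 5,
}
catalog = polymorphic_loci(example)
```

A real catalog would also need to account for genotyping error and sample size, which this sketch ignores.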

Take a look at the slide deck

Header photo by Warren Umoh on Unsplash