SUPREME: A Cancer Subtype Prediction Methodology by Integrating High-Dimensional Biological Datasets

Student: Jeanne Su
Mentor: Serdar Bozdag

Project Description:

Cancer, the second leading cause of death in the world, is a complex genetic disease. Every cancer patient is unique in terms of disease progression and response to treatment. In recent years, vast amounts of biological data from cancer tissues have been generated to better characterize cancer biology. Through these efforts, subtypes of some cancer types have been discovered, and tools to predict the subtype of a new patient have been developed. Several of these tools rely on a single type of biological data, such as gene expression or DNA methylation, while others attempt to integrate multiple data types.

Students will work with a team of PhD students and the faculty mentor and contribute to various parts of this project.

Students are expected to be proficient in programming. Experience in molecular biology, basic Linux commands and high performance computing is preferred, but not required.

Student learning objectives: After this project, students will

  • Have a basic understanding of molecular biology and high-dimensional biological datasets.
  • Be familiar with the R or Python programming language and some bioinformatics libraries in those languages.
  • Learn to gather biological data from public repositories.
  • Build a computational pipeline that pre-processes and integrates high-dimensional biological datasets
  • Be familiar with data visualization tools to analyze and visualize gene networks
  • Learn methods to evaluate predictive models by computing true positive rate, false positive rate, precision, recall, ROC curves, etc.
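
As a rough illustration of the evaluation measures listed above, the following Python sketch computes the true positive rate, false positive rate, precision and recall for one subtype treated as the positive class, and sweeps a score threshold to obtain ROC points. The function names, the subtype labels and the scores are hypothetical and are not taken from the project code.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN with one subtype treated as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return TPR (recall), FPR and precision; undefined ratios become 0.0."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tpr": tpr, "fpr": fpr, "precision": precision, "recall": tpr}


def roc_points(y_true, scores, positive):
    """Sweep a threshold over the scores and return (FPR, TPR) points."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A sample is called positive when its score reaches the threshold.
        y_pred = [positive if s >= threshold else None for s in scores]
        m = rates(*confusion_counts(y_true, y_pred, positive))
        points.append((m["fpr"], m["tpr"]))
    return points


# Made-up labels and scores, for illustration only.
labels = ["subtype_A", "subtype_A", "subtype_B", "subtype_B"]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_points(labels, scores, "subtype_A"))
```

For a problem with more than two subtypes, one option is to repeat this computation once per subtype (one-vs-rest) and report the measures per subtype.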

Project Goal:

In this study, we aim to develop a cancer subtype prediction methodology called SUPREME that integrates multiple types of biological data to discover novel cancer subtypes, predict the subtypes of cancer patients, and discover subtype-specific biomarkers. We will test SUPREME on publicly available cancer datasets, such as the breast cancer dataset from The Cancer Genome Atlas (TCGA) project.


Weekly Description:
Week 1
  • Researched different clustering methods in RStudio
Weeks 2 and 3
  • Continued researching and practicing various clustering methods on genes
  • Read the project proposal and papers related to the project's topic
  • Became familiar with the R libraries used most often in bioinformatics research
Week 4
  • Researched and practiced how to calculate coexpression of genes
  • Read more papers related to the project's topic
  • Researched and practiced how to calculate the correlation between mRNA and microRNA
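
The correlation work in Week 4 can be sketched as follows. This is a minimal Python example, assuming that Pearson correlation is the measure of interest and that each microRNA and mRNA has one expression value per sample, in the same sample order. The function names, identifiers and values are made up for illustration and do not come from the project code.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def cross_correlation(mirna_expr, mrna_expr):
    """Correlate every microRNA profile with every mRNA profile."""
    return {
        (mirna, gene): pearson(m_values, g_values)
        for mirna, m_values in mirna_expr.items()
        for gene, g_values in mrna_expr.items()
    }


# Hypothetical identifiers and made-up values across four samples.
mirna_expr = {"miR-X": [1.0, 2.0, 3.0, 4.0]}
mrna_expr = {"GENE_A": [8.0, 6.0, 4.0, 2.0], "GENE_B": [1.0, 3.0, 5.0, 7.0]}
for pair, r in cross_correlation(mirna_expr, mrna_expr).items():
    print(pair, round(r, 3))
```

A strongly negative value for a microRNA and one of its candidate targets is the pattern usually looked for, since microRNAs tend to repress their target mRNAs.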
Week 5
  • Mini presentation on the research done so far
  • Downloaded and tidied microRNA and transcription factor (TF) datasets that list microRNAs/TFs and their target genes
  • Created functions for both datasets that take a microRNA or TF and return its target genes, and vice versa
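
The two-way lookup described in Week 5 could look roughly like the Python sketch below. It assumes the tidied dataset has been reduced to (regulator, target gene) pairs; all names in it are hypothetical placeholders rather than entries from the actual microRNA or TF datasets.

```python
from collections import defaultdict


def build_lookups(pairs):
    """Index (regulator, target gene) pairs in both directions."""
    targets_of = defaultdict(set)
    regulators_of = defaultdict(set)
    for regulator, gene in pairs:
        targets_of[regulator].add(gene)
        regulators_of[gene].add(regulator)
    return targets_of, regulators_of


def get_targets(targets_of, regulator):
    """Return the sorted target genes of a microRNA or TF (empty if unknown)."""
    return sorted(targets_of.get(regulator, ()))


def get_regulators(regulators_of, gene):
    """Return the sorted microRNAs/TFs that target a gene (empty if unknown)."""
    return sorted(regulators_of.get(gene, ()))


# Placeholder pairs, for illustration only.
pairs = [
    ("miR-X", "GENE_A"),
    ("miR-X", "GENE_B"),
    ("TF-Y", "GENE_A"),
]
targets_of, regulators_of = build_lookups(pairs)
print(get_targets(targets_of, "miR-X"))
print(get_regulators(regulators_of, "GENE_A"))
```

Building both indexes once keeps each lookup fast, at the cost of storing every pair twice.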
Week 6
  • Tidied a different microRNA dataset
  • Continued working on the correlation code
  • Read papers related to gene coexpression
  • Created a classification function that predicts cancer subtype given a gene expression dataset
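
The log does not say which classifier the Week 6 function used. As one simple possibility, the Python sketch below uses a nearest-centroid rule: each subtype is summarized by the mean expression profile of its training samples, and a new sample is assigned to the subtype with the closest centroid. The function names, subtype labels and values are assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_centroids(profiles, labels):
    """Average the expression profiles of each subtype into one centroid."""
    grouped = defaultdict(list)
    for profile, label in zip(profiles, labels):
        grouped[label].append(profile)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict_subtype(centroids, profile):
    """Return the subtype whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], profile))


# Made-up training data: rows are samples, columns are genes.
train_profiles = [[1.0, 5.0], [1.2, 4.8], [6.0, 1.0], [5.8, 1.2]]
train_labels = ["subtype_A", "subtype_A", "subtype_B", "subtype_B"]
centroids = fit_centroids(train_profiles, train_labels)
print(predict_subtype(centroids, [1.5, 4.0]))
```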
Week 7
  • Finished correlation code
  • Created a correlation function to compute the correlation between two datasets
  • Began writing final report
Week 8
  • Created a correlation function to compute pairwise correlations within a single dataset
  • Began making a poster
  • Downloaded 2 more TF datasets to tidy
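
Pairwise correlation within one dataset, as in Week 8, can be sketched in Python as follows. The approach here standardizes each gene's profile once, so that the Pearson correlation of any two genes is simply the mean of the products of their standardized values. The function names, the 0.8 cutoff and the values are assumptions for illustration, not the project's own choices.

```python
from itertools import combinations
from statistics import mean, pstdev


def standardize(values):
    """Convert a profile to z-scores using the population standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("a constant profile has no defined correlation")
    return [(v - mu) / sd for v in values]


def pairwise_correlations(expression):
    """Pearson correlation for every unordered pair of genes in one dataset."""
    z = {gene: standardize(values) for gene, values in expression.items()}
    return {
        (g1, g2): mean([a * b for a, b in zip(z[g1], z[g2])])
        for g1, g2 in combinations(sorted(z), 2)
    }


def coexpression_edges(expression, threshold=0.8):
    """Keep gene pairs whose absolute correlation reaches the threshold."""
    return {
        pair: r
        for pair, r in pairwise_correlations(expression).items()
        if abs(r) >= threshold
    }


# Made-up expression values across four samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [3.0, 1.0, 4.0, 2.0],
}
for pair, r in coexpression_edges(expression).items():
    print(pair, round(r, 3))
```

The retained pairs can be read as the edges of a gene coexpression network.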
Week 9
  • Completed poster
  • Tidied the TF datasets
  • Continued writing final report
Week 10
  • Presented poster at poster session
  • Prepared and presented the final research presentation
  • Completed final report
  • Normalized gene expression data
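
The log does not record which normalization was applied to the gene expression data. One common choice, shown here only as an assumed example in Python, is a log2 transform with a pseudocount followed by per-gene z-scoring. The function names and the pseudocount of 1 are illustrative.

```python
import math
from statistics import mean, pstdev


def log2_transform(values, pseudocount=1.0):
    """Compress the dynamic range; the pseudocount avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


def zscore(values):
    """Center a gene's profile at 0 and scale it to unit standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant gene: nothing to scale
    return [(v - mu) / sd for v in values]


def normalize_gene(values):
    """Log-transform, then z-score, one gene's values across samples."""
    return zscore(log2_transform(values))


# Made-up raw expression values for one gene across four samples.
print(normalize_gene([0.0, 1.0, 3.0, 7.0]))
```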