BIOINFORMATICIAN
AI/ML Research Scientist specializing in computational methods for analysis, modeling, and prediction in medical research and protein structure. Experienced in developing and deploying bioinformatics pipelines for large-scale datasets, including next-generation sequencing data. Proficient in Python and Linux/Unix, with expertise in experimental design, data interpretation, and machine learning applications.
TECHNICAL SKILLS
Programming
Python, Bash, Go, Java
Packages
Hail, PyMOL, Pandas, NumPy, PyTorch
Matplotlib, Scikit-learn
Tools
Git, GCP, Linux (Ubuntu), Conda, Amber (Molecular Dynamics package)
EDUCATION
Master of Science in Computational Biology
Carnegie Mellon University
2021 - 2023
Bachelor of Arts in Biology
Grinnell College
2012 - 2018
RESEARCH
Substance Abuse Prediction in Depression Patients using Deep Learning
Mar 2024 - Ongoing
Project Advisor: Dr. LiRong Wang
University of Pittsburgh, Pittsburgh, PA
Designed and implemented a Python pipeline for processing Electronic Health Record (EHR) data, adapting a BERT transformer to use EHR codes for a single-disease prediction task on the NIH All of Us platform.
Predicted risk of alcohol and substance abuse in at-risk patients with an ROC AUC of 0.95 and accuracy of 92%, surpassing the baseline logistic regression model by 0.13 in ROC AUC and 5% in accuracy.
Hidden Markov Model Guided Predictive Enzyme Mutagenesis
Oct 2023 - Jan 2024
Project Advisor: Dr. Peng Liu
University of Pittsburgh, Pittsburgh, PA
Used the Amber package to run protein-ligand Molecular Dynamics (MD) simulations and extract time-series atomic-distance data from the MD trajectories.
Implemented a Python pipeline that applies a Hidden Markov Model (HMM) to atomic-distance data to identify candidate active-site positions for mutation, with the goal of improving enzyme-substrate binding specificity.
Preterm Infant Growth Trajectory Prediction using Microbiome
May 2022 - Aug 2022
Project Advisors: Ziv Bar-Joseph, Jose Lugo-Martinez
CMU, Pittsburgh, PA
Developed an HMM classifier in Python from scratch to predict growth trajectories (normal vs. faltering) in preterm infants using gut microbiome data.
Achieved a moderate ROC AUC of 0.67; the results exposed limitations of HMMs on sparse microbiome datasets and pointed the research toward a more suitable model, the Input-Output HMM, to improve prediction.
COURSEWORK PROJECTS
Small Molecule Anti-Bacterial Potency Screening using Active Learning
Feb 2023 - May 2023
Using the EMC method, achieved 71% accuracy in predicting the antibacterial effect of small molecules with 40% fewer samples than random sampling needed to reach the same accuracy, demonstrating the method's efficiency in identifying informative samples and reducing the sample size needed for model training.
Motif Search in Glioblastoma Protein-Protein Interaction Network
Feb 2023 - May 2023
Designed a comparative analysis of PPI motifs between random and glioblastoma networks, revealing significantly different motif distributions and predicting motif functions through gene/protein ontology analysis.
PPI Network Motif Finding
Computational Genomics
Network Motif Finding
This project aims to implement some motif-finding algorithms on a biological dataset. The dataset used here is the Glioblastoma Protein-Protein Interactions data compiled by BioGRID. Two types of motif-search strategies are employed:
1. Using predefined motifs: Known biological motifs include Feed-Forward Loops (FFLs), 3-Cycles, BiFans, Single-Input Modules (SIMs) and Multiple-Input Modules (MIMs). In particular, we search for FFLs, 3-Cycles and BiFans in a directed version of the graph.
Motifs are counted using the subgraph census method rather than an exhaustive search. This involves drawing a large number of subsamples of the original network and counting motifs within each subgraph to build a frequency distribution of motif counts. This distribution should be approximately Gaussian, so it can be compared with that of random networks using a standard Z-test. We use randomly generated Barabási-Albert (BA) and Erdős-Rényi (ER) networks here to create "background" frequency distributions.
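The sampling-and-comparison idea above can be illustrated with a minimal, standard-library sketch. It is not the repository's implementation: the function names, the node-sampling scheme, the two-sample form of the Z statistic, and the synthetic toy graphs (used in place of the BioGRID data) are all assumptions made for illustration. Only FFLs and 3-Cycles are counted, and only an Erdős-Rényi background is generated.

```python
import math
import random
import statistics


def count_ffl(adj, nodes):
    """Count feed-forward loops (a->b, b->c, a->c) in the subgraph induced on `nodes`."""
    nodes = set(nodes)
    count = 0
    for a in nodes:
        targets_a = adj.get(a, set()) & nodes
        for b in targets_a:
            if b == a:
                continue  # ignore self-loops
            for c in adj.get(b, set()) & nodes:
                if c != a and c != b and c in targets_a:
                    count += 1
    return count


def count_3cycles(adj, nodes):
    """Count directed 3-cycles (a->b, b->c, c->a) in the subgraph induced on `nodes`."""
    nodes = set(nodes)
    count = 0
    for a in nodes:
        for b in adj.get(a, set()) & nodes:
            if b == a:
                continue
            for c in adj.get(b, set()) & nodes:
                if c != a and c != b and a in adj.get(c, set()):
                    count += 1
    return count // 3  # each cycle is found once per starting node


def erdos_renyi_directed(n, p, rng):
    """Directed Erdős-Rényi graph as {node: set(successors)}; each edge exists with probability p."""
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(n):
            if u != v and rng.random() < p:
                adj[u].add(v)
    return adj


def sampled_counts(adj, counter, sample_size, n_samples, rng):
    """Motif counts over random node-induced subgraphs (assumes every node is a key of `adj`)."""
    nodes = list(adj)
    return [counter(adj, rng.sample(nodes, sample_size)) for _ in range(n_samples)]


def two_sample_z(observed, background):
    """Two-sample Z statistic comparing the means of two count distributions."""
    se = math.sqrt(
        statistics.variance(observed) / len(observed)
        + statistics.variance(background) / len(background)
    )
    if se == 0:
        return float("nan")  # no variation in either sample
    return (statistics.mean(observed) - statistics.mean(background)) / se


# Toy demonstration on synthetic graphs (NOT the glioblastoma network).
rng = random.Random(0)
toy_network = erdos_renyi_directed(60, 0.15, rng)  # hypothetical stand-in for real data
background = erdos_renyi_directed(60, 0.05, rng)   # hypothetical random background
obs = sampled_counts(toy_network, count_ffl, 20, 50, rng)
bg = sampled_counts(background, count_ffl, 20, 50, rng)
print("FFL Z statistic (toy data):", two_sample_z(obs, bg))
```

A large absolute Z value suggests that the motif is over- or under-represented relative to the background. In practice the background networks would be matched to the real network in size and density, which this toy example does not attempt.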
Finally, biological significance of the highest occurring size-k motifs in the undirected network is assessed. This is done by choosing the motifs with the highest frequency, taking all the proteins found in every occurrence, and passing this list to the Gene Ontology (GO) Resource. A list of all proteins found in the glioblastoma network is also used as a reference list to calculate the enrichment of terms in specific motifs. We find highly enriched and statistically significant Biological Processes (BP) relating to cell signaling, protein modifications and metabolism, as well as PANTHER pathways relating to epithelial-mesenchymal transition (EMT) and brain-specific processes such as axon guidance and dopamine signaling.
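To make the enrichment calculation concrete, here is a small sketch of one common form of over-representation test: a one-sided hypergeometric test of a motif's protein list against the reference list. The GO Resource performs its own analysis, and its exact test and multiple-testing correction may differ; the function names and the numbers in the demonstration are hypothetical and are not results from this project.

```python
import math


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated proteins in a list of n,
    drawn without replacement from a reference of N proteins, K of them annotated."""
    total = math.comb(N, n)
    hits = sum(
        math.comb(K, i) * math.comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return hits / total


def fold_enrichment(k, n, K, N):
    """Observed fraction of annotated proteins in the list over the reference fraction."""
    return (k / n) / (K / N)


# Made-up numbers for illustration only:
# 40 proteins in a motif list, 12 carrying a given GO term,
# against a reference of 5000 proteins, 150 of which carry the term.
k, n, K, N = 12, 40, 150, 5000
print("fold enrichment:", fold_enrichment(k, n, K, N))
print("p-value:", hypergeom_upper_tail(k, n, K, N))
```

Because many GO terms are tested at once, the raw p-values would normally be corrected for multiple testing before any term is called significant.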
The code used in this project, the results and the written report are available in this repository.