DANIEL LEE

Contact

BIOINFORMATICIAN

AI/ML Research Scientist, specializing in applying computational methods to analyze, model, and predict in the domains of medical research and protein structures. Experienced in developing and deploying bioinformatics pipelines for large-scale datasets including next-generation sequencing data. Proficient in Python and Linux/Unix, with expertise in experimental design, data interpretation, and machine learning applications.

TECHNICAL SKILLS

Programming

Python, Bash, Go, Java

Packages

Hail, PyMOL, pandas, NumPy, PyTorch, Matplotlib, scikit-learn

Tools

Git, GCP, Linux (Ubuntu), Conda, Amber (molecular dynamics package)

EDUCATION

Master of Science in Computational Biology

Carnegie Mellon University
2021 - 2023

Bachelor of Arts in Biology

Grinnell College
2012 - 2018

RESEARCH

Substance Abuse Prediction in Depression Patients using Deep Learning
Mar 2024 - Ongoing

Project Advisor: Dr. LiRong Wang
University of Pittsburgh, Pittsburgh, PA

Designed and implemented a Python pipeline for processing Electronic Health Record (EHR) data, adapting a BERT transformer to use EHR codes for a single-disease prediction task on the NIH All of Us platform.

Predicted risk of alcohol and substance abuse in at-risk patients with an ROC AUC of 0.95 and accuracy of 92%, surpassing the baseline logistic regression model by 0.13 in ROC AUC and 5% in accuracy.

Hidden Markov Model Guided Predictive Enzyme Mutagenesis
Oct 2023 - Jan 2024

Project Advisor: Dr. Peng Liu
University of Pittsburgh, Pittsburgh, PA

Used the Amber package to run protein-ligand Molecular Dynamics (MD) simulations and extract time-series atomic distance data from the MD trajectories.

Implemented a Python pipeline that uses a Hidden Markov Model (HMM) on atomic-distance data to identify candidate mutation positions in the active site, with the goal of improving enzyme-substrate binding specificity.

Preterm Infant Growth Trajectory Prediction using Microbiome
May 2022 - Aug 2022

Project Advisors: Ziv Bar-Joseph, Jose Lugo-Martinez
CMU, Pittsburgh, PA

Developed an HMM classifier in Python from scratch to predict growth trajectories (normal vs. faltering) in preterm infants using gut microbiome data.

Achieved a moderate AUC-ROC of 0.67 and identified limitations of HMMs on sparse microbiome datasets, guiding the research toward a more suitable model, the Input-Output HMM, to improve prediction.

COURSEWORK PROJECTS

Small Molecule Anti-Bacterial Potency Screening using Active Learning
Feb 2023 - May 2023

Implemented the Expected Model Change (EMC) active learning method in Python with logistic regression as the base model; the script can be applied to general binary classification tasks on tabular datasets.

Using the EMC method, achieved 71% accuracy in predicting the antibacterial effect of small molecules with 40% fewer samples than random sampling needed to reach the same accuracy, demonstrating that the method reduces training sample size while still identifying informative samples.

Motif Search in Glioblastoma Protein-Protein Interaction Network
Feb 2023 - May 2023

Developed a Python pipeline for identifying motifs in glioblastoma protein-protein interaction (PPI) networks, using the NetworkX package for network generation and a sub-graph isomorphism algorithm for motif search.

Designed a comparative analysis of PPI motifs between random and glioblastoma networks, revealing significantly different motif distributions and predicting motif functions through gene/protein ontology analysis.

HMM GUIDED MUTATION

Python 3.7.12, conda 23.11.0

Link to Repository

Hidden Markov Model Guided Amino Acid Mutagenesis

Given the molecular dynamics trajectory and parameter files of an enzyme-substrate complex as input, the pipeline outputs amino acid substitution suggestions.

Running Python Script

Output File System

Figures related to RMSD indicator data are generated only when HAS_INDICATOR is False.

Overview

This project began with the idea of generating additional information about the enzyme-substrate complex that can be used to choose beneficial mutation positions, since methods such as directed evolution choose protein mutation sites at random.

The aim is to exploit molecular dynamics (MD) data of the enzyme-substrate complex to locate amino acid (AA) mutation positions in the active site that could increase binding affinity and specificity. Once the positions are located, an alternate AA is suggested for each. To achieve this, a Hidden Markov Model (HMM) is used to identify optimal positions of the substrate in the active site during the MD simulation, and substitutions are proposed based on the property and length of the AA.

Pipeline Steps

From the MD trajectory data, extract distance measurements between active-site AA atoms and substrate atoms.
Using the distance data, train a Gaussian Emission HMM and label each frame/time point of the test data with a hidden state.
Score each hidden state using indicator data and get the emission parameters (per-feature means and standard deviations) of the highest-scoring hidden state.
Suggest AA substitutions based on AA properties, using the emission parameters of the highest-scoring state.

Input: parameters.yml

The path to parameters.yml is passed to main.py using the "-p" or "-parameter" flag. The input parameters in the parameters.yml file are described below.

Output: final_output.csv

Current Residue: original residue at the given residue index of FAP
Nearby 3 substrate atoms: the three substrate atoms closest to the current residue, given as a Python dictionary of the form atom index: [atom name, residue_alphacarbon_to_atom_mean_distance, standard_deviation]
AA Suggestion and Length: suggested amino acid and the suggested amino acid’s length
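
A cell of the "Nearby 3 substrate atoms" column could be read back roughly as follows. This is a minimal sketch that assumes the cell is stored as a Python dictionary literal in the form described above; the helper name parse_nearby_atoms and the atom indices, names, and values in the example are made up for illustration and do not come from the pipeline.

```python
import ast


def parse_nearby_atoms(cell):
    """Turn one 'Nearby 3 substrate atoms' cell into a plain dictionary.

    Assumes the cell text is a Python dict literal of the form
    {atom_index: [atom_name, mean_distance, standard_deviation]}.
    """
    parsed = ast.literal_eval(cell)  # safe evaluation of literals only
    return {
        int(index): {
            "atom_name": values[0],
            "mean_distance": float(values[1]),
            "std": float(values[2]),
        }
        for index, values in parsed.items()
    }


# Hypothetical cell content, not real pipeline output.
example_cell = "{12: ['O1', 4.2, 0.3], 15: ['N2', 5.0, 0.6], 7: ['C3', 5.5, 0.4]}"
atoms = parse_nearby_atoms(example_cell)
closest = min(atoms, key=lambda index: atoms[index]["mean_distance"])
print(closest, atoms[closest]["atom_name"])
```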

Pipeline Description

An HMM is used to model the relationship between the active site and the substrate, using distance measurements between them from the simulation. Time-series distance data extracted from the MD trajectory is used to train a Gaussian Emission HMM.

The features consist of (a) the distances from the alpha-carbon of each AA found near the substrate to each substrate atom and (b) the distances from side-chain hydrogen bond donor/acceptor atoms to substrate hydrogen bond donor/acceptor atoms. Test data is sampled at an even interval from the beginning to the end of the MD trajectory in order to survey the entire trajectory.
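
The feature construction and even sampling described above can be sketched in plain Python. This is an illustration only, not the pipeline's code: it assumes each frame is available as a mapping from atom index to (x, y, z) coordinates, and the function names and atom indices are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def frame_features(frame, residue_atoms, substrate_atoms):
    """Distances from every selected residue atom to every substrate atom.

    frame maps a (hypothetical) atom index to its (x, y, z) coordinates.
    The result is one feature vector for one frame of the trajectory.
    """
    return [
        distance(frame[r], frame[s])
        for r in residue_atoms
        for s in substrate_atoms
    ]


def evenly_spaced_indices(n_frames, n_samples):
    """Frame indices spread evenly from the first frame to the last."""
    if n_samples < 2 or n_samples > n_frames:
        raise ValueError("need 2 <= n_samples <= n_frames")
    step = (n_frames - 1) / (n_samples - 1)
    return [round(i * step) for i in range(n_samples)]


# Toy frame: atom 0 stands in for an alpha-carbon, atoms 1 and 2 for substrate atoms.
frame = {0: (0.0, 0.0, 0.0), 1: (3.0, 4.0, 0.0), 2: (0.0, 0.0, 2.0)}
print(frame_features(frame, residue_atoms=[0], substrate_atoms=[1, 2]))
print(evenly_spaced_indices(n_frames=10, n_samples=4))
```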

Viterbi decoding is then applied to the test data to label each time point with a hidden state, grouping time points by the set of normal distributions their feature values are estimated to have come from. During catalysis, some active-site AAs hold the substrate in place so the reaction can begin, while others stabilize the transition state of the substrate and participate directly in catalysis. By examining each feature's distance distribution over the frames labeled with a favorable state (one with a high probability of reaction), it is possible to recognize active-site positions whose AA is either too short or too long. Such AAs are candidates for substitution.
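
To make the labeling step concrete, here is a minimal pure-Python sketch of Viterbi decoding for an HMM with independent Gaussian emissions per feature. It assumes the model parameters have already been fitted and uses a diagonal covariance structure, which is an assumption about the pipeline rather than something stated here; all numbers in the example are made up.

```python
import math


def log_gaussian(x, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def viterbi(observations, log_start, log_trans, means, stds):
    """Most likely hidden-state path for a diagonal-covariance Gaussian HMM.

    observations: list of feature vectors, one per frame
    log_start[k]: log probability of starting in state k
    log_trans[j][k]: log probability of moving from state j to state k
    means[k][d], stds[k][d]: emission parameters of feature d in state k
    """
    n_states = len(log_start)

    def log_emission(k, obs):
        return sum(log_gaussian(x, m, s) for x, m, s in zip(obs, means[k], stds[k]))

    score = [log_start[k] + log_emission(k, observations[0]) for k in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        prev = score
        pointers, score = [], []
        for k in range(n_states):
            # Best predecessor state for landing in state k at this frame.
            best_j = max(range(n_states), key=lambda j: prev[j] + log_trans[j][k])
            pointers.append(best_j)
            score.append(prev[best_j] + log_trans[best_j][k] + log_emission(k, obs))
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy example: one distance feature, two states centered at 1.0 and 5.0.
obs = [[1.1], [0.9], [5.2], [4.8], [1.0]]
log_start = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
print(viterbi(obs, log_start, log_trans, means=[[1.0], [5.0]], stds=[[0.5], [0.5]]))
# → [0, 0, 1, 1, 0]
```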

The number of states, a hyperparameter of the HMM, can either be specified directly or searched using the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the log likelihood (LL) of the HMM's joint probability. In the author's experience, LL rose steadily and then plateaued as the number of states increased. It is therefore advised to try a range of state counts and inspect the AIC, BIC, and LL graph generated by the pipeline to find the optimal number of states.
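
For reference, AIC and BIC can be computed from a fitted model's log likelihood as in the sketch below. The parameter count assumes a Gaussian HMM with diagonal covariances (one mean and one variance per feature and state), which may differ from the configuration the pipeline uses; the log-likelihood values in the example are invented.

```python
import math


def n_free_parameters(n_states, n_features):
    """Free parameters of a Gaussian HMM with diagonal covariances (assumed)."""
    start = n_states - 1                     # start probabilities sum to 1
    transitions = n_states * (n_states - 1)  # each transition row sums to 1
    emissions = 2 * n_states * n_features    # one mean and one variance per feature
    return start + transitions + emissions


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: lower is better."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


def best_n_states_by_bic(log_likelihoods, n_features, n_samples):
    """Pick the candidate state count with the lowest BIC.

    log_likelihoods maps a candidate number of states to the LL of the
    model fitted with that many states.
    """
    return min(
        log_likelihoods,
        key=lambda k: bic(log_likelihoods[k], n_free_parameters(k, n_features), n_samples),
    )


# Invented log likelihoods for 2, 3, and 4 states.
print(best_n_states_by_bic({2: -500.0, 3: -420.0, 4: -415.0}, n_features=4, n_samples=1000))
# → 3
```

In this toy case LL keeps improving from 3 to 4 states, but the gain is too small to offset the BIC penalty for the extra parameters, which mirrors the plateau behavior described above.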

To score each state generated by the HMM, the pipeline uses indicator data that reflects whether the reaction is more or less likely to occur at each time point. If the indicator distance can be measured from a user-designated pair or pairs of atoms in the MD trajectory, the frames labeled with the state that has the smallest average distance are selected for downstream analysis.
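
The state-scoring step can be illustrated as follows. The sketch assumes one indicator distance per frame (for several atom pairs, the distances would first need to be combined, for example by averaging, which is an assumption here), and the function name is hypothetical.

```python
from collections import defaultdict


def best_state_by_indicator(states, indicator_distances):
    """Return the state with the smallest mean indicator distance and its frames.

    states: hidden-state label per frame (e.g. from Viterbi decoding)
    indicator_distances: one indicator distance per frame
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, dist in zip(states, indicator_distances):
        totals[state] += dist
        counts[state] += 1
    mean_by_state = {state: totals[state] / counts[state] for state in totals}
    best = min(mean_by_state, key=mean_by_state.get)
    frames = [i for i, state in enumerate(states) if state == best]
    return best, frames


# Toy labels and distances.
best, frames = best_state_by_indicator([0, 0, 1, 1, 2], [4.0, 4.2, 2.1, 1.9, 3.0])
print(best, frames)
# → 1 [2, 3]
```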

If no indicator atom pairs are provided, we assume a high probability of catalytic reaction at stable conformations of the enzyme-substrate structure, and the Root Mean Square Deviation (RMSD) of each frame from the stable conformation is used as the indicator data. To find the stable conformations, Principal Component Analysis (PCA) and clustering are conducted on the trajectory coordinate data. K-means is used to cluster the dimension-reduced data points, and the frame closest to the center of the cluster is chosen as the most stable conformation.
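
The last two steps of this fallback, choosing the frame closest to a cluster center and computing RMSD against it, can be sketched without any external library. The sketch skips PCA and K-means themselves: it assumes the points passed in are already dimension-reduced and belong to a single cluster, and that frames are already superimposed, since no alignment is performed. All names and numbers are illustrative.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def closest_to_centroid(points):
    """Index of the point nearest to the centroid of all points."""
    center = centroid(points)
    return min(
        range(len(points)),
        key=lambda i: sum((x - c) ** 2 for x, c in zip(points[i], center)),
    )


def rmsd(frame_a, frame_b):
    """RMSD between two frames given as equal-length lists of (x, y, z) atoms.

    No superposition is done; frames are assumed to be pre-aligned.
    """
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(frame_a, frame_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(frame_a))


# Toy data: reduced coordinates of four frames in one cluster.
reduced = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (10.0, 0.0)]
reference_index = closest_to_centroid(reduced)
print(reference_index)
# → 2

# Toy two-atom frames; each frame's RMSD to a reference frame would serve as the indicator value.
frame_ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame_other = [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)]
print(rmsd(frame_ref, frame_other))
```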

The final AA suggestion step is based on each feature's distance distribution, as described above.

When current residue is polar and
  if there was a hydrogen-bond donor atom included in the 3 nearby substrate atoms
  => suggest either polar/charged amino acids whose length is within bound
  if there was no hydrogen-bond donor atom included in the 3 nearby substrate atoms
  => suggest any amino acids whose length is within bound

When current residue is non-polar and
  if there was a hydrogen-bond donor atom included in the 3 nearby substrate atoms
  => suggest either polar/charged amino acids whose length is within bound
  if there was no hydrogen-bond donor atom included in the 3 nearby substrate atoms
  => suggest non-polar amino acids whose length is within bound

bound = 1 standard deviation from mean
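
The rules above can be written as a small function. This is a sketch under several assumptions that are not confirmed by the project: the polar/charged versus non-polar grouping below is one common textbook classification, any residue not in the non-polar set is treated as polar, "within bound" is interpreted as the AA length falling between mean - 1 SD and mean + 1 SD of the relevant distance, and the AA lengths in the example are placeholder values rather than the pipeline's own table.

```python
# One common grouping of the 20 amino acids (an assumption, not the pipeline's table).
POLAR_OR_CHARGED = {"S", "T", "N", "Q", "Y", "C", "H", "D", "E", "K", "R"}
NON_POLAR = {"A", "V", "L", "I", "M", "F", "W", "P", "G"}


def suggest_residues(current, donor_nearby, mean_distance, std_distance, lengths):
    """Apply the suggestion rules to one active-site position.

    current: one-letter code of the current residue
    donor_nearby: True if a hydrogen-bond donor is among the 3 nearby substrate atoms
    mean_distance, std_distance: distance statistics from the favorable state
    lengths: mapping of one-letter code to AA length (hypothetical values)
    """
    lower = mean_distance - std_distance
    upper = mean_distance + std_distance
    in_bound = {aa for aa, length in lengths.items() if lower <= length <= upper}

    if donor_nearby:
        allowed = POLAR_OR_CHARGED               # same rule for polar and non-polar residues
    elif current in NON_POLAR:
        allowed = NON_POLAR                      # non-polar residue, no donor nearby
    else:
        allowed = POLAR_OR_CHARGED | NON_POLAR   # polar residue, no donor nearby: any AA
    return sorted(in_bound & allowed)


# Placeholder lengths for illustration only.
lengths = {"S": 2.4, "K": 6.3, "L": 3.9, "A": 1.5, "D": 3.6}
print(suggest_residues("A", donor_nearby=False, mean_distance=3.5, std_distance=0.5, lengths=lengths))
# → ['L']
print(suggest_residues("A", donor_nearby=True, mean_distance=3.5, std_distance=0.5, lengths=lengths))
# → ['D']
```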

Notes

The hmm_guided_mutation script loads MD trajectory data and can therefore use a large amount of RAM.

License

Project is available under the MIT license found in the LICENSE file.