
BIOINFORMATICIAN

AI/ML Research Scientist specializing in applying computational methods to analyze, model, and make predictions in medical research and protein structure domains. Experienced in developing and deploying bioinformatics pipelines for large-scale datasets, including next-generation sequencing data. Proficient in Python and Linux/Unix, with expertise in experimental design, data interpretation, and machine learning applications.



TECHNICAL SKILLS

Programming

Python, Bash, Go, Java

Packages

Hail, PyMOL, pandas, NumPy, PyTorch,
Matplotlib, scikit-learn

Tools

Git, GCP, Linux (Ubuntu), Conda,
Amber Molecular Dynamics package



EDUCATION


Master of Science in Computational Biology
Carnegie Mellon University
2021 - 2023

Bachelor of Arts in Biology
Grinnell College
2012 - 2018



RESEARCH


Substance Abuse Prediction in Depression 
Patients using Deep Learning
Mar 2024 - Ongoing

Project Advisor: Dr. LiRong Wang
University of Pittsburgh, Pittsburgh, PA

Designed and implemented a Python pipeline for processing Electronic Health Record (EHR) data, adapting
a BERT transformer to use EHR codes for a single-disease prediction task on the NIH All of Us platform.

Predicted risk of alcohol and substance abuse in at-risk patients with a ROC AUC of 0.95 and accuracy of 92%, surpassing the baseline logistic regression model by 0.13 in ROC AUC and 5% in accuracy.


Hidden Markov Model-Guided Predictive
Enzyme Mutagenesis
Oct 2023 - Jan 2024

Project Advisor: Dr. Peng Liu
University of Pittsburgh, Pittsburgh, PA

Utilized the Amber package to run protein-ligand Molecular Dynamics (MD) simulations and extract time-series atomic-distance data from the MD trajectories.

Implemented a Python pipeline that identifies potential key active-site positions for mutation using a Hidden Markov Model (HMM), improving enzyme-substrate binding specificity based on the atomic-distance data.


Preterm Infant Growth Trajectory Prediction
using Microbiome
May 2022 - Aug 2022

Project Advisors: Ziv Bar-Joseph, Jose Lugo-Martinez
CMU, Pittsburgh, PA

Developed an HMM classifier in Python from scratch to predict growth trajectories (normal vs. faltering) in preterm infants using gut microbiome data.

Achieved a moderate AUC-ROC of 0.67; the analysis provided insight into the limitations of HMMs on sparse microbiome datasets, guiding the research toward a more suitable model, the Input-Output HMM, to improve prediction.





COURSEWORK PROJECTS


Small Molecule Anti-Bacterial Potency
Screening using Active Learning
Feb 2023 - May 2023

Implemented the Expected Model Change (EMC) active learning method with logistic regression as the base model in a Python script that can be applied to general binary classification tasks on tabular datasets.

Using the EMC method, achieved 71% accuracy in predicting the antibacterial effect of small molecules with 40% fewer samples than random sampling needed to reach the same accuracy, demonstrating the method's efficiency in reducing training sample size while effectively identifying informative samples.


Motif Search in Glioblastoma
Protein-Protein Interaction Network
Feb 2023 - May 2023

Developed a Python pipeline for identifying motifs in glioblastoma protein-protein interaction (PPI) networks, employing the NetworkX package for network generation and a sub-graph isomorphism motif-search algorithm.

Designed a comparative analysis of PPI motifs between random and glioblastoma networks, revealing significantly different motif distributions and predicting motif functions through gene/protein ontology analysis.


Expected Model Change Binary Classification




Expected Model Change (EMC) using Logistic Regression as the base model


EMC is an active learning technique that chooses, from a pool of data, the sample expected to bring the highest performance increase. The core concept behind EMC is that to reduce the model's generalization error, the model must change, so the sample from the data pool that changes the model the most is added at each update step. The model change is approximated by calculating the norm of the gradient of the loss function with respect to the model parameters at the candidate sample.

In supervised learning, the gradient cannot be calculated without the label. To handle queried samples that have no label, the gradient is computed under every possible label value, and the resulting norms are summed (an expectation over the labels), hence the name expected model change.
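As a concrete illustration, here is a minimal sketch (not the repository's actual code) of the EMC score for binary logistic regression, where the gradient of the log-loss with respect to the weights at a sample (x, y) is (p - y) * x, with p = P(y = 1 | x):

```python
import numpy as np

def emc_score(model, x):
    """Expected Model Change score for one unlabeled sample x (1-D array).

    For binary logistic regression with log-loss, the gradient w.r.t. the
    weights at (x, y) is (p - y) * x, where p = P(y = 1 | x). The true
    label is unknown, so the gradient norm is computed under each
    candidate label and the two norms are summed.
    """
    p = model.predict_proba(x.reshape(1, -1))[0, 1]  # fitted sklearn model
    return sum(np.linalg.norm((p - y) * x) for y in (0, 1))
```

At each update step, the pool sample with the largest score is the one queried.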

This repository implements EMC for a binary classification task using a logistic regression model.

As with other active learning methods, the purpose of EMC is to cherry-pick a set of samples that can yield high model performance without using the whole available dataset. By logging the performance and the ID of the chosen sample at a user-designated logging interval, users can analyze the samples that positively impacted performance.

EMC is prone to outliers because they tend to bring a large change to the gradient: they are likely to have a large error relative to the model's predictions. Noisy data should therefore be normalized and preprocessed before using EMC. Another drawback is that EMC is computationally intensive: at every sample-update step, each sample in the data pool must in turn be added to the model, fitted, and have its gradient calculated to find the sample that changes the gradient the most.

In this repo, EMC is implemented using sklearn's Logistic Regression (LR), and the model only takes binary-labeled data as input. Once the data is split into test and train sets, the train set is split into two: one split for initially training the model and the other for creating the data pool from which samples are selected by the EMC method. A baseline model that selects samples randomly from the data pool is implemented for performance comparison. As model performance metrics, accuracy, precision, and recall can be logged to a user-designated directory at a user-designated log interval.
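A minimal sketch of that update loop is below; the function and variable names are illustrative assumptions rather than the repository's API, and it reuses the emc_score sketch above with sklearn's LogisticRegression:

```python
import random
from sklearn.linear_model import LogisticRegression

def run_active_learning(X_init, y_init, X_pool, y_pool, X_test, y_test,
                        n_steps=100, log_every=10, baseline=False):
    """Move one pool sample into the training set at each update step."""
    X_train, y_train = list(X_init), list(y_init)
    pool = list(range(len(X_pool)))
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for step in range(n_steps):
        if baseline:
            i = random.choice(pool)  # random-sampling baseline for comparison
        else:
            # Scoring every remaining pool sample at every step is the
            # computational cost discussed above.
            i = max(pool, key=lambda j: emc_score(model, X_pool[j]))
        pool.remove(i)
        X_train.append(X_pool[i])
        y_train.append(y_pool[i])
        model = model.fit(X_train, y_train)
        if step % log_every == 0:  # metric logging at the chosen interval
            print(step, i, model.score(X_test, y_test))
    return model
```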

Example config
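The original config listing did not survive formatting; the sketch below is a hypothetical stand-in, and every key except train_minor_label_oversample_ratio (described below) is an assumption rather than the repository's actual schema:

```python
# Hypothetical config sketch; key names other than
# train_minor_label_oversample_ratio are illustrative assumptions.
config = {
    "data_path": "data/input.csv",               # assumed input location
    "log_dir": "logs/emc_run/",                  # user-designated metrics directory
    "log_interval": 10,                          # log accuracy/precision/recall every N steps
    "n_update_steps": 200,                       # number of EMC sample-selection steps
    "train_minor_label_oversample_ratio": 1.0,   # see the oversampling note below
}
```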


To address the label-imbalance problem, an oversampling function is implemented. To avoid data leakage from the training set into the test set, oversampling is done only on the training set, after the label-balanced test set has been split off. Users can designate the exact number or ratio of minority-label training samples to oversample through the "train_minor_label_oversample_ratio" parameter of the test_val_train_split_w_oversampling() method in emc.py.
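A minimal sketch of that ordering (split first, then oversample only the training split) is shown below; it assumes integer 0/1 labels, uses a stratified split as a stand-in for the repository's label-balanced test split, and its names are illustrative rather than the exact signature of test_val_train_split_w_oversampling():

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def split_then_oversample(X, y, oversample_ratio=1.0, test_size=0.2, seed=0):
    """Split first, then duplicate minority-label samples in the training
    set only, so oversampled copies can never leak into the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    minor = np.argmin(np.bincount(y_train))        # the rarer of the two labels
    X_min = X_train[y_train == minor]
    y_min = y_train[y_train == minor]
    n_extra = int(oversample_ratio * len(y_min))   # how many duplicates to add
    X_dup, y_dup = resample(X_min, y_min, n_samples=n_extra, random_state=seed)
    X_bal = np.vstack([X_train, X_dup])
    y_bal = np.concatenate([y_train, y_dup])
    return X_bal, y_bal, X_test, y_test
```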

Dependencies






Comparison of EMC & random sampling metric plots for 2 tasks


  1. Stroke prediction using health record data

  2. Anti-bacterial small molecule screening with molecular fingerprint data



The figure above shows the change in the accuracy, precision, and recall of the model at each update step (x-axis); at each step, a sample is picked from the data pool and added to the LR model's training set.

These results are based on models trained on datasets balanced by undersampling the majority label in the training set, which reduces run time.

We can see from the drug classification that EMC achieves higher performance than the baseline method with a smaller number of samples (update steps), allowing us to identify the samples that are potentially more informative in drawing the decision boundary. From the stroke classification, we observe a tug-of-war between precision and recall: as precision goes up initially, recall goes down, and vice versa, as a new decision boundary is drawn with each added sample. This possibly indicates that the stroke dataset requires more variance to classify samples more precisely.




Performance comparison to no-sampling LR model