
BIOINFORMATICIAN

AI/ML Research Scientist specializing in applying computational methods to analysis, modeling, and prediction in medical research and protein structure. Experienced in developing and deploying bioinformatics pipelines for large-scale datasets, including next-generation sequencing data. Proficient in Python and Linux/Unix, with expertise in experimental design, data interpretation, and machine learning applications.



TECHNICAL SKILLS

Programming

Python, Bash, Go, Java

Packages

Hail, PyMOL, pandas, NumPy, PyTorch,
Matplotlib, scikit-learn

Tools

Git, GCP, Linux (Ubuntu), Conda,
Amber Molecular Dynamics package



EDUCATION


Master of Science in Computational Biology
Carnegie Mellon University
2021 - 2023

Bachelor of Arts in Biology
Grinnell College
2012 - 2018



RESEARCH


Substance Abuse Prediction in Depression 
Patients using Deep Learning
Mar 2024 - Ongoing

Project Advisor: Dr. LiRong Wang
University of Pittsburgh, Pittsburgh, PA

Designed and implemented a Python pipeline for processing Electronic Health Record (EHR) data, adapting
a BERT transformer to use EHR codes for a single-disease prediction task on the NIH All of Us platform.

Predicted risk of alcohol and substance abuse in at-risk patients with a ROC AUC of 0.95 and accuracy of 92%, surpassing the baseline logistic regression model by 0.13 in ROC AUC and 5% in accuracy.


Hidden Markov Model Guided Predictive
Enzyme Mutagenesis
Oct 2023 - Jan 2024

Project Advisor: Dr. Peng Liu
University of Pittsburgh, Pittsburgh, PA

Utilized the Amber package to run protein-ligand Molecular Dynamics (MD) simulations and extract time-series atomic-distance data from MD trajectories.

Implemented a Python pipeline that identifies potential key positions for mutation at the active site using a Hidden Markov Model (HMM) on the atomic-distance data, aiming to improve enzyme-substrate binding specificity.


Preterm Infant Growth Trajectory Prediction
using Microbiome
May 2022 - Aug 2022

Project Advisors: Ziv Bar-Joseph, Jose Lugo-Martinez
Carnegie Mellon University, Pittsburgh, PA

Developed an HMM classifier in Python from scratch to predict growth trajectories (normal vs. faltering) in preterm infants using gut microbiome data, demonstrating proficiency in Python and machine learning.

Although the model achieved a moderate AUC-ROC of 0.67, the analysis provided insights into the limitations of HMMs on sparse microbiome datasets, guiding the research toward a more suitable model, the Input-Output HMM, to improve prediction.





COURSEWORK PROJECTS


Small Molecule Anti-Bacterial Potency
Screening using Active Learning
Feb 2023 - May 2023

Implemented the Expected Model Change (EMC) active learning method, using logistic regression as the base model, in a Python script that can be applied to general binary classification tasks on tabular datasets.

Using the EMC method, achieved 71% accuracy in predicting the antibacterial effect of small molecules with 40% fewer samples than random sampling required to reach the same accuracy, demonstrating the method's efficiency in reducing training-set size while identifying informative samples.


Motif Search in Glioblastoma
Protein-Protein Interaction Network
Feb 2023 - May 2023

Developed a Python pipeline for identifying motifs in glioblastoma protein-protein interaction networks, employing the NetworkX package for network generation and a subgraph-isomorphism motif-search algorithm.

Designed a comparative analysis of PPI motifs between random and glioblastoma networks, revealing significantly different motif distributions and predicting motif functions through gene/protein ontology analysis.


HMM GUIDED MUTATION


Python 3.7.12, conda 23.11.0



Link to Repository

Hidden Markov Model Guided Amino Acid Mutagenesis

Given an enzyme-substrate complex molecular dynamics trajectory and parameter files as input, the pipeline outputs amino acid substitution suggestions.


Running Python Script



Output File System



  • Figures related to RMSD indicator data are generated only when HAS_INDICATOR is False.



Overview

This project began with the idea of generating additional information about the enzyme-substrate complex that can be used to choose beneficial mutation positions, since methods such as directed evolution choose protein mutation sites randomly.

The aim is to exploit molecular dynamics (MD) data of the enzyme-substrate complex and locate amino acid (AA) positions in the active site whose mutation could increase binding affinity and specificity. Once the positions are located, alternate AAs are suggested. To achieve this, a Hidden Markov Model (HMM) is used to identify optimal positions of the substrate in the active site during the MD simulation, and substitutions are proposed based on the property and length of the AA.

Pipeline Steps
  1. From MD trajectory data, extract distance measurements between active site AA & substrate atoms.
  2. Using the distance data, train Gaussian Emission HMM and label each frame/timepoint of test data with a hidden state.
  3. Score each hidden state using indicator data and get emission probabilities (means and standard deviations of features for all states) of the highest scoring hidden state.
  4. Suggest AA substitution based on AA properties using the emission probabilities of the high scoring state.
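
A minimal sketch of steps 2-4, assuming the distance features have already been extracted into a frames x features NumPy array (hypothetical file name; the feature-extraction sketch appears further below) and using the hmmlearn package rather than the pipeline's own HMM code:

import numpy as np
from hmmlearn.hmm import GaussianHMM

# Hypothetical distance-feature matrix: rows = MD frames, columns = distance features
distances = np.load("distance_features.npy")         # shape (n_frames, n_features)

# Step 2: fit a Gaussian-emission HMM to the full time series
model = GaussianHMM(n_components=6, covariance_type="diag", n_iter=200)
model.fit(distances)

# Label frames sampled at an even interval with their Viterbi hidden state
test_frames = distances[::10]
states = model.predict(test_frames)

# Step 3: emission parameters (per-feature mean / standard deviation) of a chosen state
best_state = 0                                        # would be chosen by indicator scoring
state_means = model.means_[best_state]
state_stds = np.sqrt(np.diagonal(model.covars_[best_state]))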



Input: parameters.yml

The path to parameters.yml is passed to main.py using the "-p" or "-parameter" flag. The input parameters in the parameters.yml file are described below.
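
An illustrative invocation, assuming main.py sits in the repository root:

python main.py -p ./parameters.yml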






Output: final_output.csv

  1. Current Residue: the original residue at the given residue index of FAP
  2. Nearby 3 substrate atoms: the three substrate atoms closest to the current residue, given as a Python dictionary => atom index: [atom name, residue_alphacarbon_to_atom_mean_distance, standard_deviation]
  3. AA Suggestion and Length: the suggested amino acid and the suggested amino acid's length



Pipeline Description

An HMM is used to model the relationship between the active site and the substrate using distance measurements between them from the simulation. Time-series distance data extracted from the MD trajectory is used to train a Gaussian-emission HMM.

The features of the data consist of (a) the distances from the alpha-carbon of each AA found near the substrate to each substrate atom and (b) the distances from side-chain hydrogen-bond donor/acceptor atoms to substrate hydrogen-bond donor/acceptor atoms. Test data is sampled at an even interval from the beginning to the end of the MD trajectory in order to survey the entire trajectory.
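
As an illustration of this feature extraction (the pipeline itself reads Amber trajectory and parameter files; the file names and atom selections below are hypothetical, and MDAnalysis is used here only as an example library), a sketch producing the distance_features.npy array used in the HMM sketch above:

import numpy as np
import MDAnalysis as mda

u = mda.Universe("complex.prmtop", "production.nc")                 # hypothetical Amber files
active_site_ca = u.select_atoms("resid 100 150 203 and name CA")    # alpha-carbons near the substrate
substrate = u.select_atoms("resname SUB")                           # hypothetical substrate residue name

features = []
for ts in u.trajectory:
    # all alpha-carbon-to-substrate-atom distances for this frame, flattened into one feature row
    diff = active_site_ca.positions[:, None, :] - substrate.positions[None, :, :]
    features.append(np.linalg.norm(diff, axis=-1).ravel())

features = np.asarray(features)      # shape (n_frames, n_CA * n_substrate_atoms)
np.save("distance_features.npy", features)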

Viterbi decoding is then used on the test data to label each time point with a hidden state, grouping the time points according to which normal distributions all the feature values are estimated to have come from at that time point. During catalytic reactions at the active site, some AAs hold the substrate in place for the reaction to begin, while others stabilize the transition state of the substrate and directly participate in catalysis. By inspecting each feature's distance distribution over the frames labeled with a favorable state, meaning a state with a high probability of reaction, it is possible to recognize active-site positions whose AA is either too short or too long. Such AAs are potential candidates for substitution.

The number of states, a hyperparameter of the HMM, can either be designated or searched using the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the log likelihood of the HMM's joint probability. In the writer's experience, the log likelihood (LL) steadily rose and then plateaued as the number of states was incremented, so it is advised to search over various numbers of states and inspect the AIC, BIC, and LL graph generated by the pipeline to find the optimal number of states.
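
A minimal version of such a search, again assuming hmmlearn and the same hypothetical feature matrix as above:

import numpy as np
from hmmlearn.hmm import GaussianHMM

distances = np.load("distance_features.npy")
n_frames, n_features = distances.shape

for n_states in range(2, 11):
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=200)
    model.fit(distances)
    ll = model.score(distances)                       # log likelihood of the data
    # rough parameter count: transitions + start probabilities + means + diagonal variances
    k = n_states * (n_states - 1) + (n_states - 1) + 2 * n_states * n_features
    aic = 2 * k - 2 * ll
    bic = k * np.log(n_frames) - 2 * ll
    print(f"{n_states} states: LL={ll:.1f}  AIC={aic:.1f}  BIC={bic:.1f}")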

To score each state generated by the HMM, indicator data reflecting whether the reaction is more or less likely to occur at each time point is used. If indicator distance data can be measured from a user-designated pair or pairs of atoms in the MD trajectory, the frames labeled with the state having the smallest average indicator distance are selected for downstream analysis.

If no indicator atom pairs are provided, we assume a high probability of catalytic reaction at stable conformations of the enzyme-substrate structure, and the Root Mean Square Deviation (RMSD), showing how close or distant each frame is to the stable conformation, is used as the indicator data. To find the stable conformations, Principal Component Analysis (PCA) and clustering are conducted on the trajectory coordinate data: K-means clusters the dimension-reduced data points, and the frame closest to the center of each cluster is chosen as the most stable conformation.
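
A sketch of this fallback, assuming the per-frame coordinates are already flattened into a (frames x 3N) array (file name and cluster counts hypothetical):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

coords = np.load("trajectory_coords.npy")         # hypothetical (n_frames, 3 * n_atoms) array

# Reduce dimensionality, then cluster frames into candidate conformations
reduced = PCA(n_components=5).fit_transform(coords)
kmeans = KMeans(n_clusters=4, n_init=10).fit(reduced)

# For each cluster, pick the frame closest to its center as the "stable" conformation
stable_frames = []
for c, center in enumerate(kmeans.cluster_centers_):
    members = np.where(kmeans.labels_ == c)[0]
    closest = members[np.argmin(np.linalg.norm(reduced[members] - center, axis=1))]
    stable_frames.append(int(closest))
print("representative frames:", stable_frames)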

The final AA suggestion step is based on each feature's distance distribution described above.

When the current residue is polar:
  • if a hydrogen-bond donor atom is among the 3 nearby substrate atoms
    => suggest polar or charged amino acids whose length is within bound
  • if no hydrogen-bond donor atom is among the 3 nearby substrate atoms
    => suggest any amino acid whose length is within bound

When the current residue is non-polar:
  • if a hydrogen-bond donor atom is among the 3 nearby substrate atoms
    => suggest polar or charged amino acids whose length is within bound
  • if no hydrogen-bond donor atom is among the 3 nearby substrate atoms
    => suggest non-polar amino acids whose length is within bound

  • bound = 1 standard deviation from the mean
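
The branching above can be condensed into a short sketch; the residue groupings, the aa_lengths lookup, and the nearby_atoms format are illustrative, not the pipeline's actual identifiers:

POLAR_OR_CHARGED = {"SER", "THR", "ASN", "GLN", "TYR", "CYS",
                    "ASP", "GLU", "LYS", "ARG", "HIS"}
NON_POLAR = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO", "GLY"}

def suggest_candidates(current_residue, nearby_atoms, mean_dist, std_dist, aa_lengths):
    """Return candidate AAs whose side-chain length lies within one standard
    deviation of the mean alpha-carbon-to-substrate distance (the "bound")."""
    lower, upper = mean_dist - std_dist, mean_dist + std_dist
    has_donor = any(atom["is_hbond_donor"] for atom in nearby_atoms)

    if has_donor:
        pool = POLAR_OR_CHARGED                   # favor residues that can hydrogen bond
    elif current_residue in POLAR_OR_CHARGED:
        pool = POLAR_OR_CHARGED | NON_POLAR       # any residue within bound is acceptable
    else:
        pool = NON_POLAR                          # keep the position hydrophobic
    return [aa for aa in pool if lower <= aa_lengths[aa] <= upper]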



Notes

  • The hmm_guided_mutation script loads MD trajectory data and therefore can use a lot of RAM.


License

  • Project is available under the MIT license found in the LICENSE file.





Expected Model Change Binary Classification




Expected Model Change (EMC) using Logistic Regression as base model


EMC is an active learning technique for choosing, from a pool of data, the sample that will bring the highest performance increase. The core idea behind EMC is that in order to reduce the model's generalization error, the model must change, so the sample from the data pool that changes the model the most is added at each update step. The model change is approximated by the norm of the gradient of the loss function with respect to the model parameters, evaluated at the selected sample.

In supervised learning, the gradient cannot be calculated without the label. To handle unlabeled samples queried from the pool, the gradient norm is calculated for every possible label value and the results are summed, hence the name expected model change.

This repository implements EMC for a binary classification task using logistic regression model.

As with other active learning methods, the purpose of EMC is to cherry-pick a set of samples that yields high model performance without using the whole available dataset. By logging the performance and the sample id chosen at a user-designated logging step, users can analyze the samples that positively impacted performance.

EMC is prone to outliers in the dataset because outliers tend to bring a large change to the gradient, as they are likely to have a large error relative to the model's predictions. So it is recommended that noisy data be normalized and preprocessed before using EMC. Another drawback is that EMC is computationally intensive: at every sample-update step, every sample in the data pool must be iterated over, added to the model, fitted, and its gradient calculated in order to find the sample that changes the gradient the most.

In this repo, EMC is implemented using sklearn's Logistic Regression (LR), and the model only takes binary-labeled data as input. Once the data is split into test and train sets, the train dataset is split into two: one split is used to initially train the model, and the other creates a data pool from which samples are selected through the EMC method. A baseline model that selects samples randomly from the data pool is implemented for performance comparison. As model performance metrics, accuracy, precision, and recall can be logged to a user-designated directory at a user-designated log interval.
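
A condensed sketch of one EMC update step is shown below; variable names and toy data are hypothetical, and the expectation here weights each possible label by the model's predicted probability, which may differ from the exact formulation in emc.py:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def expected_model_change(model, X_pool):
    """Score each pooled sample by its expected gradient norm under the current model."""
    proba = model.predict_proba(X_pool)                # columns: P(y=0|x), P(y=1|x)
    scores = np.zeros(len(X_pool))
    for y in (0, 1):
        residual = np.abs(proba[:, 1] - y)             # |sigmoid(w.x) - y|
        grad_norm = residual * np.linalg.norm(X_pool, axis=1)
        scores += proba[:, y] * grad_norm              # expectation over the two possible labels
    return scores

# Toy data standing in for the real tabular dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, y_train, X_pool, y_pool = X[:20], y[:20], X[20:], y[20:]

model = LogisticRegression().fit(X_train, y_train)
best = int(np.argmax(expected_model_change(model, X_pool)))    # most model-changing sample
X_train = np.vstack([X_train, X_pool[best]])
y_train = np.append(y_train, y_pool[best])                     # query the oracle for its true label
X_pool, y_pool = np.delete(X_pool, best, axis=0), np.delete(y_pool, best)
model.fit(X_train, y_train)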

Example config


To address the label-imbalance problem, an oversampling function is implemented. To avoid data leakage from the training set into the test set, oversampling is done only on the training set after the label-balanced test set has been split off. Users can designate the exact number or the ratio of minority-label training samples to oversample through the "train_minor_label_oversample_ratio" parameter of the test_val_train_split_w_oversampling() method in the emc.py file.
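
For illustration, oversampling restricted to the training split might look like this (a sketch with toy data; the repository exposes this behavior through test_val_train_split_w_oversampling()):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced toy dataset standing in for the real data
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Oversample the minority label only inside the training split,
# after the test set has been held out, to avoid leakage
minority = int(np.argmin(np.bincount(y_train)))
X_min = X_train[y_train == minority]
n_extra = len(X_min) // 2                               # e.g. an oversample ratio of 0.5
X_extra = resample(X_min, replace=True, n_samples=n_extra, random_state=0)
X_train = np.vstack([X_train, X_extra])
y_train = np.concatenate([y_train, np.full(n_extra, minority)])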

Dependencies






Comparison of EMC & random sampling metric plots for 2 tasks


  1. Stroke prediction using health record data

  2. Anti-bacterial small molecule screening with molecular fingerprint data



The figure above shows the change in the accuracy, precision and recall of the model at each update step (x-axis); at each step, a sample is picked from the data pool and added to the LR model.

These results are based on models trained on datasets balanced by undersampling the majority-label training data to reduce run time.

We can see from the drug classification that EMC achieves higher performance than the baseline method with a smaller number of samples (update steps), allowing us to identify the samples that are potentially more informative in drawing the decision boundary. From the stroke classification, we observe a tug-of-war between precision and recall: as precision goes up initially, recall goes down, and vice versa as a new decision boundary is drawn with each added sample. This possibly indicates that the stroke dataset requires more variance before samples can be classified precisely.




Performance comparison to no-sampling LR model








INFANT MICROBIOME HMM

Code designed to run on Linux
Python 3.7
Ubuntu 22.04.3 LTS


Link to Repository

Background & Purpose


This project is a Hidden Markov Model (HMM) binary classifier that predicts whether preterm infants (prematurely born babies) are growth normal (GN) or growth faltering (GF) using the infants' gut microbiome community data collected over time. An HMM is trained on the microbiome data of GN and GF infants separately, producing two HMM outputs. The two outputs are then used to calculate the probability of observations, P(O), for test samples, classifying each sample as GN or GF based on which of the two models produces the higher P(O) value.

HMM is suitable for modeling sequential or time-series data, and, in this project, infant microbiome data was collected at multiple timepoints, labeled using the postmenstrual age (PMA) of the infants. Because the samples were collected at varying timepoints for each infant, a set of 10 PMA timepoints, each 7 days apart, was aligned across all samples as closely as possible. As a result, some samples have missing timepoint data. However, HMM allows learning a model even when there are missing values by taking into account all possible assignments of the hidden state, weighted by their probability, making HMM a suitable model for handling clinical data with missing values.

The hidden state for each timepoint sample is assigned using MicrobeDMM - Software for fitting Dirichlet multinomial mixtures (DMM) to microbial communities. DMM clustering is a probabilistic method for community detection in microbial samples.
  • Users can run infant_microbiome_hmm.py to simply run HMM without dividing the data into two sets based on given labels.

  • Users can run infant_microbiome_two_hmm.py if they have label data that classifies each infant as either GN or GF. This script runs DMM to assign states to all samples in the combined data, then divides the data into GN and GF sets and runs HMM separately on the two sets. The HMM outputs are saved in separate folders, GN first, then GF.

  • DMM outputs will be saved in dmm_outputs/dmm_out

  • HMM outputs will be saved in hmm_outputs/hmm_out
    • If the code is run multiple times, the output of each run is saved in a different directory with a different index

  • Sample microbiome abundance data is given at "infant_microbiome_hmm/data/otu_table_sample.csv"

  • Sample label is given at "infant_microbiome_hmm/data/class_labels.tsv"

  • No test data is provided.


DMM Dependencies




Output File System


Input Data

  • Input data is an Operational Taxonomic Unit (OTU) table CSV file.

  • The OTU table in this project contains abundances of bacterial taxa (data type = float)

  • The first column of the OTU table should be the "Taxa" column, which contains the names of bacterial species (data type = string)


Steps to use the code

Link to Clone Code

git clone https://github.com/iland24/infant_microbiome_hmm
cd ./infant_microbiome_hmm/

Run initiate_hmm.py

  • Downloads DMM code zip file
  • Extracts DMM zip file
  • Saves parameters.txt in the current working directory (which should be infant_microbiome_hmm). The parameters.txt file must be read by either infant_microbiome_hmm.py or infant_microbiome_two_hmm.py to run DMM and HMM

 sample command:
python3 initiate_hmm.py
  • After running initiate_hmm.py, the user can fill in the parameters that will be used by the scripts that run DMM and HMM
    • The user must enter the path to the input data in the parameters.txt file before running the next script
  • Default parameter values are set for the provided sample data



Run single HMM (a) or two HMM (b)

Run the single-HMM script to train an HMM using the microbiome abundance time-series training data in the infant_microbiome_hmm/data directory.

Run the two-HMM script if there is a label file in the infant_microbiome_hmm/data directory along with the training data. The two-HMM script outputs a pair of HMM outputs, each of which can be used to calculate the P(O) of test data using the next script.
  • The format of the training data and the label file must match the given sample to run these scripts.

(a) infant_microbiome_hmm.py (single HMM output script)

  • Preprocesses data
    1. Select rows (taxa) with highest variance (user can define the fraction of taxa with highest variance)
    2. Select 10 timepoint samples of each infant based on the start PMA timepoint
    3. Drop infants with more than 5 missing timepoint samples
  • Runs DMM code to assign states to each sample
  • Combines states outputted by DMM code with the infant microbiome abundance time-series training data
  • Runs HMM
  • Input:
    • parameters.txt
  • Output:
    • dmm_outputs/dmm_out
    • hmm_outputs/hmm_out
      • HMM output from given dataset

 sample command:
python3 infant_microbiome_hmm.py -p ./parameters.txt

(b) infant_microbiome_two_hmm.py (two HMM output script)

    • Preprocesses data
      1. Select rows (taxa) with highest variance (User can define fraction of the taxa with highest variance)
      2. Select 10 timepoint samples of each infant based on start PMA timepoint
      3. Drop infants with more than 5 missing timepoint samples
    • Runs DMM code to assign states to each sample
    • Combines state outputted by DMM code with infant gut microbiome data
    • Separates data into GN and GF sets based on given label data
    • Runs HMM on GN data set
    • Runs HMM on GF data set
  • Input:
    • parameters.txt
    • class_labels.tsv
      • must be a tsv file, not csv
  • Output:
    • dmm_outputs/dmm_out
    • hmm_outputs/hmm_out_1
      • HMM output from GN dataset
    • hmm_outputs/hmm_out_2
      • HMM output from GF dataset

 sample command:
python3 infant_microbiome_two_hmm.py -p ./parameters.txt -l ./data/class_labels.tsv

output_p_of_o.py:

  • Preprocesses the test OTU_table.csv file the same way the input data was preprocessed in infant_microbiome_hmm.py / infant_microbiome_two_hmm.py.
    • This script does not select rows based on variance. Specific rows (taxa) must be preselected by the user before feeding the test data to this script.
  • Calculates P(O) for all the infants in the test data
  • Saves the infant IDs and corresponding P(O) values as a csv file
  • Input:
    • file path to microbiome abundance timeseries test data
    • directory path to hmm_out
    • path to save output
    • number of states used to run DMM and HMM
    • PMA start date used to run DMM and HMM
  • Output:
    • csv file with infant ids as columns and the P(O) (probability of observing the data) of each infant as the first row.
      • output_p_of_o.py should be run once with the GN HMM output and once with the GF HMM output (see the sketch after the sample command below)
      • Each infant's two P(O) values should then be compared to predict whether the infant will be GN or GF

 sample command:
python3 output_p_of_o.py -f data/otu_table_test_data.csv -m ./hmm_outputs/hmm_out_1 -o ./prediction.csv -s 6 -p 196
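
A small sketch of the final comparison step, assuming output_p_of_o.py has been run twice and its outputs saved under the hypothetical file names prediction_gn.csv and prediction_gf.csv:

import pandas as pd

p_gn = pd.read_csv("prediction_gn.csv").iloc[0]    # infant id -> P(O) under the GN HMM (hmm_out_1)
p_gf = pd.read_csv("prediction_gf.csv").iloc[0]    # infant id -> P(O) under the GF HMM (hmm_out_2)

# Predict the class whose HMM assigns the higher probability of observation
predictions = (p_gn > p_gf).map({True: "GN", False: "GF"})
print(predictions)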



Notes

  • Additional notes are provided at the beginning of the Python scripts.
  • Caveats to using the DMM code:
    • If a problem occurs while running the DMM code, an error message will be printed in the terminal: "Error occured while running DMM"
    • This is possibly due to a missing library package needed to compile the DMM C code. If this is suspected, try installing the package by typing the following in the terminal:
    • sudo apt-get install libgsl0ldbl

    • When the dimension of the input data is too big, an error occurs in the DMM code, so my code is designed to check whether the column count exceeds 1800. The DMM code does not run as-is with the sample data (shape = 444 x 2997) included in the infant_microbiome_hmm code; renaming the columns to integers, which shortens each column name, helped.
    • My code initially checks whether the column count of the input data exceeds 1800. If it does, it trims the columns based on the starting PMA date before running DMM; otherwise, it runs DMM first and trims the columns afterward.
  • The preset parameters in the parameters.txt file generated by initiate_hmm.py are values determined from the writer's own runs.
  • The likelihood of the HMM is calculated using the Viterbi path, not the full joint probability.
  • The infant label file must be a tsv file, not a csv file.







MAZE GAME

Exit-finding maze game created using pygame.
Implemented using: Python 3.11.5, pygame 2.5.1


Link to Repository



This is an exit-finding maze game. A mouse and keyboard are needed to play. Within the pygame framework, a maze-generating algorithm creates a new maze every time the player clicks a level (easy, medium, hard). There is only one correct path from start to finish.

Maze Generation

The grid frame of the maze is a nested list. A Node object is added to each grid cell and stores the directions (neighbors) it can move to. The direction to each node's parent and the directions each node selects make up the path of the maze. The start and end nodes are placed diagonally opposite each other when the maze is initialized.

After implementing the maze-generating algorithm, I learned that I had been using the "Hunt and Kill" maze-generation algorithm. In the hunt mode, the algorithm branches out until it meets a dead end or the end node. Once it reaches a dead end, it goes into the kill mode, finding a new path from the visited nodes. If there are no more paths that can be branched out from visited nodes, the algorithm finishes.
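
A compact, illustrative version of this branch-until-dead-end / restart-from-visited-nodes loop (not the game's actual Node and grid classes):

import random

def hunt_and_kill(width, height):
    """Toy hunt-and-kill generator on a width x height grid. Returns, for each
    cell, the set of neighbouring cells it connects to (the maze passages)."""
    passages = {(x, y): set() for x in range(width) for y in range(height)}
    visited = {(0, 0)}

    def open_neighbors(cell):
        x, y = cell
        cand = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        return [c for c in cand if c in passages and c not in visited]

    current = (0, 0)
    while current is not None:
        options = open_neighbors(current)
        if options:                                   # branch out until a dead end
            nxt = random.choice(options)
            passages[current].add(nxt)
            passages[nxt].add(current)
            visited.add(nxt)
            current = nxt
        else:                                         # restart from a visited node with unvisited neighbours
            current = next((c for c in visited if open_neighbors(c)), None)
    return passages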

More maze-generation algorithms are explained here:

Link

The algorithm was tuned so that it sticks either to the wall or to visited nodes when finding paths, producing nicer-looking mazes.


Pygame Framework

To build the maze in pygame, rectangles were generated for the walls, the player, and the end location. The game was designed so that the player rectangle stops moving when it collides with a wall, so the player can only move along the path of the maze.
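
The wall-collision rule can be sketched with pygame's Rect API (a simplified, hypothetical helper, not the game's actual movement code):

import pygame

def try_move(player: pygame.Rect, dx: int, dy: int, walls: list[pygame.Rect]) -> pygame.Rect:
    """Return the moved player rectangle, or the original one if the move hits a wall."""
    moved = player.move(dx, dy)
    if moved.collidelist(walls) == -1:    # -1 means no wall rectangle was hit
        return moved
    return player                         # blocked by a wall: stay in place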

Button.py was fetched from the following repository:

Link

The background music and the song that plays when the maze is finished are both from Bensound, a royalty-free music website.




PPI Network Motif Finding

Computational Genomics


Link to Repository

Network Motif Finding

This project aims to implement some motif-finding algorithms on a biological dataset. The dataset used here is the Glioblastoma Protein-Protein Interactions data compiled by BioGRID.

Two types of motif-search strategies are employed:
  1. Using predefined motifs: Some known biological motifs are Feed-Forward Loops (FFLs), 3-Cycles, BiFans, Single-Input Modules (SIM) and Multiple-Input Modules (MIM). In particular, we search for FFLs, 3-Cycles and BiFans in a directed version of the graph.
  2. Searching for motifs de novo: When interesting motifs are not known, we can use existing motif structures in the network to query the whole network. However, this can be computationally expensive and is generally considered NP-complete. The pattern-growing and pruning method is used here to grow size-k motifs from scratch and query them in the networks.

Motifs are counted using the subgraph census method rather than an exhaustive search. This involves taking a large number of subsamples of the original network and searching for motifs within those subgraphs to create a frequency distribution of motif counts. This should take on a Gaussian form, which can be compared with random networks using a standard Z-test. We use randomly generated Barabasi-Albert (BA) and Erdos-Renyi networks here to create "background" frequency distributions.
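
A simplified version of this census for a single small motif (triangles), compared against an Erdos-Renyi background, is sketched below with synthetic stand-in graphs; it is illustrative rather than the project's actual code:

import random
import numpy as np
import networkx as nx

def triangle_census(G, n_samples=200, sample_size=50, seed=0):
    """Count triangles in randomly sampled node-induced subgraphs of G."""
    rng = random.Random(seed)
    nodes = list(G.nodes)
    counts = []
    for _ in range(n_samples):
        sub = G.subgraph(rng.sample(nodes, sample_size))
        counts.append(sum(nx.triangles(sub).values()) // 3)
    return np.array(counts)

# Synthetic stand-in for the observed PPI network and an Erdos-Renyi background of matching density
G_obs = nx.powerlaw_cluster_graph(500, 3, 0.3, seed=0)
G_rand = nx.erdos_renyi_graph(G_obs.number_of_nodes(), nx.density(G_obs), seed=0)

obs, rand = triangle_census(G_obs), triangle_census(G_rand)
z = (obs.mean() - rand.mean()) / (rand.std() + 1e-9)    # simple Z-test against the background
print(f"triangle z-score vs. random background: {z:.2f}")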


Finally, the biological significance of the highest-occurring size-k motifs in the undirected network is assessed. This is done by choosing the motifs with the highest frequency, taking all the proteins found in every occurrence, and passing this list to the Gene Ontology (GO) Resource. A list of all proteins found in the glioblastoma network is also used as a reference list to calculate the enrichment of terms in specific motifs. We find highly enriched and statistically significant Biological Processes (BP) relating to cell signaling, protein modifications and metabolism, as well as PANTHER pathways relating to epithelial-mesenchymal transition (EMT) and brain-specific processes such as axon guidance and dopamine signaling.

The code used in this project, the results, and the written report are available in this repository.