BIOINFORMATICIAN
AI/ML Research Scientist specializing in applying computational methods to analysis, modeling, and prediction in medical research and protein structure. Experienced in developing and deploying bioinformatics pipelines for large-scale datasets, including next-generation sequencing data. Proficient in Python and Linux/Unix, with expertise in experimental design, data interpretation, and machine learning applications.
TECHNICAL SKILLS
Programming
Python, Bash, Go, Java
Packages
Hail, PyMOL, pandas, NumPy, PyTorch,
Matplotlib, scikit-learn
Tools
Git, GCP, Linux (Ubuntu), Conda, Amber (Molecular Dynamics package)
EDUCATION
Master of Science in Computational Biology
Carnegie Mellon University
2021 - 2023
Bachelor of Arts in Biology
Grinnell College
2012 - 2018
RESEARCH
Substance Abuse Prediction in Depression Patients using Deep Learning
Mar 2024 - Ongoing
Project Advisor: Dr. LiRong Wang
University of Pittsburgh, Pittsburgh, PA
Designed and implemented a Python pipeline for processing Electronic Health Record (EHR) data, adapting a BERT transformer to use EHR codes for a single-disease prediction task on the NIH All of Us platform.
Predicted the risk of alcohol and substance abuse in at-risk patients with an AUC-ROC of 0.95 and accuracy of 92%, surpassing the baseline logistic regression model by 0.13 in AUC-ROC and 5% in accuracy.
Hidden Markov Model Guided Predictive Enzyme Mutagenesis
Oct 2023 - Jan 2024
Project Advisor: Dr. Peng Liu
University of Pittsburgh, Pittsburgh, PA
Utilized the Amber package to run protein-ligand Molecular Dynamics (MD) simulations and extract time-series atomic-distance data from MD trajectories.
Implemented a Python pipeline that identifies potential key positions at the active site for mutations, using a Hidden Markov Model (HMM) on atomic-distance data to improve enzyme-substrate binding specificity.
Preterm Infant Growth Trajectory Prediction using Microbiome Data
May 2022 - Aug 2022
Project Advisors: Ziv Bar-Joseph, Jose Lugo-Martinez
CMU, Pittsburgh, PA
Developed an HMM classifier in Python from scratch to predict growth trajectories (normal vs. faltering) in preterm infants using gut microbiome data, demonstrating proficiency in Python and machine learning.
Despite achieving a moderate AUC-ROC of 0.67, provided insights into the limitations of HMMs on sparse microbiome datasets, guiding the research towards a more suitable model, the Input-Output HMM, to improve prediction.
COURSEWORK PROJECTS
Small Molecule Anti-Bacterial Potency Screening using Active Learning
Feb 2023 - May 2023
Using the EMC method, achieved 71% accuracy in predicting the antibacterial effect of small molecules with 40% fewer samples than random sampling required to reach the same accuracy, demonstrating the method's efficiency in reducing sample size during model training while effectively identifying informative samples.
Motif Search in Glioblastoma Protein-Protein Interaction Network
Feb 2023 - May 2023
Designed a comparative analysis of PPI motifs between random and glioblastoma networks, revealing significantly different motif distributions and predicting motif functions through gene/protein ontology analysis.
INFANT MICROBIOME HMM
Code designed to run on Linux
Python 3.7
Ubuntu 22.04.3 LTS
Link to Repository
An HMM is suitable for modeling sequential or time-series data. In this project, infant microbiome data was collected at multiple timepoints, labeled using the postmenstrual age (PMA) of the infants. Because the microbiome abundance samples were collected at varying timepoints for each infant, the sampling was aligned to 10 PMA timepoints shared across all samples as closely as possible, with each PMA timepoint 7 days apart. As a result, some samples have missing timepoint data. However, an HMM can still be learned when values are missing by taking into account all possible assignments of the hidden state, weighted by their probability, which makes the HMM a suitable model for handling clinical data with missing values.
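As a minimal illustration of this missing-value handling (a sketch only, not the code in this repository), a forward pass can treat a missing timepoint as contributing an emission likelihood of 1, so the recursion sums over every possible hidden-state assignment at that step:

```python
import numpy as np

# Sketch: HMM forward pass that marginalizes over missing observations.
# Not the repository's implementation; parameter values are toy examples.
def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """obs: list of emission indices, with None marking a missing timepoint.
    start_p: (S,) initial state probabilities
    trans_p: (S, S) state transition probabilities
    emit_p:  (S, V) emission probabilities
    """
    S = len(start_p)
    # a missing observation contributes likelihood 1 for every hidden state
    e = emit_p[:, obs[0]] if obs[0] is not None else np.ones(S)
    alpha = start_p * e
    for o in obs[1:]:
        e = emit_p[:, o] if o is not None else np.ones(S)
        alpha = (alpha @ trans_p) * e
    return np.log(alpha.sum())

# toy example: 2 hidden states, 3 emission symbols, one missing timepoint
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_log_likelihood([0, None, 2, 1], start, trans, emit))
```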
- The user can run infant_microbiome_hmm.py to simply run an HMM without dividing the data into two sets based on given labels.
- The user can run infant_microbiome_two_hmm.py if label data that classifies each infant as either GN or GF is available. This script runs DMM to assign states to all samples in the combined data, then divides the data into GN and GF sets and runs an HMM separately on the two data sets. The HMM outputs are saved in separate folders, with the GN HMM output saved first and the GF HMM output saved next.
DMM Dependencies
- Compiling the DMM C code requires the GSL library; see the Notes section below for the install command.
Output File System
- DMM outputs will be saved in dmm_outputs/dmm_out
- HMM outputs will be saved in hmm_outputs/hmm_out
- If the code is run multiple times, the output of each run will be saved in a different directory with a different index
Input Data
- Input data is an Operational Taxonomic Unit (OTU) table csv file.
- The OTU table in this project contains the abundance of bacterial taxa (data type = float)
- The first column of the OTU table should be the "Taxa" column, which contains the names of the bacterial species (data type = string)
- Sample microbiome abundance data is given at "infant_microbiome_hmm/data/otu_table_sample.csv"
- Sample label data is given at "infant_microbiome_hmm/data/class_labels.tsv"
- No test data is provided.
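The snippet below is a minimal pandas sketch of loading and sanity-checking an OTU table in this layout (illustrative only; it is not part of the repository's scripts):

```python
import pandas as pd

# Sketch: verify the expected input layout, i.e. a "Taxa" column of bacterial
# species names followed by float abundance columns (one column per sample).
otu = pd.read_csv("infant_microbiome_hmm/data/otu_table_sample.csv")

assert otu.columns[0] == "Taxa", "first column must be the Taxa column"
assert otu["Taxa"].map(lambda x: isinstance(x, str)).all(), "Taxa names must be strings"

abundances = otu.drop(columns="Taxa").astype(float)  # abundance values per sample
print(abundances.shape)  # (number of taxa, number of samples)
```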
Steps to use the code
Clone the code:
git clone https://github.com/iland24/infant_microbiome_hmm
cd ./infant_microbiome_hmm/
Run initiate_hmm.py
- Downloads the DMM code zip file
- Extracts the DMM zip file
- Saves parameters.txt in the current working directory (which should be infant_microbiome_hmm). The parameters.txt file must be read by either infant_microbiome_hmm.py or infant_microbiome_two_hmm.py to run DMM and HMM
sample command:
python3 initiate_hmm.py
- After running initiate_hmm.py, the user can fill in the parameters that will be used by the scripts that run DMM and HMM
- The user must enter the path to the input data in the parameters.txt file before running the next script
- Default parameter values are set for the provided sample data
Run single HMM (a) or two HMM (b)
Run the single-HMM script to train an HMM using the microbiome abundance time-series training data in the infant_microbiome_hmm/data directory. Run the two-HMM script if there is a label file in the infant_microbiome_hmm/data directory along with the training data. The two-HMM script outputs a pair of HMM outputs; each output can be used to calculate the P(O) of test data using the next script.
- The format of the training data and the label file must match the provided samples in order to run these scripts.
(a) infant_microbiome_hmm.py (single HMM output script)
- Preprocesses data (a sketch of these steps is given after the sample command below)
  - Selects rows (taxa) with the highest variance (the user can define the fraction of taxa with the highest variance to keep)
  - Selects 10 timepoint samples for each infant based on the start PMA timepoint
  - Drops infants with more than 5 missing timepoint samples
- Runs DMM code to assign states to each sample
- Combines the states output by the DMM code with the infant microbiome abundance time-series training data
- Runs HMM
- Input:
  - parameters.txt
- Output:
  - dmm_outputs/dmm_out
  - hmm_outputs/hmm_out
    - HMM output from the given dataset
sample command:
python3 infant_microbiome_hmm.py -p ./parameters.txt
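Below is a rough pandas sketch of the three preprocessing steps listed above. It is illustrative only: the column naming convention ("<infant_id>_<pma>") and the exact selection logic are assumptions, not the actual code in infant_microbiome_hmm.py.

```python
import pandas as pd

# Sketch of the preprocessing described above (assumed column names "<infant_id>_<pma>").
def preprocess_otu(otu_csv, variance_fraction=0.5, start_pma=196,
                   n_timepoints=10, max_missing=5):
    df = pd.read_csv(otu_csv).set_index("Taxa")  # rows = taxa, columns = samples

    # 1. keep the user-defined fraction of taxa with the highest variance
    n_keep = int(len(df) * variance_fraction)
    df = df.loc[df.var(axis=1).sort_values(ascending=False).index[:n_keep]]

    # 2. keep 10 PMA timepoints per infant starting at start_pma (7-day spacing)
    wanted_pmas = {start_pma + 7 * i for i in range(n_timepoints)}
    df = df[[c for c in df.columns if int(c.rsplit("_", 1)[1]) in wanted_pmas]]

    # 3. drop infants with more than max_missing missing timepoints
    infants = pd.Series([c.rsplit("_", 1)[0] for c in df.columns], index=df.columns)
    counts = infants.groupby(infants).size()
    kept = set(counts[counts >= n_timepoints - max_missing].index)
    return df[[c for c in df.columns if c.rsplit("_", 1)[0] in kept]]
```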
(b) infant_microbiome_two_hmm.py (two HMM output script)
- Preprocesses data
  - Selects rows (taxa) with the highest variance (the user can define the fraction of taxa with the highest variance to keep)
  - Selects 10 timepoint samples for each infant based on the start PMA timepoint
  - Drops infants with more than 5 missing timepoint samples
- Runs DMM code to assign states to each sample
- Combines the states output by the DMM code with the infant gut microbiome data
- Separates the data into GN and GF sets based on the given label data (see the sketch after the sample command below)
- Runs HMM on the GN data set
- Runs HMM on the GF data set
- Input:
  - parameters.txt
  - class_labels.tsv
    - must be a tsv file, not csv
- Output:
  - dmm_outputs/dmm_out
  - hmm_outputs/hmm_out_1
    - HMM output from the GN dataset
  - hmm_outputs/hmm_out_2
    - HMM output from the GF dataset
sample command:
python3 infant_microbiome_two_hmm.py -p ./parameters.txt -l ./data/class_labels.tsv
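The following is a rough sketch of the GN/GF split performed by this script (illustrative only; the column layout of class_labels.tsv and the sample naming convention are assumptions):

```python
import pandas as pd

# Sketch: split the OTU table into GN and GF sets using the label file.
# Assumes class_labels.tsv has an infant-ID column followed by a GN/GF label
# column, and that OTU columns are named "<infant_id>_<pma>".
labels = pd.read_csv("./data/class_labels.tsv", sep="\t")
id_col, label_col = labels.columns[0], labels.columns[1]

gn_ids = set(labels.loc[labels[label_col] == "GN", id_col].astype(str))
gf_ids = set(labels.loc[labels[label_col] == "GF", id_col].astype(str))

otu = pd.read_csv("./data/otu_table_sample.csv").set_index("Taxa")
gn_set = otu[[c for c in otu.columns if c.rsplit("_", 1)[0] in gn_ids]]
gf_set = otu[[c for c in otu.columns if c.rsplit("_", 1)[0] in gf_ids]]
print(gn_set.shape, gf_set.shape)
```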
output_p_of_o.py:
- Preprocesses the test OTU table csv file the same way the input data was preprocessed in infant_microbiome_hmm.py / infant_microbiome_two_hmm.py.
- This script does not select rows based on variance. Specific rows (taxa) must be preselected by the user before feeding the test data to this script.
- Calculates P(O) for all the infants in the test data
- Saves each ID and its corresponding P(O) value as a csv file
- Input:
  - file path to the microbiome abundance time-series test data
  - directory path to hmm_out
  - path to save the output
  - number of states used to run DMM and HMM
  - PMA start date used to run DMM and HMM
- Output:
  - csv file with infant IDs as columns and the P(O) (probability of observing the data) of each infant as the first row.
- output_p_of_o.py should be run using both the GN and GF HMM outputs
- For each infant ID, the two P(O) values should be compared to predict whether the infant will be GN or GF (a sketch is given after the sample command below)
sample command:
python3 output_p_of_o.py -f data/otu_table_test_data.csv -m ./hmm_outputs/hmm_out_1 -o ./prediction.csv -s 6 -p 196
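A minimal sketch of how the two resulting prediction files could be compared (the file names below are hypothetical; the output format, infant IDs as columns and P(O) as the first row, follows the description above):

```python
import pandas as pd

# Sketch: compare P(O) obtained from the GN HMM output (hmm_out_1) with P(O)
# obtained from the GF HMM output (hmm_out_2). File names are hypothetical.
gn = pd.read_csv("./prediction_gn.csv")  # infant IDs as columns, P(O) in the first row
gf = pd.read_csv("./prediction_gf.csv")

predictions = {}
for infant_id in gn.columns:
    # label each infant by whichever HMM assigns the higher P(O) to its observations
    predictions[infant_id] = "GN" if gn[infant_id].iloc[0] >= gf[infant_id].iloc[0] else "GF"

print(predictions)
```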
Notes
- Additional notes are provided at the beginning of the Python scripts.
- Caveats to using the DMM code:
  - If a problem occurs while running the DMM code, an error message will be printed in the terminal: "Error occured while running DMM"
  - This is possibly due to a missing library package needed to compile the DMM C code. If this is suspected, try installing the package by typing the following in the terminal:
sudo apt-get install libgsl0ldbl
- When the dimension of the input data is too big, an error occurs in the DMM code, so my code is designed to check whether the number of columns exceeds 1800. The DMM code does not run on the sample data (shape = 444 x 2997) included in the infant_microbiome_hmm code; renaming the columns to integers, which shortens each column name, helped.
- My code initially checks whether the number of columns in the input data exceeds 1800. If it does, it trims the columns based on the starting PMA date before running DMM; if not, it runs DMM first and trims the columns afterwards.
- The preset parameters in the parameters.txt file generated by initiate_hmm.py are values determined from the writer's own runs of the code.
- The likelihood of the HMM is calculated using the Viterbi path, not the full joint probability (see the sketch after this list).
- The infant label file must be a tsv file, not a csv file.
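To make the Viterbi note concrete, here is a sketch (not the repository's code) of a Viterbi-path log-likelihood; unlike the forward recursion shown earlier, which sums over all hidden-state paths to give the full joint probability P(O), this scores only the single most probable path and is therefore a lower bound on log P(O):

```python
import numpy as np

# Sketch: log-probability of the single best hidden-state path (Viterbi likelihood).
# Contrast with the forward pass shown earlier, which sums over all paths.
def viterbi_log_likelihood(obs, start_p, trans_p, emit_p):
    delta = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for o in obs[1:]:
        # best predecessor state for each current state, then add the emission term
        delta = np.max(delta[:, None] + np.log(trans_p), axis=0) + np.log(emit_p[:, o])
    return delta.max()

start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi_log_likelihood([0, 1, 2, 1], start, trans, emit))
```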