BIOINFORMATICIAN
AI/ML Research Scientist specializing in applying computational methods to analysis, modeling, and prediction in medical research and protein structure. Experienced in developing and deploying bioinformatics pipelines for large-scale datasets, including next-generation sequencing data. Proficient in Python and Linux/Unix, with expertise in experimental design, data interpretation, and machine learning applications.
TECHNICAL SKILLS
Programming
Python, Bash, Go, Java
Packages
Hail, PyMOL, pandas, NumPy, PyTorch,
Matplotlib, scikit-learn
Tools
Git, GCP, Linux (Ubuntu), Conda, Amber (Molecular Dynamics package)
EDUCATION
Master of Science in Computational Biology
Carnegie Mellon University
2021 - 2023
Bachelor of Arts in Biology
Grinnell College
2012 - 2018
RESEARCH
Substance Abuse Prediction in Depression Patients using Deep Learning
Mar 2024 - Ongoing
Project Advisor: Dr. LiRong Wang
University of Pittsburgh, Pittsburgh, PA
Designed and implemented a Python pipeline for processing Electronic Health Record (EHR) data, adapting a BERT transformer to use EHR codes for a single-disease prediction task on the NIH All of Us platform.
Predicted the risk of alcohol and substance abuse in at-risk patients with an AUC-ROC of 0.95 and accuracy of 92%, surpassing the baseline logistic regression model by 0.13 in AUC-ROC and 5% in accuracy.
Hidden Markov Model Guided Predictive Enzyme Mutagenesis
Oct 2023 - Jan 2024
Project Advisor: Dr. Peng Liu
University of Pittsburgh, Pittsburgh, PA
Utilized the Amber package to run protein-ligand Molecular Dynamics (MD) simulations and extract time-series atomic-distance data from MD trajectories.
Implemented a Python pipeline that identifies potential key positions at the active site for mutations, using a Hidden Markov Model (HMM) on atomic-distance data to improve enzyme-substrate binding specificity.
Preterm Infant Growth Trajectory Prediction using Microbiome Data
May 2022 - Aug 2022
Project Advisors: Ziv Bar-Joseph, Jose Lugo-Martinez
CMU, Pittsburgh, PA
Developed an HMM classifier in Python from scratch to predict growth trajectories (normal vs. faltering) in preterm infants using gut microbiome data, demonstrating proficiency in Python and machine learning.
Despite achieving a moderate AUC-ROC of 0.67, provided insights into the limitations of HMMs on sparse microbiome datasets, guiding the research towards a more suitable model, the Input-Output HMM, to improve prediction.
COURSEWORK PROJECTS
Small Molecule Anti-Bacterial Potency Screening using Active Learning
Feb 2023 - May 2023
Using the EMC method, achieved 71% accuracy in predicting the antibacterial effect of small molecules with 40% fewer samples than random sampling required to reach the same accuracy, demonstrating the method's efficiency in reducing sample size during model training while effectively identifying informative samples.
Motif Search in Glioblastoma Protein-Protein Interaction Network
Feb 2023 - May 2023
Designed a comparative analysis of PPI motifs between random and glioblastoma networks, revealing significantly different motif distributions and predicting motif functions through gene/protein ontology analysis.
INFANT MICROBIOME HMM
Code designed to run on Linux
Python 3.7
Ubuntu 22.04.3 LTS
Link to Repository
An HMM is suitable for modeling sequential or time-series data. In this project, infant microbiome data was collected at multiple timepoints, labeled using the postmenstrual age (PMA) of the infants. Because the microbiome abundance samples were collected at varying timepoints for each infant, the sampling was aligned to 10 PMA timepoints shared across all samples as closely as possible, with each PMA timepoint 7 days apart. As a result, some samples have missing timepoint data. However, an HMM can still be learned when values are missing by taking into account all possible assignments of the hidden state, weighted by their probability, which makes the HMM a suitable model for handling clinical data with missing values.
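As a minimal illustration of this missing-value handling (a sketch only, not the code in this repository), a forward pass can treat a missing timepoint as contributing an emission likelihood of 1, so the recursion sums over every possible hidden-state assignment at that step:

```python
import numpy as np

# Sketch: HMM forward pass that marginalizes over missing observations.
# Not the repository's implementation; parameter values are toy examples.
def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """obs: list of emission indices, with None marking a missing timepoint.
    start_p: (S,) initial state probabilities
    trans_p: (S, S) state transition probabilities
    emit_p:  (S, V) emission probabilities
    """
    S = len(start_p)
    # a missing observation contributes likelihood 1 for every hidden state
    e = emit_p[:, obs[0]] if obs[0] is not None else np.ones(S)
    alpha = start_p * e
    for o in obs[1:]:
        e = emit_p[:, o] if o is not None else np.ones(S)
        alpha = (alpha @ trans_p) * e
    return np.log(alpha.sum())

# toy example: 2 hidden states, 3 emission symbols, one missing timepoint
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_log_likelihood([0, None, 2, 1], start, trans, emit))
```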
- The user can run infant_microbiome_hmm.py to simply run an HMM without dividing the data into two sets based on given labels.
- The user can run infant_microbiome_two_hmm.py if label data that classifies each infant as either GN or GF is available. This script runs DMM to assign states to all samples in the combined data, then divides the data into GN and GF sets and runs an HMM separately on the two data sets. The HMM outputs are saved in separate folders, with the GN HMM output saved first and the GF HMM output saved next.
DMM Dependencies
- Compiling the DMM C code requires the GSL library; see the Notes section below for the install command.
Output File System
- DMM outputs will be saved in dmm_outputs/dmm_out
- HMM outputs will be saved in hmm_outputs/hmm_out
- If the code is run multiple times, the output of each run will be saved in a different directory with a different index
Input Data
- Input data is an Operational Taxonomic Unit (OTU) table csv file.
- The OTU table in this project contains the abundance of bacterial taxa (data type = float)
- The first column of the OTU table should be the "Taxa" column, which contains the names of the bacterial species (data type = string)
- Sample microbiome abundance data is given at "infant_microbiome_hmm/data/otu_table_sample.csv"
- Sample label data is given at "infant_microbiome_hmm/data/class_labels.tsv"
- No test data is provided.
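The snippet below is a minimal pandas sketch of loading and sanity-checking an OTU table in this layout (illustrative only; it is not part of the repository's scripts):

```python
import pandas as pd

# Sketch: verify the expected input layout, i.e. a "Taxa" column of bacterial
# species names followed by float abundance columns (one column per sample).
otu = pd.read_csv("infant_microbiome_hmm/data/otu_table_sample.csv")

assert otu.columns[0] == "Taxa", "first column must be the Taxa column"
assert otu["Taxa"].map(lambda x: isinstance(x, str)).all(), "Taxa names must be strings"

abundances = otu.drop(columns="Taxa").astype(float)  # abundance values per sample
print(abundances.shape)  # (number of taxa, number of samples)
```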
Steps to use the code
Clone the code:
git clone https://github.com/iland24/infant_microbiome_hmm
cd ./infant_microbiome_hmm/
Run initiate_hmm.py
- Downloads the DMM code zip file
- Extracts the DMM zip file
- Saves parameters.txt in the current working directory (which should be infant_microbiome_hmm). The parameters.txt file must be read by either infant_microbiome_hmm.py or infant_microbiome_two_hmm.py to run DMM and HMM
sample command:
python3 initiate_hmm.py
- After running initiate_hmm.py, the user can fill in the parameters that will be used by the scripts that run DMM and HMM
- The user must enter the path to the input data in the parameters.txt file before running the next script
- Default parameter values are set for the provided sample data
Run single HMM (a) or two HMM (b)
Run the single-HMM script to train an HMM using the microbiome abundance time-series training data in the infant_microbiome_hmm/data directory. Run the two-HMM script if there is a label file in the infant_microbiome_hmm/data directory along with the training data. The two-HMM script outputs a pair of HMM outputs; each output can be used to calculate the P(O) of test data using the next script.
- The format of the training data and the label file must match the provided samples in order to run these scripts.
(a) infant_microbiome_hmm.py (single HMM output script)
- Preprocesses data (a sketch of these steps is given after the sample command below)
  - Selects rows (taxa) with the highest variance (the user can define the fraction of taxa with the highest variance to keep)
  - Selects 10 timepoint samples for each infant based on the start PMA timepoint
  - Drops infants with more than 5 missing timepoint samples
- Runs DMM code to assign states to each sample
- Combines the states output by the DMM code with the infant microbiome abundance time-series training data
- Runs HMM
- Input:
  - parameters.txt
- Output:
  - dmm_outputs/dmm_out
  - hmm_outputs/hmm_out
    - HMM output from the given dataset
sample command:
python3 infant_microbiome_hmm.py -p ./parameters.txt
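Below is a rough pandas sketch of the three preprocessing steps listed above. It is illustrative only: the column naming convention ("<infant_id>_<pma>") and the exact selection logic are assumptions, not the actual code in infant_microbiome_hmm.py.

```python
import pandas as pd

# Sketch of the preprocessing described above (assumed column names "<infant_id>_<pma>").
def preprocess_otu(otu_csv, variance_fraction=0.5, start_pma=196,
                   n_timepoints=10, max_missing=5):
    df = pd.read_csv(otu_csv).set_index("Taxa")  # rows = taxa, columns = samples

    # 1. keep the user-defined fraction of taxa with the highest variance
    n_keep = int(len(df) * variance_fraction)
    df = df.loc[df.var(axis=1).sort_values(ascending=False).index[:n_keep]]

    # 2. keep 10 PMA timepoints per infant starting at start_pma (7-day spacing)
    wanted_pmas = {start_pma + 7 * i for i in range(n_timepoints)}
    df = df[[c for c in df.columns if int(c.rsplit("_", 1)[1]) in wanted_pmas]]

    # 3. drop infants with more than max_missing missing timepoints
    infants = pd.Series([c.rsplit("_", 1)[0] for c in df.columns], index=df.columns)
    counts = infants.groupby(infants).size()
    kept = set(counts[counts >= n_timepoints - max_missing].index)
    return df[[c for c in df.columns if c.rsplit("_", 1)[0] in kept]]
```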
(b) infant_microbiome_two_hmm.py (two HMM output script)
- Preprocesses data
  - Selects rows (taxa) with the highest variance (the user can define the fraction of taxa with the highest variance to keep)
  - Selects 10 timepoint samples for each infant based on the start PMA timepoint
  - Drops infants with more than 5 missing timepoint samples
- Runs DMM code to assign states to each sample
- Combines the states output by the DMM code with the infant gut microbiome data
- Separates the data into GN and GF sets based on the given label data (see the sketch after the sample command below)
- Runs HMM on the GN data set
- Runs HMM on the GF data set
- Input:
  - parameters.txt
  - class_labels.tsv
    - must be a tsv file, not csv
- Output:
  - dmm_outputs/dmm_out
  - hmm_outputs/hmm_out_1
    - HMM output from the GN dataset
  - hmm_outputs/hmm_out_2
    - HMM output from the GF dataset
sample command:
python3 infant_microbiome_two_hmm.py -p ./parameters.txt -l ./data/class_labels.tsv
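The following is a rough sketch of the GN/GF split performed by this script (illustrative only; the column layout of class_labels.tsv and the sample naming convention are assumptions):

```python
import pandas as pd

# Sketch: split the OTU table into GN and GF sets using the label file.
# Assumes class_labels.tsv has an infant-ID column followed by a GN/GF label
# column, and that OTU columns are named "<infant_id>_<pma>".
labels = pd.read_csv("./data/class_labels.tsv", sep="\t")
id_col, label_col = labels.columns[0], labels.columns[1]

gn_ids = set(labels.loc[labels[label_col] == "GN", id_col].astype(str))
gf_ids = set(labels.loc[labels[label_col] == "GF", id_col].astype(str))

otu = pd.read_csv("./data/otu_table_sample.csv").set_index("Taxa")
gn_set = otu[[c for c in otu.columns if c.rsplit("_", 1)[0] in gn_ids]]
gf_set = otu[[c for c in otu.columns if c.rsplit("_", 1)[0] in gf_ids]]
print(gn_set.shape, gf_set.shape)
```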
output_p_of_o.py:
- Preprocesses the test OTU table csv file the same way the input data was preprocessed in infant_microbiome_hmm.py / infant_microbiome_two_hmm.py.
- This script does not select rows based on variance. Specific rows (taxa) must be preselected by the user before feeding the test data to this script.
- Calculates P(O) for all the infants in the test data
- Saves each ID and its corresponding P(O) value as a csv file
- Input:
  - file path to the microbiome abundance time-series test data
  - directory path to hmm_out
  - path to save the output
  - number of states used to run DMM and HMM
  - PMA start date used to run DMM and HMM
- Output:
  - csv file with infant IDs as columns and the P(O) (probability of observing the data) of each infant as the first row.
- output_p_of_o.py should be run using both the GN and GF HMM outputs
- For each infant ID, the two P(O) values should be compared to predict whether the infant will be GN or GF (a sketch is given after the sample command below)
sample command:
python3 output_p_of_o.py -f data/otu_table_test_data.csv -m ./hmm_outputs/hmm_out_1 -o ./prediction.csv -s 6 -p 196
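A minimal sketch of how the two resulting prediction files could be compared (the file names below are hypothetical; the output format, infant IDs as columns and P(O) as the first row, follows the description above):

```python
import pandas as pd

# Sketch: compare P(O) obtained from the GN HMM output (hmm_out_1) with P(O)
# obtained from the GF HMM output (hmm_out_2). File names are hypothetical.
gn = pd.read_csv("./prediction_gn.csv")  # infant IDs as columns, P(O) in the first row
gf = pd.read_csv("./prediction_gf.csv")

predictions = {}
for infant_id in gn.columns:
    # label each infant by whichever HMM assigns the higher P(O) to its observations
    predictions[infant_id] = "GN" if gn[infant_id].iloc[0] >= gf[infant_id].iloc[0] else "GF"

print(predictions)
```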
Notes
- Additional notes are provided at the beginning of the Python scripts.
- Caveats to using the DMM code:
  - If a problem occurs while running the DMM code, an error message will be printed in the terminal: "Error occured while running DMM"
  - This is possibly due to a missing library package needed to compile the DMM C code. If this is suspected, try installing the package by typing the following in the terminal:
sudo apt-get install libgsl0ldbl
- When the dimension of the input data is too big, an error occurs in the DMM code, so my code is designed to check whether the number of columns exceeds 1800. The DMM code does not run on the sample data (shape = 444 x 2997) included in the infant_microbiome_hmm code; renaming the columns to integers, which shortens each column name, helped.
- My code initially checks whether the number of columns in the input data exceeds 1800. If it does, it trims the columns based on the starting PMA date before running DMM; if not, it runs DMM first and trims the columns afterwards.
- The preset parameters in the parameters.txt file generated by initiate_hmm.py are values determined from the writer's own runs of the code.
- The likelihood of the HMM is calculated using the Viterbi path, not the full joint probability (see the sketch after this list).
- The infant label file must be a tsv file, not a csv file.
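To make the Viterbi note concrete, here is a sketch (not the repository's code) of a Viterbi-path log-likelihood; unlike the forward recursion shown earlier, which sums over all hidden-state paths to give the full joint probability P(O), this scores only the single most probable path and is therefore a lower bound on log P(O):

```python
import numpy as np

# Sketch: log-probability of the single best hidden-state path (Viterbi likelihood).
# Contrast with the forward pass shown earlier, which sums over all paths.
def viterbi_log_likelihood(obs, start_p, trans_p, emit_p):
    delta = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for o in obs[1:]:
        # best predecessor state for each current state, then add the emission term
        delta = np.max(delta[:, None] + np.log(trans_p), axis=0) + np.log(emit_p[:, o])
    return delta.max()

start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi_log_likelihood([0, 1, 2, 1], start, trans, emit))
```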