About

Hi! I am a second-year Master's student in Computer Science at Columbia University, advised by Mohammed AlQuraishi. I completed my Bachelor's in Statistics and Computer Science at McGill University in May 2024, where I worked with Jun Ding on ML for single-cell genomics and VAEs for spatial transcriptomics, Anmar Khadra on MCMC methods and probabilistic modeling for binding dynamics simulation, and Hamed Hatami on ML theory, randomized algorithms and communication complexity.

My research focuses on advancing foundation models and their efficient deployment across diverse data modalities:

  • Efficient large-scale model architectures: Foundation models, self-supervised learning, and speculative decoding techniques for computational acceleration
  • Long-context modeling for scientific applications: Large language models for genomic sequences and other high-dimensional biological data
  • Multimodal intelligence: Vision-language models, text-to-video generation, and cross-modal understanding

Professional Experience

The Feinstein Institutes
May 2025 - Sep 2025
Data Scientist Intern
  • Designed comprehensive experimental framework for multi-class response prediction across treatment conditions using PyTorch, Docker, MLflow, and Scikit-learn
  • Built CI/CD-ready PyTorch pipeline with memory-optimized preprocessing and automated quality control, deployed on HPC clusters via Docker and MLflow for reproducible ML workflows
  • Developed comparative evaluation system for spatial clustering algorithms (SpaGCN, Leiden, BANKSY, COVET) with integrated MLflow tracking and interactive Plotly dashboards
  • Achieved 0.85 Silhouette Score improvement in clustering performance with comprehensive visualization and statistical validation
Columbia University - AlQuraishi Lab
Aug 2024 - Present
ML Research Assistant (Prof. Mohammed AlQuraishi)
  • Engineered production-scale genomic classification pipeline using autoregressive Evo model with custom transformer block extraction and MLflow experiment tracking
  • Set up distributed GPU infrastructure for reproducible large-scale genomic inference with automated model versioning
  • Optimized memory-efficient distributed system featuring block-wise compression, float16 precision, and HDF5-based active caching
  • Achieved 10x memory reduction enabling full-genome processing on multi-GPU clusters through manual transformer iteration and parallel batch processing
  • Developed robust cross-validation benchmarking framework with leave-n-species-out evaluation, comparing the classic AMR prediction model Kover against Evo
McGill University - Ding Lab
Dec 2022 - May 2024
Research Assistant (Prof. Jun Ding)
  • Collaborated with Lady Davis Institute, Segal Cancer Centre, and Dartmouth Cancer Center for large-scale data mining of multi-terabyte single-cell datasets
  • Fine-tuned unsupervised models for clustering PBMCs using PCA and UMAP, discovering pathogenic cell subsets and transcriptional signatures
  • Developed CellSexID tool achieving 96% sex prediction accuracy via ensemble modeling (XGBoost, Random Forest, SVM) and Bayesian hyperparameter optimization
  • Designed Variational Autoencoder with integrated Stochastic Variational Inference for spatial transcriptomics data analysis across diverse tissue types
McGill University - Hatami Group
Sep 2023 - Apr 2024
Research Assistant (Prof. Hamed Hatami)
  • Devised randomized group-testing protocol to estimate Hamming distance between two n-bit strings, reducing communication complexity upper bound from O(log n) to O(log log n)
  • Investigated excess-error dependent replicability in agnostic learning, designing algorithms for covering hypothesis classes of finite VC dimension for accurate empirical error approximation
  • Applied adaptive testing algorithms to improve efficiency in randomized Hamming distance estimation with focus on communication-optimal protocols
McGill University - Khadra Lab
Nov 2022 - Jan 2024
Research Assistant (Prof. Anmar Khadra)
  • Implemented advanced probabilistic frameworks in MATLAB to model nanoparticle binding dynamics, correlating IFN-γ dosage and pMHC valence with experimental validation
  • Optimized serial engagement model through Markov Chain Monte Carlo (MCMC) simulations for T-cell activation analysis under geometric constraints
  • Developed MATLAB computational framework leveraging randomized algorithms to calculate binding capacities and visualize TCR surface distribution probabilities for multivalent nanoparticle therapies
Harbin Medical University
May 2021 - Sep 2022
Data Scientist Intern
  • Constructed comprehensive survival analysis framework to estimate the impact of targeting uridine-cytidine kinase 2 on hepatocellular carcinoma immune response, with interactive R nomogram for clinical validation
  • Led pivotal SPARC gene study on ovarian cancer prognosis employing Kaplan-Meier survival analysis with log-rank test validation and GSEA pathway enrichment analysis
Yooden Technology
Oct 2020 - Feb 2021
Data Analyst
  • Engineered real-time streaming data pipeline for energy-saving robots in data centers, ingesting multi-sensor streams with SQL-based storage and ETL transformations
  • Implemented anomaly detection logic and automated monitoring systems for proactive power consumption optimization
  • Built interactive dashboards and automated load-balancing alerts using Python (Pandas, NumPy, Matplotlib) with containerized Docker deployment
  • Integrated scheduling workflows enabling real-time energy optimization and improved system reliability across data center operations

Publications

Huilin Tai, Qian Li, Jingtao Wang, Jiahui Tan, Ryann Lang, Basil J. Petrof, Jun Ding
Cell Reports Methods (Accepted, 2025)
Mingxiao Huo, Jiayi Zhang, Hewei Wang, Jinfeng Xu, Zheyu Chen, Huilin Tai, Ian Yijun Chen
TTODLer Workshop at ICML 2025 (2025)
Pengliang Ji, Chuyang Xiao, Huilin Tai, Mingxiao Huo
ACM Multimedia Conference (2024)
Adam M.R. Groh, Nina Caporicci-Dinucci, Brianna Lu, [...], Huilin Tai, Jun Ding, [...], Jo Anne Stratton
Journal of Neurochemistry (2024)
Xiaorong Guo, Huilin Tai, Xiaoqing Li, Peng Liu, Jin Liu, Shan Yu
Clinical and Experimental Obstetrics & Gynecology (2024)
Dehai Wu, Congyi Zhang, [...], Huilin Tai, [...], Sheng Tai
Cellular & Molecular Biology Letters, 105 (2022)

Education

Columbia University
2024-2025
Master of Computer Science
GPA: 4.0/4.0
McGill University
2020-2024
Bachelor of Statistics and Computer Science
GPA: 3.86/4.0

Recent News

Jan 2025
🎉 CellSexID paper accepted at Cell Reports Methods
Jan 2025
📝 Spec-LLaVA accepted at ICML 2025 TTODLer Workshop
Aug 2024
🎓 Started MS at Columbia University
Jul 2024
📚 T2VBench accepted at ACM Multimedia 2024
May 2024
🎊 Graduated from McGill University
Apr 2024
🧠 Published neuroinflammation research in Journal of Neurochemistry

Honors & Awards

🏆 Mackey-Glass Summer Research Bursary
McGill Faculty of Medicine • April 2023
🎓 Hugh Brock Scholarship
McGill University • September 2020
📊 Dean's Honour List
McGill University • Multiple Semesters

Activities

KAUST
May 2024 - Sep 2024
Research Associate
Designed generative model architecture with Beta-VAE for video editing. Conducted literature analysis of 10+ papers in generative modeling.
McGill University
Sep 2022 - May 2024
Course Assistant
MATH 235 Algebra, MATH 240 Discrete Math, MATH 308 Statistical Learning, MATH 356 Honours Probability. Graded 20+ assignments across multiple courses.

Contact

Email: ht2666@columbia.edu

Phone: (646) 866-9171

Location: New York, NY

Feel free to reach out for research collaborations, opportunities, or just to connect!