About
Hi! I am a second-year Master's student in Computer Science at Columbia University, advised by Mohammed AlQuraishi. I completed my Bachelor's in Statistics and Computer Science at McGill University in May 2024, where I worked with Jun Ding on ML for single-cell genomics and VAEs for spatial transcriptomics; with Anmar Khadra on MCMC methods and probabilistic modeling for binding dynamics simulation; and with Hamed Hatami on ML theory, randomized algorithms, and communication complexity.
My research focuses on advancing foundation models and their efficient deployment across diverse data modalities:
- Efficient large-scale model architectures: Foundation models, self-supervised learning, and speculative decoding techniques for computational acceleration
- Long-context modeling for scientific applications: Large language models for genomic sequences and other high-dimensional biological data
- Multimodal intelligence: Vision-language models, text-to-video generation, and cross-modal understanding
Professional Experience
- Designed comprehensive experimental framework for multi-class response prediction across treatment conditions using PyTorch, Docker, MLflow, and Scikit-learn
- Built CI/CD-ready PyTorch pipeline with memory-optimized preprocessing and automated quality control, deployed on HPC clusters via Docker and MLflow for reproducible ML workflows
- Developed comparative evaluation system for spatial clustering algorithms (SpaGCN, Leiden, BANKSY, COVET) with integrated MLflow tracking and interactive Plotly dashboards
- Achieved 0.85 Silhouette Score improvement in clustering performance with comprehensive visualization and statistical validation
- Engineered production-scale genomic classification pipeline using autoregressive Evo model with custom transformer block extraction and MLflow experiment tracking
- Set up distributed GPU infrastructure for reproducible large-scale genomic inference with automated model versioning
- Optimized memory-efficient distributed system featuring block-wise compression, float16 precision, and HDF5-based active caching
- Achieved 10x memory reduction enabling full-genome processing on multi-GPU clusters through manual transformer iteration and parallel batch processing
- Developed robust cross-validation benchmarking framework with leave-n-species-out evaluation comparing classic AMR prediction model Kover with Evo
- Collaborated with Lady Davis Institute, Segal Cancer Centre, and Dartmouth Cancer Center for large-scale data mining of multi-terabyte single-cell datasets
- Fine-tuned unsupervised models for clustering PBMCs using PCA and UMAP, discovering pathogenic cell subsets and transcriptional signatures
- Developed CellSexID tool achieving 96% sex prediction accuracy via ensemble modeling (XGBoost, Random Forest, SVM) and Bayesian hyperparameter optimization
- Designed Variational Autoencoder with integrated Stochastic Variational Inference for spatial transcriptomics data analysis across diverse tissue types
- Devised randomized group-testing protocol to estimate Hamming distance between two n-bit strings, reducing the communication complexity upper bound from O(log n) to O(log log n)
- Investigated excess-error-dependent replicability in agnostic learning, designing algorithms for covering hypothesis classes of finite VC dimension for accurate empirical error approximation
- Applied adaptive testing algorithms to improve efficiency in randomized Hamming distance estimation with focus on communication-optimal protocols
- Implemented advanced probabilistic frameworks in MATLAB to model nanoparticle binding dynamics, correlating IFN-γ dosage and pMHC valence with experimental validation
- Optimized serial engagement model through Markov Chain Monte Carlo (MCMC) simulations for T-cell activation analysis under geometric constraints
- Developed MATLAB computational framework leveraging randomized algorithms to calculate binding capacities and visualize TCR surface distribution probabilities for multivalent nanoparticle therapies
- Constructed comprehensive survival analysis framework to estimate the impact of targeting uridine-cytidine kinase 2 on hepatocellular carcinoma immune response, with interactive R nomogram for clinical validation
- Led pivotal SPARC gene study on ovarian cancer prognosis employing Kaplan-Meier survival analysis with log-rank test validation and GSEA pathway enrichment analysis
- Engineered real-time streaming data pipeline for energy-saving robots in data centers, ingesting multi-sensor streams with SQL-based storage and ETL transformations
- Implemented anomaly detection logic and automated monitoring systems for proactive power consumption optimization
- Built interactive dashboards and automated load-balancing alerts using Python (Pandas, NumPy, Matplotlib) with containerized Docker deployment
- Integrated scheduling workflows enabling real-time energy optimization and improved system reliability across data center operations
Publications
Education
Recent News
Honors & Awards
Activities
Contact
Email: ht2666@columbia.edu
Phone: (646) 866-9171
Location: New York, NY
Feel free to reach out for research collaborations, opportunities, or just to connect!