Mohsen Hariri

AI Scientist

I’m an AI Scientist in the Department of Computer and Data Sciences at Case Western Reserve University. I work on improving the reasoning capabilities and efficiency of large language models.

Teaching Information

Machine learning, large language models, Bayesian statistics, linear algebra

Education

M.S. in Computer and Data Sciences, Case Western Reserve University, 2025
B.S. in Electrical Engineering, University of Tehran

I hold a bachelor’s degree in electrical engineering and a master’s degree in computer and data sciences. I currently work as an AI Scientist in the Department of Computer and Data Sciences at Case Western Reserve University.

Research interests: statistical methods for evaluating and ranking language models; efficient inference and systems for large models; trustworthy and secure foundation-model pipelines.

Current Projects

Reliable Evaluation and Ranking of Reasoning LLMs

Develops Bayesian, geometric, and ranking-based evaluation methods for reasoning models under test-time scaling, with a focus on uncertainty quantification, ranking stability, and compute-efficient benchmarking.

Efficient LLM Inference, Compression, and Systems

Builds methods for KV-cache quantization, lossless model compression, and online anomaly detection to make LLM inference and scientific infrastructure more efficient and reliable.

Trustworthy and Secure Foundation Models

Studies robust evaluation, failure analysis, and security risks in modern foundation-model pipelines, including stochastic evaluation protocols and RAG/LoRA attack surfaces.

Medical Imaging, Multimodal AI, and Clinical Applications

Applies machine learning to medical imaging, visual grounding, segmentation, survival modeling, and radiomics in collaboration with clinical and scientific partners.

Ranking Reasoning LLMs under Test-Time Scaling

2026

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary

ACL 2026 Main

This paper studies how to rank reasoning large language models when evaluation uses multiple stochastic samples per prompt under test-time scaling. It formalizes dense benchmark ranking in this repeated-trial setting and introduces Scorio, a library that implements Bayesian, paired-comparison, psychometric, voting, and spectral ranking methods. Across twenty reasoning models and four Olympiad-style math benchmarks, the study shows that many full-trial rankings closely match a Bayesian gold standard while low-budget methods can be less stable. The results provide practical guidance for reliable model ranking under both high- and low-budget evaluation settings.

Artificial Intelligence, Trustworthy AI

arXiv Code

BibTeX Citation

@inproceedings{hariri2026ranking,
  title={Ranking Reasoning LLMs under Test-Time Scaling},
  author={Mohsen Hariri and Michael Hinczewski and Jing Ma and Vipin Chaudhary},
  booktitle={ACL 2026 Main},
  year={2026},
  url={https://arxiv.org/abs/2603.10960}
}
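
As a rough illustration of the repeated-trial ranking setting described in the entry above, the sketch below ranks models by the posterior mean of a Beta-Bernoulli model fitted to pass/fail outcomes from several sampled trials per prompt. It is a minimal example under stated assumptions, not the Scorio API or the paper's method: the function names, the uniform Beta(1, 1) prior, the pooling of all trials into a single success rate, and the toy data are all hypothetical.

```python
import math


def beta_posterior(outcomes, alpha=1.0, beta=1.0):
    """Return (mean, std) of a Beta posterior over a model's accuracy.

    outcomes: list of per-prompt lists of 0/1 results, one entry per trial.
    Pooling every trial into one Bernoulli rate is a simplifying assumption.
    """
    successes = sum(sum(trials) for trials in outcomes)
    total = sum(len(trials) for trials in outcomes)
    a = alpha + successes
    b = beta + (total - successes)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


def rank_models(results):
    """Sort model names by posterior mean accuracy, best first."""
    scored = {name: beta_posterior(outcomes) for name, outcomes in results.items()}
    return sorted(scored, key=lambda name: scored[name][0], reverse=True)


# Hypothetical toy data: 3 prompts, 4 sampled trials per prompt.
results = {
    "model_a": [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1]],
    "model_b": [[0, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1]],
    "model_c": [[1, 1, 1, 1], [1, 1, 0, 1], [1, 1, 1, 1]],
}

ranking = rank_models(results)
for name in ranking:
    mean, std = beta_posterior(results[name])
    print(f"{name}: posterior mean {mean:.3f}, posterior std {std:.3f}")
```

The posterior standard deviation gives a simple handle on ranking stability: when two models' means differ by less than their posterior spread, a low-budget ranking between them is likely to flip under resampling.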

Quantize What Counts: More for Keys, Less for Values

2026

Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary

ACL 2026 Findings

This work studies asymmetric KV-cache quantization for large language models and shows that key tensors carry more information than value tensors. The analysis motivates allocating more bits and stronger outlier handling to keys than to values, instead of quantizing both sides identically. Experiments show that key-favored bit allocation preserves much more accuracy at the same memory budget. The paper provides both theoretical motivation and practical guidance for more efficient LLM inference.

Artificial Intelligence, HPC

arXiv Code View

BibTeX Citation

@inproceedings{hariri2026quantize,
  title={Quantize What Counts: More for Keys, Less for Values},
  author={Mohsen Hariri and Alan Luo and Weicong Chen and Shaochen Zhong and Tianyi Zhang and Qifan Wang and Xia Hu and Xiaotian Han and Vipin Chaudhary},
  booktitle={ACL 2026 Findings},
  year={2026},
  url={https://arxiv.org/abs/2502.15075}
}
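
The idea of asymmetric bit allocation in the entry above can be sketched with a toy uniform quantizer that gives keys and values different bit widths under the same total budget. This is only an illustration under stated assumptions, not the paper's method or its outlier handling: the min-max quantizer, the function names, and the synthetic data (keys drawn with a wider spread than values) are hypothetical choices made for the example.

```python
import random


def quantize_dequantize(xs, bits):
    """Uniform min-max quantization to 2**bits levels, then dequantize."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return list(xs)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((x - lo) / step) * step for x in xs]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kv_errors(keys, values, key_bits, value_bits):
    """Reconstruction error of keys and values at the given bit widths."""
    k_err = mse(keys, quantize_dequantize(keys, key_bits))
    v_err = mse(values, quantize_dequantize(values, value_bits))
    return k_err, v_err


# Synthetic stand-ins for one KV-cache channel (assumption: keys vary more).
random.seed(0)
keys = [random.gauss(0.0, 4.0) for _ in range(1024)]
values = [random.gauss(0.0, 1.0) for _ in range(1024)]

# Three allocations with the same total of 6 bits per key/value pair.
for key_bits, value_bits in [(3, 3), (4, 2), (2, 4)]:
    k_err, v_err = kv_errors(keys, values, key_bits, value_bits)
    print(f"K{key_bits}/V{value_bits}: key MSE {k_err:.4f}, value MSE {v_err:.4f}")
```

Raw reconstruction error is only a proxy here; what matters in practice is downstream accuracy at a fixed memory budget, which this toy does not measure.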

Scorio.jl: A Julia package for ranking stochastic responses

2026

Mohsen Hariri, Michael Hinczewski, Vipin Chaudhary

JuliaCon 2026

Scorio.jl is a Julia package for evaluating and ranking systems from repeated stochastic responses on shared tasks. It provides a common tensor-based interface for direct score-based, pairwise, psychometric, voting, graph, and listwise ranking methods. The package supports methodological studies of ranking stability as well as day-to-day leaderboard construction. It makes ranking under repeated stochastic observation easier to analyze across different assumptions and ranking families.

Artificial Intelligence, HPC

arXiv Code

BibTeX Citation

@inproceedings{hariri2026scoriojl,
  title={Scorio.jl: A Julia package for ranking stochastic responses},
  author={Mohsen Hariri and Michael Hinczewski and Vipin Chaudhary},
  booktitle={JuliaCon 2026},
  year={2026},
  url={https://arxiv.org/abs/2603.14103}
}

Empirical evaluation of variability and multi-institutional generalizability of deep learning survival models: Application to renal cancer CT scans

2026

B. Flannery, T. DeSilvio, Mohsen Hariri, A. Sadri, N. Heller, C. Weight, S. Viswanath

Computers in Biology and Medicine

This paper systematically studies how methodological choices affect the robustness and external generalization of CT-based deep learning survival models for renal cancer. It examines data partitioning, data order, random initialization, and augmentation strategies on a multi-institutional cohort spanning nine institutions. The study finds that covariate-balanced partitioning and carefully chosen augmentations materially improve external validation performance, while initialization meaningfully affects variance. These results provide practical recommendations for building more stable and generalizable survival models from medical imaging.

Medical Imaging, Computer Vision, Artificial Intelligence

DOI Code

BibTeX Citation

@article{flannery2026empirical,
  title={Empirical evaluation of variability and multi-institutional generalizability of deep learning survival models: Application to renal cancer CT scans},
  author={B. Flannery and T. DeSilvio and Mohsen Hariri and A. Sadri and N. Heller and C. Weight and S. Viswanath},
  journal={Computers in Biology and Medicine},
  year={2026},
  doi={10.1016/j.compbiomed.2026.111603},
  volume={206},
  pages={111603}
}

Medical Image Spatial Grounding with Semantic Sampling

2026

Andrew Seohwan Yu, Mohsen Hariri, Kunio Nakamura, Mingrui Yang, Xiaojuan Li, Vipin Chaudhary

MICCAI 2026 (under review)

This work studies spatial grounding for vision-language models in 3D medical imaging, where anatomy, modality, slice direction, and coordinate systems create unique challenges. It introduces MIS-Ground, a benchmark for analyzing failure modes in medical image spatial grounding, and MIS-SemSam, an inference-time semantic sampling method that improves grounding accuracy without retraining. The paper evaluates how visual and textual prompting choices influence grounding performance across clinical imaging settings. It advances reproducible evaluation and practical improvement of medical VLM grounding.

Medical Imaging, Computer Vision, Artificial Intelligence

arXiv

BibTeX Citation

@misc{yu2026medical,
  title={Medical Image Spatial Grounding with Semantic Sampling},
  author={Andrew Seohwan Yu and Mohsen Hariri and Kunio Nakamura and Mingrui Yang and Xiaojuan Li and Vipin Chaudhary},
  year={2026},
  note={Under review at MICCAI 2026},
  url={https://arxiv.org/abs/2603.14579}
}

Sweeping Promptable Spoofs under the DirtyRAG

2026

Shaochen Zhong, Jiamu Zhang, Hoang Anh Duy Le, Wenya Xie, Yifan Lu, Xintong Sun, Mohsen Hariri, Hongyi Liu, Guanchu Wang, Zhaozhuo Xu, Zirui Liu, Shuai Xu, Ning Xie, Li Li, Rui Chen, Ruixiang Tang, Xia Hu, Vipin Chaudhary

ICML 2026 (under review)

This paper studies security vulnerabilities in retrieval-augmented generation through DirtyRAG, a query-blind benign-passage attack that can be steered by prompting. It shows that promptable spoof passages remain effective against strong defenses and exposes a practical attack surface for real-world RAG systems. The work also introduces RAG-ATag, a benchmark for evaluating RAG security under these attack conditions. It highlights the need for more robust retrieval and generation defenses in deployed LLM systems.

Artificial Intelligence, Trustworthy AI

BibTeX Citation

@misc{zhong2026dirtyrag,
  title={Sweeping Promptable Spoofs under the DirtyRAG},
  author={Shaochen Zhong and Jiamu Zhang and Hoang Anh Duy Le and Wenya Xie and Yifan Lu and Xintong Sun and Mohsen Hariri and Hongyi Liu and Guanchu Wang and Zhaozhuo Xu and Zirui Liu and Shuai Xu and Ning Xie and Li Li and Rui Chen and Ruixiang Tang and Xia Hu and Vipin Chaudhary},
  year={2026},
  note={Under review at ICML 2026}
}

K^4-Serve: Robust Streaming Log Anomaly Detection for HPC & AI Infrastructure

2026

W. Chen, V. Singh, Z. Rahmani, D. Ganguly, Mohsen Hariri, S. Maxwell, S. Gajurel, E. Dragowsky, H. Djohari, Vipin Chaudhary

ACM PEARC 2026 (under review)

K^4-Serve operationalizes the K^4 framework for streaming anomaly detection on production HPC and AI infrastructure logs. It combines Kafka-based ingestion, versioned normalization, sliding-window scoring, retraining, and observability features to support robust real-world deployment. The system achieves stable deployment on real HPC logs with near-perfect event-level detection and only one false alert in the reported study. The work bridges anomaly-detection methodology and production cyberinfrastructure practice.

HPC, Artificial Intelligence

BibTeX Citation

@misc{chen2026k4serve,
  title={K^4-Serve: Robust Streaming Log Anomaly Detection for HPC \& AI Infrastructure},
  author={W. Chen and V. Singh and Z. Rahmani and D. Ganguly and Mohsen Hariri and S. Maxwell and S. Gajurel and E. Dragowsky and H. Djohari and Vipin Chaudhary},
  year={2026},
  note={Under review at ACM PEARC 2026}
}

Categorical Evaluation of LLMs under Test-Time Scaling

2026

Mohsen Hariri, H. S. Hillsdownley, Vipin Chaudhary

COLM 2026 (under review)

This work argues that binary pass-based metrics are too coarse for evaluating reasoning models under test-time scaling. It introduces a categorical Bayesian framework that scores rubric-defined outcomes with uncertainty rather than collapsing all outputs into pass-or-fail labels. The study shows that lightweight runtime signals can support accurate categorical evaluation without relying on a judge model and that rubric design can materially change model rankings. The paper extends uncertainty-aware LLM evaluation beyond binary correctness.

Artificial Intelligence, Trustworthy AI

BibTeX Citation

@misc{hariri2026categorical,
  title={Categorical Evaluation of LLMs under Test-Time Scaling},
  author={Mohsen Hariri and H. S. Hillsdownley and Vipin Chaudhary},
  year={2026},
  note={Under review at COLM 2026}
}
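
To make the contrast between binary and categorical scoring in the entry above concrete, the sketch below computes a posterior-mean rubric score from category counts under a symmetric Dirichlet prior. It is a minimal example under assumptions, not the paper's framework: the category names, rubric weights, prior strength, and function name are hypothetical.

```python
def expected_rubric_score(counts, weights, prior=1.0):
    """Posterior-mean rubric score under a symmetric Dirichlet prior.

    counts:  dict mapping category -> number of sampled outputs in it.
    weights: dict mapping category -> numeric score the rubric assigns.
    prior:   pseudo-count added to every category.
    """
    total = sum(counts.values()) + prior * len(counts)
    return sum(weights[c] * (counts[c] + prior) / total for c in counts)


# Hypothetical outcome counts for one model over 10 sampled outputs.
counts = {"correct": 6, "partially_correct": 3, "wrong": 1}

# A three-level rubric versus a pass-or-fail collapse of the same categories.
graded = {"correct": 1.0, "partially_correct": 0.5, "wrong": 0.0}
binary = {"correct": 1.0, "partially_correct": 0.0, "wrong": 0.0}

print("graded rubric score:", round(expected_rubric_score(counts, graded), 4))
print("binary rubric score:", round(expected_rubric_score(counts, binary), 4))
```

Changing only the weights changes the score for the same counts, which is one simple way to see how rubric design can alter a ranking between models with different outcome profiles.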

Geom@k: Fast to Converge, Slow to Drift

2026

Mohsen Hariri, Vipin Chaudhary

COLM 2026 (under review)

This paper studies evaluation metrics for test-time scaling by separating answer discovery from repeated correctness. It derives Geom@k and the broader GeoSpectrum@K family from a common hypergeometric view of fixed-budget metrics. Across aggregate settings, Geom@2 provides a strong balance of fast convergence and low ranking drift relative to alternative summaries. The work offers a compute-aware perspective on stable evaluation under repeated sampling.

Artificial Intelligence, Trustworthy AI

BibTeX Citation

@misc{hariri2026geomk,
  title={Geom@k: Fast to Converge, Slow to Drift},
  author={Mohsen Hariri and Vipin Chaudhary},
  year={2026},
  note={Under review at COLM 2026}
}
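
The hypergeometric view of fixed-budget metrics mentioned in the entry above can be illustrated with the widely used unbiased pass@k estimator: given n sampled answers of which c are correct, it is the probability that a size-k subset drawn without replacement contains at least one correct answer. The sketch shows only this standard estimator; it is not the definition of Geom@k or GeoSpectrum@K, which this page does not state, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that k of n samples, drawn without replacement,
    include at least one of the c correct ones."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the subsets with no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical budget: 16 samples for one prompt, 4 of them correct.
for k in (1, 2, 4, 8):
    print(f"pass@{k} = {pass_at_k(16, 4, k):.4f}")
```

Metrics of this kind reward finding a correct answer at least once; separating that from how often the answer is repeated is the distinction the abstract draws between answer discovery and repeated correctness.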

Using AI to Increase Efficiency of Multilingual Test Materials: Spanish BEL Sentences

2026

C. López, L. Calandruccio, E. Buss, Mohsen Hariri, Vipin Chaudhary

Work in progress

This work-in-progress explores how AI can improve the efficiency of creating multilingual auditory test materials, with a focus on Spanish BEL sentences. The project investigates workflow acceleration and quality support for multilingual assessment design. It sits at the intersection of language technology, hearing research, and educational test development. The aim is to reduce manual burden while preserving the validity of test materials.

Artificial Intelligence

BibTeX Citation

@misc{lopez2026spanishbel,
  title={Using AI to Increase Efficiency of Multilingual Test Materials: Spanish BEL Sentences},
  author={C. López and L. Calandruccio and E. Buss and Mohsen Hariri and Vipin Chaudhary},
  year={2026},
  note={Work in progress}
}

Integrating self-configuring and foundational deep learning segmentation models for identifying the anal sphincter complex and perianal fistulae on pelvic MRI

2025

A. Sridharan, T. DeSilvio, B. Flannery, Mohsen Hariri, R. Macbeth, B. Parker, A. Elumalai, J. Devi, A. Lovato, C. Maneiro, A. George, A. Ganapath, P. Deepak, D. H. Ballard, S. E. Viswanath

SPIE Medical Imaging 2025

This paper introduces an automated pelvic MRI segmentation pipeline that combines nnU-Net with MedSAM for identifying perianal fistulae and the anal sphincter complex. The approach leverages self-configuring and foundation-model segmentation components to improve robustness on a difficult clinical anatomy problem. It is designed to support interventional guidance and surgical planning in Crohn's disease. The work demonstrates how task-specific and foundation-model methods can be integrated for clinically useful MRI analysis.

Medical Imaging, Computer Vision, Artificial Intelligence

DOI

BibTeX Citation

@inproceedings{sridharan2025pelvic,
  title={Integrating self-configuring and foundational deep learning segmentation models for identifying the anal sphincter complex and perianal fistulae on pelvic MRI},
  author={A. Sridharan and T. DeSilvio and B. Flannery and Mohsen Hariri and R. Macbeth and B. Parker and A. Elumalai and J. Devi and A. Lovato and C. Maneiro and A. George and A. Ganapath and P. Deepak and D. H. Ballard and S. E. Viswanath},
  booktitle={SPIE Medical Imaging 2025},
  year={2025},
  doi={10.1117/12.3048987}
}

Radiomics to Detect Inflammation and Fibrosis on Magnetic Resonance Enterography in Stricturing Crohn’s Disease

2024

P. Chirra, J. Sleiman, N. Gandhi, I. Gordon, Mohsen Hariri, M. Baker, R. Ottichilo, D. Bruining, J. Kurowski, S. Viswanath, F. Rieder

Journal of Crohn's and Colitis

This study develops radiomics-based machine learning models to characterize inflammation and fibrosis in Crohn's disease strictures from magnetic resonance enterography. The models improve diagnostic discrimination relative to radiologist visual scoring alone and show additional value when combined with expert assessment. The work addresses an important unmet need in noninvasive characterization of stricturing disease. It supports more quantitative and reproducible imaging-based assessment in inflammatory bowel disease.

Medical Imaging, Computer Vision, Artificial Intelligence

DOI

BibTeX Citation

@article{chirra2024crohns,
  title={Radiomics to Detect Inflammation and Fibrosis on Magnetic Resonance Enterography in Stricturing Crohn’s Disease},
  author={P. Chirra and J. Sleiman and N. Gandhi and I. Gordon and Mohsen Hariri and M. Baker and R. Ottichilo and D. Bruining and J. Kurowski and S. Viswanath and F. Rieder},
  journal={Journal of Crohn's and Colitis},
  year={2024},
  doi={10.1093/ecco-jcc/jjae073},
  volume={18},
  number={10},
  pages={1660--1671}
}

Spatial attention wavelon network (SpAWN) for survival-based risk stratification in kidney cancers via CT

2024

B. Flannery, T. DeSilvio, A. Sadri, Mohsen Hariri, E. Remer, J. Nguyen, S. Viswanath

SPIE Medical Imaging 2024

SpAWN introduces a survival-risk stratification model for kidney cancer CT that combines spatial attention with wavelon activations. The design aims to improve interpretability and cross-cohort generalization for imaging-based survival prediction. The paper demonstrates that architectural choices tailored to spatial context can strengthen risk modeling from pre-operative scans. It contributes to clinically relevant prognostic modeling in oncologic imaging.

Medical Imaging, Computer Vision, Artificial Intelligence

DOI

BibTeX Citation

@inproceedings{flannery2024spawn,
  title={Spatial attention wavelon network (SpAWN) for survival-based risk stratification in kidney cancers via CT},
  author={B. Flannery and T. DeSilvio and A. Sadri and Mohsen Hariri and E. Remer and J. Nguyen and S. Viswanath},
  booktitle={Medical Imaging 2024: Computer-Aided Diagnosis},
  year={2024},
  doi={10.1117/12.3008727}
}

Federated Image Quality Assessment of Prostate MRI Scans in a Multi-institutional Setting

2024

Mohsen Hariri, P. Chirra, M. Patel, T. T. Einat, I. Dayan, A. Tonetti, Y. Baror, T. Barrett, N. Sushentsev, J. D. Kaggie, S. Yuan, D. Wu, B. Yu, Z. Lyu, C. Hsu, W. Wang, S. Krishnamurthi, S. E. Viswanath

AACR Annual Meeting 2024

This work addresses image-quality variability in prostate MRI across multiple institutions using a federated analysis setting. It studies how artifact-related quality differences can affect the reliability and portability of downstream machine learning models. The abstract highlights the importance of multi-institutional quality assessment before model development and deployment. It contributes to more reliable imaging AI in federated and heterogeneous clinical environments.

Medical Imaging, Artificial Intelligence

DOI

BibTeX Citation

@inproceedings{hariri2024federated,
  title={Federated Image Quality Assessment of Prostate MRI Scans in a Multi-institutional Setting},
  author={Mohsen Hariri and P. Chirra and M. Patel and T. T. Einat and I. Dayan and A. Tonetti and Y. Baror and T. Barrett and N. Sushentsev and J. D. Kaggie and S. Yuan and D. Wu and B. Yu and Z. Lyu and C. Hsu and W. Wang and S. Krishnamurthi and S. E. Viswanath},
  booktitle={Proceedings of the American Association for Cancer Research Annual Meeting 2024},
  year={2024},
  doi={10.1158/1538-7445.AM2024-2344},
  note={Abstract 2344}
}

Deep Learning Based Risk Stratification of Pre-operative CT Scans is Prognostic of Overall Survival in Kidney Cancers

2024

B. Flannery, Mohsen Hariri, T. DeSilvio, A. Sadri, J. Nguyen, E. M. Remer, S. Krishnamurthi, S. E. Viswanath

AACR Annual Meeting 2024

This abstract reports a deep learning model that uses pre-operative CT scans to predict overall survival in kidney cancer. The model improves pre-operative risk assessment and offers prognostic value beyond standard clinicopathological factors. It represents an early step toward multi-institutional imaging biomarkers for survival-based treatment planning. The work supports broader use of CT-derived representations for oncologic prognostication.

Medical Imaging, Computer Vision, Artificial Intelligence

DOI

BibTeX Citation

@inproceedings{flannery2024kidneyrisk,
  title={Deep Learning Based Risk Stratification of Pre-operative CT Scans is Prognostic of Overall Survival in Kidney Cancers},
  author={B. Flannery and Mohsen Hariri and T. DeSilvio and A. Sadri and J. Nguyen and E. M. Remer and S. Krishnamurthi and S. E. Viswanath},
  booktitle={Proceedings of the American Association for Cancer Research Annual Meeting 2024},
  year={2024},
  doi={10.1158/1538-7445.AM2024-7379},
  note={Abstract 7379}
}

Intra-and Peri-tumoral Radiomic Features are Predictive of Pathologic Response to Multiple Neoadjuvant Therapy Regimen in Rectal Cancers via Pre-treatment MRI

2024

L. Bao, T. DeSilvio, B. N. Parker, Mohsen Hariri, P. Chirra, M. Labbad, S. Tang, G. M. O'Connor, E. Steinhagen, J. L. Miller-Ocuin, A. Gupta, E. L. Marderstein, A. Carroll, M. Crittenden, M. J. Gough, S. Krishnamurthi, K. H. Young, S. E. Viswanath

AACR Annual Meeting 2024

This study analyzes intra-tumoral and peri-tumoral radiomic features from pretreatment MRI to predict pathologic response in rectal cancer. It evaluates whether quantitative imaging phenotypes can identify responders across multiple neoadjuvant treatment regimens. The work aims to improve patient stratification beyond traditional staging and biomarker approaches. It contributes to noninvasive response prediction in rectal cancer management.

Medical Imaging, Computer Vision, Artificial Intelligence

DOI

BibTeX Citation

@inproceedings{bao2024rectal,
  title={Intra-and Peri-tumoral Radiomic Features are Predictive of Pathologic Response to Multiple Neoadjuvant Therapy Regimen in Rectal Cancers via Pre-treatment MRI},
  author={L. Bao and T. DeSilvio and B. N. Parker and Mohsen Hariri and P. Chirra and M. Labbad and S. Tang and G. M. O'Connor and E. Steinhagen and J. L. Miller-Ocuin and A. Gupta and E. L. Marderstein and A. Carroll and M. Crittenden and M. J. Gough and S. Krishnamurthi and K. H. Young and S. E. Viswanath},
  booktitle={Proceedings of the American Association for Cancer Research Annual Meeting 2024},
  year={2024},
  doi={10.1158/1538-7445.AM2024-2582},
  note={Abstract 2582}
}

Virtual Reality as an Acute Pain Reliever During Laceration Repair in Emergency Departments: A Randomized Controlled Trial

2023

M. Rezai, L. Namdari, D. Farsi, N. Ashayeri, M. Naghshbandi, Mohsen Hariri, R. Ghafoury

Saudi Journal of Emergency Medicine

This randomized controlled trial investigates whether virtual reality can reduce acute pain during laceration repair in emergency departments. Adult patients undergoing repair were studied in a real clinical setting to assess pain relief during the procedure. The work explores immersive VR as a practical non-pharmacologic intervention for emergency care. It adds clinical evidence on the use of interactive technology for procedural pain management.

Artificial Intelligence

DOI

BibTeX Citation

@article{rezai2023vr,
  title={Virtual Reality as an Acute Pain Reliever During Laceration Repair in Emergency Departments: A Randomized Controlled Trial},
  author={M. Rezai and L. Namdari and D. Farsi and N. Ashayeri and M. Naghshbandi and Mohsen Hariri and R. Ghafoury},
  journal={Saudi Journal of Emergency Medicine},
  year={2023},
  doi={10.21203/rs.3.rs-3494621/v1},
  note={Preprint version available on Research Square}
}

Mentors

Vipin Chaudhary, PhD

Kevin J. Kranzusch Chair, Computer and Data Sciences,
Center for PEATAI, Case School of Engineering
