Case Western Reserve University
Mohsen Hariri

AI Scientist

Teaching Information

Machine learning, large language models, Bayesian statistics, linear algebra

Education

  • M.S. in Computer and Data Sciences, Case Western Reserve University, 2025
  • B.S. in Electrical Engineering, University of Tehran

Current Projects

Reliable Evaluation and Ranking of Reasoning LLMs

Develops Bayesian, geometric, and ranking-based evaluation methods for reasoning models under test-time scaling, with a focus on uncertainty quantification, ranking stability, and compute-efficient benchmarking.

Efficient LLM Inference, Compression, and Systems

Builds methods for KV-cache quantization, lossless model compression, and online anomaly detection to make LLM inference and scientific infrastructure more efficient and reliable.

Trustworthy and Secure Foundation Models

Studies robust evaluation, failure analysis, and security risks in modern foundation-model pipelines, including stochastic evaluation protocols and RAG/LoRA attack surfaces.

Medical Imaging, Multimodal AI, and Clinical Applications

Applies machine learning to medical imaging, visual grounding, segmentation, survival modeling, and radiomics in collaboration with clinical and scientific partners.

Publications

Ranking Reasoning LLMs under Test-Time Scaling

2026

ACL 2026 Main

This paper studies how to rank reasoning large language models when evaluation uses multiple stochastic samples per prompt under test-time scaling. It formalizes dense benchmark ranking in this repeated-trial setting and introduces Scorio, a library that implements Bayesian, paired-comparison, psychometric, voting, and spectral ranking methods. Across twenty reasoning models and four Olympiad-style math benchmarks, the study shows that many full-trial rankings closely match a Bayesian gold standard while low-budget methods can be less stable. The results provide practical guidance for reliable model ranking under both high- and low-budget evaluation settings.

Artificial Intelligence Trustworthy AI
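
The repeated-trial setting described for "Ranking Reasoning LLMs under Test-Time Scaling" can be illustrated with a minimal sketch. It is not Scorio's API and not the paper's gold-standard method; it assumes binary correctness outcomes, a uniform Beta(1, 1) prior per prompt, and a hypothetical function name `rank_models`.

```python
import numpy as np


def rank_models(outcomes: np.ndarray):
    """Rank models from repeated binary outcomes.

    outcomes has shape (models, prompts, trials) with entries in {0, 1}.
    Assumption for illustration: each (model, prompt) success rate gets a
    uniform Beta(1, 1) prior, so its posterior mean is (s + 1) / (n + 2).
    """
    trials = outcomes.shape[2]
    successes = outcomes.sum(axis=2)                 # (models, prompts)
    per_prompt = (successes + 1.0) / (trials + 2.0)  # posterior means
    scores = per_prompt.mean(axis=1)                 # one score per model
    order = np.argsort(-scores, kind="stable")       # best model first
    return order, scores


# Synthetic data: 3 models, 30 prompts, 8 stochastic samples per prompt.
rng = np.random.default_rng(0)
true_rates = np.array([0.2, 0.8, 0.5]).reshape(3, 1, 1)
samples = (rng.random((3, 30, 8)) < true_rates).astype(int)
order, scores = rank_models(samples)
print("ranking (best first):", order.tolist())
print("scores:", np.round(scores, 3).tolist())
```

Rerunning the sketch with fewer trials per prompt is one simple way to see how a low evaluation budget can make the resulting order less stable.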

Quantize What Counts: More for Keys, Less for Values

2026

ACL 2026 Findings

This work studies asymmetric KV-cache quantization for large language models and shows that key tensors carry more information than value tensors. The analysis motivates allocating more bits and stronger outlier handling to keys than to values, instead of quantizing both sides identically. Experiments show that key-favored bit allocation preserves much more accuracy at the same memory budget. The paper provides both theoretical motivation and practical guidance for more efficient LLM inference.

Artificial Intelligence HPC
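
The idea of key-favored bit allocation from "Quantize What Counts" can be sketched as follows. This is an illustrative toy, not the paper's method: it assumes simple per-row min-max uniform quantization, random tensors in place of a real KV cache, and hypothetical helper names (`quantize_dequantize`, `kv_error`).

```python
import numpy as np


def quantize_dequantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Per-row uniform min-max quantization followed by dequantization."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    # Guard against constant rows, where the range is zero.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((x - lo) / scale)   # integer codes in [0, levels]
    return codes * scale + lo


def kv_error(keys, values, key_bits, value_bits):
    """Mean squared reconstruction error for keys and values separately."""
    k_hat = quantize_dequantize(keys, key_bits)
    v_hat = quantize_dequantize(values, value_bits)
    return (float(np.mean((keys - k_hat) ** 2)),
            float(np.mean((values - v_hat) ** 2)))


# Toy tensors standing in for one attention head's cache (assumption).
rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 64))
values = rng.normal(size=(16, 64))

# Two allocations with the same total budget of 6 bits per key/value pair.
print("symmetric 3/3 bits:", kv_error(keys, values, 3, 3))
print("key-favored 4/2 bits:", kv_error(keys, values, 4, 2))
```

Reconstruction error alone does not show which allocation preserves model accuracy; the sketch only makes the memory trade-off concrete, since both settings use the same number of bits overall.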

Scorio.jl: A Julia package for ranking stochastic responses

2026

JuliaCon 2026

Scorio.jl is a Julia package for evaluating and ranking systems from repeated stochastic responses on shared tasks. It provides a common tensor-based interface for direct score-based, pairwise, psychometric, voting, graph, and listwise ranking methods. The package supports methodological studies of ranking stability as well as day-to-day leaderboard construction. It makes ranking under repeated stochastic observation easier to analyze across different assumptions and ranking families.

Artificial Intelligence HPC

Empirical evaluation of variability and multi-institutional generalizability of deep learning survival models: Application to renal cancer CT scans

2026

Computers in Biology and Medicine

This paper systematically studies how methodological choices affect the robustness and external generalization of CT-based deep learning survival models for renal cancer. It examines data partitioning, data order, random initialization, and augmentation strategies on a multi-institutional cohort spanning nine institutions. The study finds that covariate-balanced partitioning and carefully chosen augmentations materially improve external validation performance, while initialization meaningfully affects variance. These results provide practical recommendations for building more stable and generalizable survival models from medical imaging.

Medical Imaging Computer Vision Artificial Intelligence

Medical Image Spatial Grounding with Semantic Sampling

2026

MICCAI 2026 (under review)

This work studies spatial grounding for vision-language models in 3D medical imaging, where anatomy, modality, slice direction, and coordinate systems create unique challenges. It introduces MIS-Ground, a benchmark for analyzing failure modes in medical image spatial grounding, and MIS-SemSam, an inference-time semantic sampling method that improves grounding accuracy without retraining. The paper evaluates how visual and textual prompting choices influence grounding performance across clinical imaging settings. It advances reproducible evaluation and practical improvement of medical VLM grounding.

Medical Imaging Computer Vision Artificial Intelligence

Sweeping Promptable Spoofs under the DirtyRAG

2026

ICML 2026 (under review)

This paper studies security vulnerabilities in retrieval-augmented generation through DirtyRAG, a query-blind benign-passage attack that can be steered by prompting. It shows that promptable spoof passages remain effective against strong defenses and exposes a practical attack surface for real-world RAG systems. The work also introduces RAG-ATag, a benchmark for evaluating RAG security under these attack conditions. It highlights the need for more robust retrieval and generation defenses in deployed LLM systems.

Artificial Intelligence Trustworthy AI

K^4-Serve: Robust Streaming Log Anomaly Detection for HPC & AI Infrastructure

2026

ACM PEARC 2026 (under review)

K^4-Serve operationalizes the K^4 framework for streaming anomaly detection on production HPC and AI infrastructure logs. It combines Kafka-based ingestion, versioned normalization, sliding-window scoring, retraining, and observability features to support robust real-world deployment. The system achieves stable deployment on real HPC logs with near-perfect event-level detection and only one false alert in the reported study. The work bridges anomaly-detection methodology and production cyberinfrastructure practice.

HPC Artificial Intelligence

Categorical Evaluation of LLMs under Test-Time Scaling

2026

COLM 2026 (under review)

This work argues that binary pass-based metrics are too coarse for evaluating reasoning models under test-time scaling. It introduces a categorical Bayesian framework that scores rubric-defined outcomes with uncertainty rather than collapsing all outputs into pass-or-fail labels. The study shows that lightweight runtime signals can support accurate categorical evaluation without relying on a judge model and that rubric design can materially change model rankings. The paper extends uncertainty-aware LLM evaluation beyond binary correctness.

Artificial Intelligence Trustworthy AI

Geom@k: Fast to Converge, Slow to Drift

2026

COLM 2026 (under review)

This paper studies evaluation metrics for test-time scaling by separating answer discovery from repeated correctness. It derives Geom@k and the broader GeoSpectrum@K family from a common hypergeometric view of fixed-budget metrics. Across aggregate settings, Geom@2 provides a strong balance of fast convergence and low ranking drift relative to alternative summaries. The work offers a compute-aware perspective on stable evaluation under repeated sampling.

Artificial Intelligence Trustworthy AI

Using AI to Increase Efficiency of Multilingual Test Materials: Spanish BEL Sentences

2026

Work in progress

This work-in-progress explores how AI can improve the efficiency of creating multilingual auditory test materials, with a focus on Spanish BEL sentences. The project investigates workflow acceleration and quality support for multilingual assessment design. It sits at the intersection of language technology, hearing research, and educational test development. The aim is to reduce manual burden while preserving the validity of test materials.

Artificial Intelligence

Integrating self-configuring and foundational deep learning segmentation models for identifying the anal sphincter complex and perianal fistulae on pelvic MRI

2025

SPIE Medical Imaging 2025

This paper introduces an automated pelvic MRI segmentation pipeline that combines nnU-Net with MedSAM for identifying perianal fistulae and the anal sphincter complex. The approach leverages self-configuring and foundation-model segmentation components to improve robustness on a difficult clinical anatomy problem. It is designed to support interventional guidance and surgical planning in Crohn's disease. The work demonstrates how task-specific and foundation-model methods can be integrated for clinically useful MRI analysis.

Medical Imaging Computer Vision Artificial Intelligence

Radiomics to Detect Inflammation and Fibrosis on Magnetic Resonance Enterography in Stricturing Crohn’s Disease

2024

Journal of Crohn's and Colitis

This study develops radiomics-based machine learning models to characterize inflammation and fibrosis in Crohn's disease strictures from magnetic resonance enterography. The models improve diagnostic discrimination relative to radiologist visual scoring alone and show additional value when combined with expert assessment. The work addresses an important unmet need in noninvasive characterization of stricturing disease. It supports more quantitative and reproducible imaging-based assessment in inflammatory bowel disease.

Medical Imaging Computer Vision Artificial Intelligence

Spatial attention wavelon network (SpAWN) for survival-based risk stratification in kidney cancers via CT

2024

SPIE Medical Imaging 2024

SpAWN introduces a survival-risk stratification model for kidney cancer CT that combines spatial attention with wavelon activations. The design aims to improve interpretability and cross-cohort generalization for imaging-based survival prediction. The paper demonstrates that architectural choices tailored to spatial context can strengthen risk modeling from pre-operative scans. It contributes to clinically relevant prognostic modeling in oncologic imaging.

Medical Imaging Computer Vision Artificial Intelligence

Federated Image Quality Assessment of Prostate MRI Scans in a Multi-institutional Setting

2024

AACR Annual Meeting 2024

This work addresses image-quality variability in prostate MRI across multiple institutions using a federated analysis setting. It studies how artifact-related quality differences can affect the reliability and portability of downstream machine learning models. The abstract highlights the importance of multi-institutional quality assessment before model development and deployment. It contributes to more reliable imaging AI in federated and heterogeneous clinical environments.

Medical Imaging Artificial Intelligence

Deep Learning Based Risk Stratification of Pre-operative CT Scans is Prognostic of Overall Survival in Kidney Cancers

2024

AACR Annual Meeting 2024

This abstract reports a deep learning model that uses pre-operative CT scans to predict overall survival in kidney cancer. The model improves pre-operative risk assessment and offers prognostic value beyond standard clinicopathological factors. It represents an early step toward multi-institutional imaging biomarkers for survival-based treatment planning. The work supports broader use of CT-derived representations for oncologic prognostication.

Medical Imaging Computer Vision Artificial Intelligence

Intra-and Peri-tumoral Radiomic Features are Predictive of Pathologic Response to Multiple Neoadjuvant Therapy Regimen in Rectal Cancers via Pre-treatment MRI

2024

AACR Annual Meeting 2024

This study analyzes intra-tumoral and peri-tumoral radiomic features from pretreatment MRI to predict pathologic response in rectal cancer. It evaluates whether quantitative imaging phenotypes can identify responders across multiple neoadjuvant treatment regimens. The work aims to improve patient stratification beyond traditional staging and biomarker approaches. It contributes to noninvasive response prediction in rectal cancer management.

Medical Imaging Computer Vision Artificial Intelligence

Virtual Reality as an Acute Pain Reliever During Laceration Repair in Emergency Departments: A Randomized Controlled Trial

2023

Saudi Journal of Emergency Medicine

This randomized controlled trial investigates whether virtual reality can reduce acute pain during laceration repair in emergency departments. Adult patients undergoing repair were studied in a real clinical setting to assess pain relief during the procedure. The work explores immersive VR as a practical non-pharmacologic intervention for emergency care. It adds clinical evidence on the use of interactive technology for procedural pain management.

Artificial Intelligence