Case Western Reserve University
Wang (Van) Yang

PhD Student

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

2025

Conference on Language Modeling (COLM), October 7-10, 2025, Montreal, Canada

Recent advances in post-training improve model reasoning but require costly training pipelines and produce inefficient, overly long outputs. This paper introduces Speculative Thinking, a training-free framework in which a large reasoning model guides a smaller one during inference at the reasoning level, in contrast to token-level speculative decoding. The framework identifies structural cues, such as paragraph breaks followed by reflective phrases, that mark steps where small models tend to struggle, and delegates those steps to the larger model. The method significantly improves the smaller model's reasoning accuracy while shortening its outputs, yielding an efficient inference-time paradigm that preserves the small model's compute efficiency.
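A minimal sketch of the reasoning-level delegation loop described above, under stated assumptions: the cue phrase list, the stop marker, and the small_step/large_step callables are illustrative placeholders, not the paper's exact protocol.

```python
import re
from typing import Callable

# Hypothetical cue pattern: a paragraph break followed by a reflective phrase,
# the kind of structural signal used to flag steps where small models
# struggle. The specific phrase list is an assumption.
REFLECTIVE_CUE = re.compile(r"\n\n(?:Wait|Hmm|Alternatively|Let me re-check)\b")

def speculative_thinking(
    prompt: str,
    small_step: Callable[[str], str],  # small model: drafts one reasoning segment
    large_step: Callable[[str], str],  # large model: regenerates a flagged segment
    max_segments: int = 16,
) -> str:
    """Reasoning-level delegation: the small model drafts each segment;
    when a draft contains a reflective cue, that step is handed to the
    large model instead, and generation resumes with the small model."""
    trace = prompt
    for _ in range(max_segments):
        segment = small_step(trace)
        if REFLECTIVE_CUE.search(segment):
            segment = large_step(trace)  # delegate the difficult step
        trace += segment
        if "FINAL ANSWER:" in segment:  # hypothetical stop marker
            break
    return trace
```

Because the large model is invoked only on flagged segments, most tokens are still produced by the small model, which is how the framework keeps the small model's compute profile.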

Artificial Intelligence

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

2025

39th Conference on Neural Information Processing Systems (NeurIPS 2025), December 2025

Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. This paper hypothesizes that current reasoning limitations stem partly from insufficient long-context capacity, motivated by two observations: models with longer context windows tend to reason better, and failed reasoning cases resemble failed long-context cases. Controlled experiments comparing architecturally identical models with varying long-context capacities confirm that enhancing long-context ability before supervised fine-tuning improves reasoning, and the paper accordingly advocates treating long-context capacity as a first-class design objective.
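A sketch of the controlled-comparison setup, under assumed configurations: the recipe names and window sizes below are illustrative, not the paper's settings; only the stage ordering (context extension before supervised fine-tuning) reflects the stated finding.

```python
from dataclasses import dataclass

@dataclass
class Recipe:
    name: str
    context_window: int        # long-context capacity acquired before SFT
    extend_context_first: bool

# Architecturally identical models differing only in long-context capacity
# acquired before supervised fine-tuning. Window sizes are illustrative.
RECIPES = [
    Recipe("sft-only", context_window=4_096, extend_context_first=False),
    Recipe("long-context-then-sft", context_window=32_768, extend_context_first=True),
]

def training_stages(recipe: Recipe) -> list[str]:
    """Order matters: context extension precedes supervised fine-tuning."""
    stages = ["pretrained checkpoint"]
    if recipe.extend_context_first:
        stages.append(f"context extension to {recipe.context_window:,} tokens")
    stages.append("supervised fine-tuning on reasoning data")
    return stages
```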

Artificial Intelligence

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

2025

63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), July 27-August 1, 2025, Vienna, Austria

Existing long-context evaluation benchmarks fail to separate long-context performance from a model's baseline ability, making cross-model comparisons unclear. They are also typically constructed with fixed input lengths, which limits their applicability to models with different context windows. This paper introduces 100-LongBench, a length-controllable long-context benchmark with a novel metric that disentangles baseline knowledge from true long-context capability across multiple task categories. Experiments demonstrate that existing benchmarks significantly conflate baseline model strength with genuine long-context ability, revealing a widespread evaluation gap.
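One way such a disentangling metric could work, as a hedged sketch (the formula below is an assumption, not necessarily the paper's exact definition): normalize a model's long-context accuracy by its short-context accuracy on the same task, so that baseline strength divides out and only length-induced degradation remains.

```python
def length_controlled_score(short_ctx_acc: float, long_ctx_acc: float) -> float:
    """Illustrative disentangling metric: relative performance retention
    when the same task is presented at a long input length.

    1.0 means no degradation from added context; values near 0.0 mean the
    model's baseline ability collapses at length. Dividing by the
    short-context accuracy removes baseline strength from the comparison.
    """
    if short_ctx_acc <= 0.0:
        raise ValueError("baseline accuracy must be positive to normalize")
    return long_ctx_acc / short_ctx_acc


# Example: a strong model (0.9 short-context) dropping to 0.6 at length
# scores lower than a weaker model (0.5 short-context) holding 0.4.
assert round(length_controlled_score(0.9, 0.6), 2) == 0.67
assert round(length_controlled_score(0.5, 0.4), 2) == 0.80
```

Under this kind of normalization, a raw-accuracy benchmark would rank the stronger baseline model first, which is exactly the conflation the paper argues existing benchmarks exhibit.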

Artificial Intelligence