Quick answer

Paper2025-03-01•Source ↗•10 attns0 checkouts

Claim

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

Authors

Discuss with Grok

Kaijing Ma·

Xinrun Du·

Yunran Wang·

Haoran Zhang·

Zhoufutu Wen·

Xingwei Qu·

Jian Yang·

Jiaheng Liu·

Minghao Liu·

Xiang Yue·

Wenhao Huang·

Ge Zhang

ABSTRACT

In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), a concept aimed at minimizing reliance on domain-specific knowledge, enabling more accurate evaluation of models' reasoning abilities in out-of-distribution settings. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), encompassing five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench emphasizes models' effectiveness in applying new rule descriptions to solve novel rule-driven questions. O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, surpassing Claude-3.5-Sonnet and GPT-4o (58.96% and 58.00%), highlighting the effectiveness of KOR-Bench. We perform detailed analyses, identifying bottlenecks in the Cipher task with Stepwise Prompting, where two rounds of Self-Correction yield optimal results. We evaluate performance across three integrated tasks, explore the impact of Tricks on the Puzzle task, and visualize rule-focused attention. Additionally, we conduct an ablation study on dataset size, benchmark correlations, and zero-shot and three-shot "only questions" experiments. KOR-Bench aims to enhance reasoning evaluation and support further research in this area.

#computer-version #multimodal-model #world-model #deep-learning #llm ByteDance Research

Review Snapshot

Explore ratings

0.0

★★★★★

0 ratings

5 star

4 star

3 star

2 star

1 star

Recommendation

recommend this content.

Review this content

Share your opinion to help other learners triage faster.

Write a review

Invite a reviewer

Invite someone by email to share an invited review for KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks.

Author Inquiries

Public questions about this content. Attendemia will route your question to the author. Vote on the most important ones. No guarantee of response.

Post an inquiry

Sort by: Most helpful