InquiTree: Evaluating AI Agents in the Scientific Inquiry Loop with Paper-Derived Research Trees — AI benchmark

Shaoyang Cui

While LLM-based agents are increasingly used in scientific workflows, it remains unclear whether they can handle the dynamic and uncertain process of discovery. Existing static evaluations often conflate genuine reasoning with rote memorization. We introduce InquiTree, a diagnostic environment that formalizes scientific inquiry as interactive Research Trees: directed acyclic graphs capturing the logical dependencies among hypothesis formulation, study design, result interpretation, and belief updating. Evaluating agents on a 30-paper test pool and releasing the open-access InquiTree-18 (IT-18) subset, we identify two key limitations. First, agents exhibit an "Erosion of Marginal Capabilities": during long-horizon interactions, they develop "cognitive tunneling," where critical judgment and anomaly detection degrade relative to their intrinsic baselines. Second, performance drops on papers published after model training cutoffs, revealing a boundary between interpolation and extrapolation and suggesting that apparent competence is partly driven by parametric memory. These findings indicate that scaling context alone is insufficient for reliable AI scientists; stronger architectures or human oversight may be required to preserve critical evaluation and generalization.
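To make the Research Tree idea concrete, here is a minimal sketch of a directed acyclic graph whose nodes carry the four inquiry stages named in the abstract. It is an illustration only, not the InquiTree implementation: the class `ResearchTree`, its methods, the stage labels, and the node names `H1`, `D1`, `R1`, `B1` are all hypothetical.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

# Hypothetical stage labels, taken from the four activities named in the abstract.
STAGES = ("hypothesis", "study_design", "result_interpretation", "belief_update")


@dataclass
class ResearchTree:
    """Toy DAG: each node maps to the set of nodes it logically depends on."""
    stage: dict[str, str] = field(default_factory=dict)
    deps: dict[str, set[str]] = field(default_factory=dict)

    def add(self, node: str, stage: str, depends_on: tuple[str, ...] = ()) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.stage[node] = stage
        self.deps[node] = set(depends_on)

    def order(self) -> list[str]:
        # static_order() raises graphlib.CycleError if the graph is not acyclic.
        return list(TopologicalSorter(self.deps).static_order())

    def available(self, done: set[str]) -> list[str]:
        # Nodes not yet resolved whose prerequisites are all resolved.
        return sorted(n for n, d in self.deps.items() if n not in done and d <= done)


# Assumed example: one hypothesis leading to a design, a result, and a belief update.
tree = ResearchTree()
tree.add("H1", "hypothesis")
tree.add("D1", "study_design", ("H1",))
tree.add("R1", "result_interpretation", ("D1",))
tree.add("B1", "belief_update", ("H1", "R1"))
```

Calling `tree.available(done)` with the set of resolved nodes returns the steps an agent could attempt next, which is one plausible way to drive an interactive episode over such a graph.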

benchmark

Stars: 0
Forks: 0
HF Downloads (30d): not available
Last commit: not available
Refreshed: 1mo ago

Project health: Unknown. No activity data.

Production readiness: Research / Early. Best for exploration and prototyping.

Risk notes: Unknown license. Verify license before production use.

AgentHub Score

55 / 100

Composite score from 6 signals.


Growth: 40 (C)
Activity: 30 (C)
Documentation: 70 (C+)
Maturity: 45 (C)
Community: 42 (C)
Production: 58 (C)

GitHub stars: 0 days observed, not enough history.

Repository activity: 0 days observed, inactive (snapshots from pushed_at, 2026-07-26 to 2026-07-27).

Practical assessment

Should you use it?

✓ Best for

Research and experimentation
Prototype development
Learning agentic patterns

◎ Strengths

Active community
Open source
Well-documented API

✕ Not ideal for

Untested at scale without validation
Teams without AI/ML expertise

⚠ Watch-outs

Review changelog before updating
Verify license for commercial use

Technical details

What's inside

Language: not available
License: not available
Source: arXiv
Open source: No
Commercial use: not available
Docs: not available
Demo: not available
Paper: arXiv

AgentHub Score: 55/100 (below average)

Recent activity

Latest commit: not available
Indexed by AgentHub crawler: 1mo ago
Monitor for new releases: ongoing