Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

Chen Linze, Cai Yufan, Hou Zhe, Dong Jin Song

Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change its output when a perturbation alters a legally material point. We formulate this requirement as a legal-relevance-sensitive evaluation problem: an LLM should be sensitive only to legally relevant changes. We introduce a unified evaluation suite covering should-change and should-not-change evaluation across judicial fairness, robustness, and statute-confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact-statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy but also calibrated sensitivity to legally material changes.
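The should-change / should-not-change idea in the abstract can be illustrated with a minimal sketch. This is not the paper's evaluation suite or its metrics: the names `PerturbationCase`, `relevance_sensitivity`, and `toy_model`, the two scores, and the example sentences are all hypothetical, chosen only to show how stability under irrelevant edits and sensitivity to material edits might be scored separately.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PerturbationCase:
    original: str        # unmodified case description
    perturbed: str       # description after the perturbation
    should_change: bool  # True if the perturbation is legally material


def relevance_sensitivity(
    model: Callable[[str], str], cases: List[PerturbationCase]
) -> Dict[str, Optional[float]]:
    """Score a model on both halves of the requirement (hypothetical metric).

    stability   = share of should-not-change cases where the output stayed the same
    sensitivity = share of should-change cases where the output changed
    """
    stable_hits = stable_total = 0
    sens_hits = sens_total = 0
    for case in cases:
        changed = model(case.original) != model(case.perturbed)
        if case.should_change:
            sens_total += 1
            sens_hits += int(changed)
        else:
            stable_total += 1
            stable_hits += int(not changed)
    return {
        "stability": stable_hits / stable_total if stable_total else None,
        "sensitivity": sens_hits / sens_total if sens_total else None,
    }


def toy_model(text: str) -> str:
    """Stand-in for an LLM: a brittle keyword rule, for illustration only."""
    return "theft" if "took" in text else "no offence"


base = "The defendant took the wallet."
cases = [
    # Irrelevant attribute added: the outcome should not change.
    PerturbationCase(base, "The wealthy defendant took the wallet.", False),
    # Benign paraphrase: the outcome should not change, but the toy rule flips.
    PerturbationCase(base, "The defendant grabbed the wallet.", False),
    # Material fact changed: the outcome should change.
    PerturbationCase(base, "The defendant returned the wallet.", True),
]

scores = relevance_sensitivity(toy_model, cases)
print(scores)
```

Reporting the two numbers separately is a design choice of this sketch: a model that never changes its answer would score perfectly on stability and zero on sensitivity, which a single accuracy figure would hide.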

Project health: Unknown. No activity data.

Production readiness: Research / Early. Best for exploration and prototyping.

Risk notes: Unknown license. Verify license before production use.

AgentHub Score: 48 / 100 (composite score from 6 signals)

Growth: 40 (C)
Activity: 30 (C)
Documentation: 70 (C+)
Maturity: 45 (C)
Community: 42 (C)
Production: 58 (C)

GitHub stars, 90 days: 0 (+0.0%)

[Chart: commit activity over 52 weeks]

Practical assessment

Should you use it?

✓ Best for

Research and experimentation
Prototype development
Learning agentic patterns

◎ Strengths

Active community
Open source
Well-documented API

✕ Not ideal for

Deployment at scale without prior validation
Teams without AI/ML expertise

⚠ Watch-outs

Review changelog before updating
Verify license for commercial use

Technical details

Source: arXiv

Open source: No

Alternatives

crewai: 26.1k · Multi-Agent
autogen: 42.7k · Multi-Agent
smolagents: 11.2k · Coding
openai-agents-python: 9.4k · Multi-Agent
