Explore/benchmark/Cross-Vendor Sola ISPM Benchmark: Evaluating Agentic AI for Federated Identity Security Reasoning
C

Eden Yavin, Gal Engelberg, Konstantin Koutsyi, Leon Goldberg, Gal Baron/Cross-Vendor Sola ISPM Benchmark: Evaluating Agentic AI for Federated Identity Security ReasoningUnknown

The rapid proliferation of multi-cloud and SaaS platforms has transformed Identity Security Posture Management (ISPM) into a fundamentally cross-vendor challenge: critical misconfigurations and privilege escalation paths increasingly span multiple identity providers, infrastructure layers, and authentication systems never designed to interoperate. Existing evaluations focus on isolated single-platform environments and provide no means to assess whether an AI agent can reason across these fragmented boundaries. To address this gap, we introduce the Cross-Vendor Sola ISPM Benchmark, a production-grade benchmark of 50 data-grounded tasks requiring multi-hop entity resolution and cross-system correlation across eight integrated enterprise platforms including AWS, Okta, Azure AD, and Google Workspace. We also contribute an evaluation framework measuring not only final answer correctness but also evidentiary grounding, structural join fidelity, retrieval quality, and SQL equivalence. We evaluate the Sola AI Agent across five context configurations - from no injected metadata to full schema, graph, and retrieval context - using three frontier LLMs. Results show that structured relational context improves answer correctness by approximately 34% relatively and reduces exploration queries by approximately 70% across all tested models, with the largest gains driven by cross-vendor graph topology. Our findings indicate that frontier LLMs possess substantial latent security reasoning capability, but reliable cross-vendor identity analysis is fundamentally constrained by the availability of explicit relational context for entity resolution and evidentiary grounding. Under full context, the best configuration achieves 78% answer correctness while reducing complete failure to 4%.

benchmark
GitHubCompare
Refreshed 1d ago
OverviewActivity52wAlternativesDocs
Stars0
Forks0
HF Downloads30d
Last commit
Refreshed1d ago
Project healthUnknownNo activity data.
Production readinessResearch / EarlyBest for exploration and prototyping.
Risk notesUnknown licenseVerify license before production use.
AgentHub Score
48 / 100
Composite score from 6 signals. How we score →
Active project
48Score
Growth
40C
Activity
30C
Documentation
70C+
Maturity
45C
Community
42C
Production
58C
GitHub stars · 90 days0 +0.0%
30d90d1y
latest release
Commit activity · 52 weeksActive contributor activity
LowHigh
JunSepDecMarNow
Practical assessment
Should you use it?

✓ Best for

  • Research and experimentation
  • Prototype development
  • Learning agentic patterns

◎ Strengths

  • Active community
  • Open source
  • Well-documented API

✕ Not ideal for

  • Untested at scale without validation
  • Teams without AI/ML expertise

⚠ Watch-outs

  • Review changelog before updating
  • Verify license for commercial use
Technical details
What's inside
Language
License
Sourcearxiv
Open source✗ No
Commercial use
Docs
Demo

AgentHub Score

48
Score 48/100
Below average

Alternatives

C
crewai
26.1k · Multi-Agent
87
A
autogen
42.7k · Multi-Agent
71
S
smolagents
11.2k · Coding
84
O
openai-agents-python
9.4k · Multi-Agent
81
Compare all →

Recent activity

Latest commit —
Indexed by AgentHub crawler1d ago
Monitor for new releasesongoing