More than a Judge: An Empirical Study of Agent-Human Interaction in Crowdsourced Testing Assessment

Yue Wang, Yuan Zhao, Shengcheng Yu, Zhenyu Chen, Qing Gu

Agentic AI is increasingly being integrated into software engineering workflows. In crowdsourced testing, however, the large volume and uneven quality of submitted reports still create a substantial review burden for developers. In prior work, we developed and validated a multi-agent assessment backbone based on the LLM-as-a-Judge paradigm. That backbone assesses reports along three dimensions (textuality, adequacy, and competitiveness) and was shown to align well with human consensus while substantially reducing assessment effort. Yet reliable automated judging does not by itself show whether agent outputs can improve human work when embedded in a workflow. This paper studies that open question in the context of crowdsourced testing. We investigate whether assessment-derived, actionable feedback can improve how testers revise reports, perform on later tasks, and transfer reporting practices across applications. To do so, we conducted a controlled four-stage human-subject study with 20 testers across three real-world applications. The results show that agent-generated feedback supports immediate improvements in revised reports, better first submissions on a new task after prior feedback exposure, and evidence of partial but meaningful transfer to a later application. A post-task questionnaire completed by 17 participants complements these artifact-based findings by suggesting that the feedback was generally understandable, acted upon in revision, and carried into later tasks, while also revealing remaining friction in specificity and execution. Overall, the study provides empirical evidence that, in the studied crowdsourced testing setting, assessment agents can serve not only as post-hoc judges but also as workflow-integrated feedback providers that support upstream report-quality improvement.

benchmark
Refreshed 1d ago
Stars: 0
Forks: 0
HF Downloads30d
Last commit
Project health: Unknown. No activity data.
Production readiness: Research / Early. Best for exploration and prototyping.
Risk notes: Unknown license. Verify license before production use.
AgentHub Score: 48 / 100
Composite score from 6 signals.
Active project

  • Growth: 40 (C)
  • Activity: 30 (C)
  • Documentation: 70 (C+)
  • Maturity: 45 (C)
  • Community: 42 (C)
  • Production: 58 (C)
GitHub stars (90 days): 0 (+0.0%)
[Chart: commit activity over 52 weeks ("Active contributor activity"), Jun to now, scale low to high]
Practical assessment
Should you use it?

✓ Best for

  • Research and experimentation
  • Prototype development
  • Learning agentic patterns

◎ Strengths

  • Active community
  • Open source
  • Well-documented API

✕ Not ideal for

  • Untested at scale without validation
  • Teams without AI/ML expertise

⚠ Watch-outs

  • Review changelog before updating
  • Verify license for commercial use
Technical details
What's inside
Language
License
Source: arXiv
Open source: No
Commercial use
Docs
Demo

AgentHub Score: 48/100 (Below average)

Alternatives

  • crewai (Multi-Agent): 26.1k stars, score 87
  • autogen (Multi-Agent): 42.7k stars, score 71
  • smolagents (Coding): 11.2k stars, score 84
  • openai-agents-python (Multi-Agent): 9.4k stars, score 81

Recent activity

Latest commit: -
Indexed by AgentHub crawler: 1d ago
Monitor for new releases: ongoing