I would like to clarify whether Marketplace / app-based Rovo agents currently support the built-in evaluation workflow described in the Rovo documentation.
The documentation for Rovo agent evaluation explains the workflow through agent settings in Rovo/Studio, including dataset upload, response accuracy, resolution rate, and manual testing. However, it is not clear whether this functionality is also available for agents delivered as Marketplace apps / Forge app-based agents, rather than only for agents created directly through the Rovo/Studio UI.
Could you please confirm the following:
- Do Marketplace / app-based Rovo agents currently support the built-in evaluation workflow?
- If not, is this limitation expected, or is support planned in the future?
- Is there any recommended evaluation approach for Marketplace / app-based agents at the moment?
We want to understand whether built-in evaluation can be used for these agents, or whether we should design our own manual/human-based evaluation framework.
@OlehDanylevych,
Not at this time.
From what I can find internally, this gap is not an expected limitation in the long run. However, I cannot find a timeline for when we plan to close it.
The earlier advice from Forge PMs, both for building Forge-based Rovo Agents before the Evaluations feature existed and for building with Forge LLMs now, is to use “offline evaluations”, where “offline” means “using LLMs, but not Rovo directly”.
As an informal pointer to how to do that, I have been experimenting with an open-source evaluations framework called Promptfoo that works well inside common testing frameworks. The main Forge “trick” is to extract the Agent Instructions (aka the prompt entry) from the manifest. To understand the “wiring”, here’s a very simple example where I’ve wrapped Promptfoo evals into the automated tests: GitHub - ibuchanan/forge-rovo-guardrail-defect: Use the Defect Guardrail Agent to assess defect quality of defect reports
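To make that extraction step concrete, here is a minimal sketch. It assumes the Agent Instructions live in a single-line `prompt` field of a `rovo:agent` module in `manifest.yml` (the field name and inline manifest are illustrative assumptions; real manifests often use multi-line YAML scalars, which a proper YAML parser would handle better than a regex):

```python
# Hypothetical sketch: pull the Agent Instructions (assumed to be the
# `prompt` field of a rovo:agent module) out of a Forge manifest so they
# can be fed to an offline evaluation tool such as Promptfoo.
import re

# Illustrative stand-in for the contents of manifest.yml.
SAMPLE_MANIFEST = """\
modules:
  rovo:agent:
    - key: defect-guardrail-agent
      name: Defect Guardrail
      prompt: You assess defect reports for quality and completeness.
app:
  id: ari:cloud:ecosystem::app/example
"""

def extract_agent_prompt(manifest_text: str) -> str:
    """Return the first single-line `prompt:` value found in the manifest."""
    match = re.search(r"^\s*prompt:\s*(.+)$", manifest_text, re.MULTILINE)
    if match is None:
        raise ValueError("no prompt entry found in manifest")
    return match.group(1).strip()

print(extract_agent_prompt(SAMPLE_MANIFEST))
```

Once extracted, the prompt text can be written to a file and referenced from an evaluation config, so the same instructions that ship in the app are the ones being evaluated.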
I hope that works better than fully manual evaluations while you wait for us to close the gap with Studio-based Evaluations.