Reinforcement Learning Environment for Hermes
Design document for an Atropos-based RL environment that trains a dispatch/prompting model from issue worker outcomes.
Motivation
The issue worker generates natural RL signal on every run: an issue (prompt) goes in, an agent produces code (trajectory), and the outcome is scored (PR merged, no commits, escalated). This data is already captured by telemetry.py and retrospective.py. An RL environment formalizes this feedback loop to train a model that improves dispatch decisions over time.
The Broader Shift: Token Economics
a16z argues (in “There Are Only Two Paths Left for Software”) that software economics are reorganizing around AI agents that consume products via tokens rather than seats. Engineers will manage 20-30 agents simultaneously, spending ~$1000/month per engineer on token access.
This system is a concrete instance of that thesis. One human (goern) manages an autonomous agent (hermes) that dispatches coding agents (claude) to resolve issues across repos. The economics:
- Seat cost: Zero. The Claude Max subscription is flat-rate, not per-seat.
- Token cost: The dispatch model runs on cheap tokens (haiku for hermes gateway). The expensive tokens (Claude for coding) are covered by subscription.
- Human cost: Proportional to escalation rate. As the RL improves the dispatch model, escalations decrease, and the human’s time shifts from reviewing agent output to writing better issue descriptions.
The RL environment is the mechanism that drives this system from “human manages agents” toward “agents manage themselves, human sets direction.” Each improvement in autonomous resolution rate is a direct reduction in per-issue human cost, the same dynamic a16z describes as “your customers' first and most obvious source of AI savings is labor efficiency.”
The reward function encodes this: clean merges (high result score) reduce human review time; productive follow-on issues (high outcome score) mean the agent is generating compounding value, not just completing tasks.
What Gets Trained
Not Claude. We can’t fine-tune the Claude Code CLI. Instead, the RL environment trains a small local dispatch model (e.g., Qwen 2.5 7B on a GPU server) that optimizes:
- Prompt construction: what context to include for each issue type
- Agent selection: which agent to dispatch (claude, researcher, reviewer)
- Retry vs escalate: optimal attempt budget per issue type
- Issue quality prediction: pre-dispatch success likelihood (quality gate)
The trained model replaces the current keyword-matching heuristic in `run-agent.sh --match` and the hard-coded 3-attempt limit.
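For contrast, the kind of keyword heuristic being replaced can be sketched as follows. This is a purely hypothetical reconstruction (the real logic lives in `run-agent.sh`); the keyword table and default agent are illustrative:

```python
# Hypothetical sketch of a keyword-matching dispatch heuristic.
# The trained dispatch model replaces this table lookup with learned inference.
MATCH_RULES = [
    ({"bug", "fix", "error", "crash"}, "claude"),
    ({"investigate", "research", "compare"}, "researcher"),
    ({"review", "audit"}, "reviewer"),
]

def match_agent(title: str, body: str) -> str:
    """Pick an agent by keyword overlap with the issue text."""
    words = set((title + " " + body).lower().split())
    for keywords, agent in MATCH_RULES:
        if words & keywords:
            return agent
    return "claude"  # default agent when nothing matches
```

The brittleness is visible: the rules are checked in order, ties go to whichever rule comes first, and any issue phrased without the magic words falls through to the default.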
Business-Level Impact
The Outputs → Results → Outcomes chain doesn’t stop at the codebase. There is a fourth layer: the business outcome that the RL system ultimately serves.

Outputs → Results → Outcomes → Business Impact
(commits) (PR merged) (issue resolved) (velocity, cost, reliability)
The RL environment improves the dispatch model, which improves agent success rates, which reduces three business-level costs:
- Human review time. Every PR that needs human edits costs reviewer hours. A model that learns to produce clean merges directly reduces the review burden. Measurable as: time between PR creation and merge, trending downward.
- Issue throughput. The current system processes one issue per 30-minute timer tick, with a 60% first-attempt success rate. Improving prompt construction and agent selection increases the number of issues resolved per day without adding compute. Measurable as: issues closed per week with the `hermes-review` label.
- Escalation cost. Every `human-required` escalation means the autonomous system failed and a human must context-switch to understand and resolve the issue. The quality gate (trained by RL) reduces wasted attempts by predicting failure before spending 20 minutes of compute. Measurable as: escalation rate trending toward zero.
The RL loop creates a flywheel: better dispatch → more clean merges → more outcome data → better reward signal → better dispatch. The business metric that captures this is the autonomous resolution rate: the percentage of hermes-ready issues that reach hermes-review (PR created) without human intervention. The target is >80%.
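As a sketch, the metric can be computed from issue label sets. The field names here are assumptions (real data comes from the Codeberg API; the `human_intervened` flag would be derived from escalation labels), and the sketch assumes labels accumulate on an issue as it moves through the pipeline:

```python
def autonomous_resolution_rate(issues):
    """Share of hermes-ready issues that reached hermes-review (PR created)
    without human intervention. Field names are hypothetical."""
    ready = [i for i in issues if "hermes-ready" in i["labels"]]
    if not ready:
        return 0.0
    autonomous = [
        i for i in ready
        if "hermes-review" in i["labels"] and not i.get("human_intervened", False)
    ]
    return len(autonomous) / len(ready)
```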
Mapping to Atropos Concepts
| Atropos Concept | Hermes Equivalent |
|---|---|
| Environment | HermesIssueEnv: fetches issues, dispatches agents, scores outcomes |
| Item (prompt) | Codeberg issue title + body + repo metadata |
| Trajectory (rollout) | Agent’s response: code changes, commits, PR |
| Reward signal | Multi-signal: immediate (syntax, structure) + delayed (PR merge) |
| Group | Multiple attempts on the same issue (GRPO-style) |
| Metadata | Telemetry JSON blob from telemetry.py |
Environment Design
Config
from pydantic import Field

from atroposlib.envs import BaseEnv, BaseEnvConfig


class HermesIssueEnvConfig(BaseEnvConfig):
    codeberg_repos: str = Field(
        default="brenner-axiom/hermes-test-sandbox",
        description="Space-separated list of repos to scan",
    )
    codeberg_token: str = Field(default="", description="Codeberg API token")
    honcho_workspace: str = Field(default="hermes", description="Honcho workspace")
    max_issue_tokens: int = Field(default=2048, description="Max tokens for issue text")
    lookback_days: int = Field(default=7, description="Days to look back for delayed rewards")
    use_delayed_rewards: bool = Field(default=True, description="Include PR merge signal")


class HermesIssueEnv(BaseEnv):
    name = "hermes-issue-worker"
    env_config_cls = HermesIssueEnvConfig
Data Flow
┌──────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ Codeberg     │     │ HermesIssueEnv   │     │ Atropos Trainer  │
│ Issues       │────▶│ (RPi5 or local)  │────▶│ (GPU server)     │
│              │     │                  │     │                  │
│ hermes-ready │     │ get_next_item()  │     │ Receives:        │
│ label        │     │ score_response() │     │  - tokens        │
└──────────────┘     │ collect_traj()   │     │  - masked_tokens │
                     └──────────────────┘     │  - logprobs      │
                              ▲               │  - rewards       │
                              │               └────────┬─────────┘
                     ┌────────┴─────────┐              │
                     │ Delayed Reward   │              │
                     │ (retrospective)  │       ┌──────▼───────┐
                     │                  │       │ Trained      │
                     │ PR merged: +0.7  │       │ dispatch     │
                     │ PR rejected:-0.5 │       │ model        │
                     │ Human edit: -0.3 │       └──────────────┘
                     └──────────────────┘
get_next_item: Issue Fetcher
Fetches the oldest open issue with hermes-ready label from configured repos.
Returns the issue as a structured item with title, body, labels, and repo
metadata. Returns None when no issues are available (environment pauses).
async def get_next_item(self):
    for repo in self.config.codeberg_repos.split():
        issues = await self.codeberg_api(
            "GET",
            f"/repos/{repo}/issues"
            "?labels=hermes-ready&state=open&sort=created&direction=asc&limit=1"
        )
        if issues:
            issue = issues[0]
            return {
                "repo": repo,
                "issue_id": issue["number"],
                "title": issue["title"],
                "body": issue["body"] or "",
                "labels": [l["name"] for l in issue.get("labels", [])],
                "repo_file_count": await self.get_repo_file_count(repo),
            }
    return None
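The code above assumes a `codeberg_api` helper. A minimal sketch follows, assuming Codeberg's Gitea-compatible REST API rooted at `/api/v1` with token auth via the `Authorization` header; the helper name matches the call site above, everything else is illustrative:

```python
import asyncio
import json
import urllib.request

API_ROOT = "https://codeberg.org/api/v1"  # Gitea-compatible API root

def build_request(method, path, token=""):
    """Build the urllib Request for one Codeberg API call."""
    req = urllib.request.Request(API_ROOT + path, method=method)
    req.add_header("Accept", "application/json")
    if token:
        req.add_header("Authorization", f"token {token}")
    return req

async def codeberg_api(method, path, token=""):
    """Run the blocking HTTP call off the event loop, decode the JSON body."""
    req = build_request(method, path, token)

    def _call():
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read().decode("utf-8"))

    return await asyncio.to_thread(_call)
```

In the environment this would be a method on `HermesIssueEnv` pulling the token from `self.config.codeberg_token`; an async HTTP client would avoid the thread hop, and responses should be cached given Codeberg's rate limits.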
collect_trajectory: Agent Dispatch + Scoring
Constructs a prompt from the issue, sends it to the model being trained (the dispatch model), and scores the output. The dispatch model generates a structured decision: which agent, what prompt enrichment, and what context to include.
async def collect_trajectory(self, item):
    # The dispatch model generates the agent invocation strategy
    dispatch_prompt = self.build_dispatch_prompt(item)

    async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
        completion = await managed.chat_completion(
            messages=[
                {"role": "system", "content": DISPATCH_SYSTEM_PROMPT},
                {"role": "user", "content": dispatch_prompt},
            ],
            n=1,
            max_tokens=2048,
            temperature=0.7,
        )
        state = managed.get_state()
        node = state["nodes"][0]

    decision = completion.choices[0].message.content

    # Execute the decision (actually run the agent)
    outcome = await self.execute_dispatch(item, decision)

    # Score based on outcome
    reward = self.compute_reward(item, decision, outcome)

    return ScoredDataItem(
        tokens=node.tokens,
        masked_tokens=node.masked_tokens,
        logprobs=node.logprobs,
        score=reward,
    ), []
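`collect_trajectory` relies on a `build_dispatch_prompt` helper. One plausible sketch is below; the exact field layout is an assumption, as is the rough 4-characters-per-token truncation used to respect `max_issue_tokens`:

```python
def build_dispatch_prompt(item, max_issue_tokens=2048):
    """Render a get_next_item() dict into the dispatch model's user prompt.
    Truncation is approximate: ~4 characters per token keeps the body in budget."""
    body = item["body"][: max_issue_tokens * 4]
    return (
        f"Repo: {item['repo']} ({item['repo_file_count']} files)\n"
        f"Issue #{item['issue_id']}: {item['title']}\n"
        f"Labels: {', '.join(item['labels']) or 'none'}\n\n"
        f"{body}\n\n"
        "Decide: which agent to dispatch, what context to include, "
        "and whether to attempt at all. Respond with the JSON decision format."
    )
```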
Reward Function
The reward function maps to the Outputs → Results → Outcomes causal chain (reference). Each step moves further from agent control and closer to real-world impact:

Outputs → Results → Outcomes
(What the agent delivered) → (What it produced) → (What changed because of it)

Reward = Output Score + Result Score + Outcome Score
| Layer | Timing | Agent Control | Examples |
|---|---|---|---|
| Output | Immediate | Full | Commits, PR created, code compiles |
| Result | Hours | Partial | PR merged, tests pass in CI, no human edits needed |
| Outcome | Days to weeks | Indirect | Issue resolved, follow-on work unblocked, codebase improved |
Every dispatch carries an implicit hypothesis:
If we deliver [code changes] (output), we expect [a clean PR merge] (result), which should drive [the issue being resolved and the codebase improving] (outcome).
A break anywhere in the chain signals failure: commits without a merge (output without result), or a merge that requires human fixes (result without clean outcome).
Output Signals (immediate, under agent control)
| Signal | Reward | Condition |
|---|---|---|
| Agent completed without error | +0.1 | exit_code == 0 |
| Commits were made | +0.2 | commits > 0 |
| PR was created | +0.1 | pr_url is not None |
| Reasonable time spent | +0.1 | 30s < elapsed < 600s |
| Code compiles/parses | +0.1 | syntax check passes |
| Issue referenced in commit | +0.1 | commit message contains #N |
| Agent was blocked | -0.2 | blocked == true |
| Agent timed out | -0.3 | outcome == timed_out |
| No output produced | -0.2 | outcome == no_commits and no findings |
Result Signals (hours later, partially under agent control)
Results measure whether the output was adopted: did the PR merge cleanly? The agent can influence this by producing correct, well-tested code, but the human reviewer is the gatekeeper.
| Signal | Reward | Condition |
|---|---|---|
| PR merged without changes | +0.7 | merged and not human_modified |
| PR merged with human edits | -0.3 | merged but human had to fix it |
| PR closed (rejected) | -0.5 | closed without merge |
| First-attempt success | +0.2 | bonus: merged on attempt 1 |
Human edits are negative. If a human had to modify the PR before merging, the agent’s output was incomplete or incorrect. The model should learn to produce PRs that merge without intervention. A merge with edits is an output that produced a result, but not a clean one.
Outcome Signals (daysβweeks later, indirect agent influence)
Outcomes measure the meaningful change: was the issue actually resolved? Did the work improve the codebase? Did it unblock further progress? These are lagging indicators influenced by many factors beyond the agent’s control.
| Signal | Reward | Condition |
|---|---|---|
| Issue closed (resolved) | +0.1 | issue state == closed after PR merge |
| Issue still open after 7 days | -0.1 | stale despite PR being merged |
| Spawned follow-on issues | +0.3 | issues referencing this one exist |
| Follow-on issues merged easily | +0.2 | bonus: follow-ons merged on attempt 1 |
| Codebase regression | -0.4 | follow-on issues are bug fixes for this PR |
Follow-on issues are positive. Good PRs sometimes spawn follow-on work (tests, docs, refactoring). If those follow-on issues are resolved easily (first-attempt merge), the original PR set up the codebase well; the agent made good architectural decisions.
Regressions are strongly negative. If follow-on issues are bug fixes for code introduced by this PR, the agent introduced defects. The distinction between “spawned productive follow-on work” and “caused bugs that needed fixing” is the difference between an output that drove positive outcomes and one that drove negative ones.
def compute_output_reward(self, outcome):
    """Score the deliverable itself. Fully under agent control."""
    reward = 0.0
    if outcome["exit_code"] == 0:
        reward += 0.1
    if outcome["commits"] > 0:
        reward += 0.2
    if outcome.get("pr_url"):
        reward += 0.1
    if 30 < outcome["elapsed_seconds"] < 600:
        reward += 0.1
    if outcome["outcome"] == "blocked":
        reward -= 0.2
    if outcome["outcome"] == "timed_out":
        reward -= 0.3
    if outcome["outcome"] == "no_commits" and outcome["findings"] == 0:
        reward -= 0.2
    return max(min(reward, 1.0), -1.0)
def compute_result_reward(self, telemetry, pr_data):
    """Score whether the output was adopted. Partially under agent control."""
    reward = 0.0
    if pr_data and pr_data.get("merged"):
        if pr_data.get("human_modified"):
            # Output produced a result, but not a clean one
            reward -= 0.3
        else:
            # Clean adoption: output → result chain intact
            reward += 0.7
        if telemetry["attempt"] == 1:
            reward += 0.2  # First-attempt bonus
    elif pr_data and pr_data["state"] == "closed":
        # Output rejected: chain broken at result layer
        reward -= 0.5
    return reward
def compute_outcome_reward(self, issue_data, follow_on_issues=None):
    """Score the meaningful change. Indirect agent influence."""
    reward = 0.0
    # Was the issue actually resolved?
    if issue_data.get("state") == "closed":
        reward += 0.1
    elif issue_data.get("days_since_merge", 0) >= 7:
        # Issue still open 7+ days after the PR merged (assumes the
        # retrospective computes a days_since_merge field)
        reward -= 0.1
    if follow_on_issues:
        # Classify follow-ons: productive work vs regressions
        bug_fixes = [
            f for f in follow_on_issues
            if any(l in f.get("labels", []) for l in ["bug", "fix", "regression"])
        ]
        productive = [f for f in follow_on_issues if f not in bug_fixes]
        if productive:
            reward += 0.3  # Spawned productive follow-on work
            easy_merges = sum(
                1 for f in productive
                if f.get("merged_on_attempt", 99) == 1
            )
            if easy_merges > 0:
                reward += 0.2  # Follow-ons merged easily (good architecture)
        if bug_fixes:
            reward -= 0.4  # Introduced regressions (negative outcome)
    return reward
def compute_total_reward(self, outcome, telemetry, pr_data,
                         issue_data, follow_on_issues=None):
    """Total reward across the Outputs → Results → Outcomes chain.

    Hypothesis: If we deliver [code changes] (output), we expect
    [a clean PR merge] (result), which should drive [the issue
    being resolved and the codebase improving] (outcome).
    """
    output_r = self.compute_output_reward(outcome)
    result_r = self.compute_result_reward(telemetry, pr_data)
    outcome_r = self.compute_outcome_reward(issue_data, follow_on_issues)
    return output_r + result_r + outcome_r
The three reward functions correspond to three questions:
- Output: What did the agent deliver? (commits, PR, code quality)
- Result: What did the output produce? (clean merge, or human had to fix it)
- Outcome: What changed because of it? (issue resolved, codebase improved or regressed)
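A standalone worked example for the best case, summing the signal values from the tables above (only the signals the reward sketches implement; the syntax-check and commit-reference bonuses are omitted):

```python
# Best-case dispatch: clean exit, commits, PR opened, sane duration,
# clean merge on the first attempt, issue closed after merge.
output_r = 0.1 + 0.2 + 0.1 + 0.1   # output signals, fully under agent control
result_r = 0.7 + 0.2               # clean merge + first-attempt bonus
outcome_r = 0.1                    # issue resolved after merge
total = output_r + result_r + outcome_r
print(round(total, 2))  # 1.5
```

The asymmetry is deliberate: the worst realistic case (timeout, rejected PR, regression follow-ons) lands well below zero, so the model is pushed harder away from wasted dispatches than toward marginal wins.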
Dispatch Model Decision Format
The model being trained outputs structured JSON:
{
  "agent": "claude",
  "context_strategy": "include_file_listing",
  "prompt_enrichment": [
    "List existing files before making changes",
    "Run tests after modifying code"
  ],
  "estimated_difficulty": "medium",
  "should_attempt": true,
  "confidence": 0.75,
  "reasoning": "Issue asks for dependency migration, needs file context"
}
If `should_attempt` is false, the environment skips the dispatch and reports `hermes-needs-clarification`; this is the quality gate.
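Because the model's output can be malformed, the environment should validate it defensively. A minimal sketch that fails closed, treating bad output as `should_attempt: false` (the required-field set and fallback shape are assumptions):

```python
import json

REQUIRED_FIELDS = {"agent", "should_attempt", "confidence"}

def parse_decision(raw):
    """Parse the dispatch model's JSON decision; fail closed on bad output
    so malformed generations route to hermes-needs-clarification."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return {"should_attempt": False, "reasoning": "unparseable output"}
    if not isinstance(decision, dict) or not REQUIRED_FIELDS <= decision.keys():
        return {"should_attempt": False, "reasoning": "missing required fields"}
    return decision
```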
Training Modes
Online (Full Loop)
The environment runs on the RPi5, fetches real issues, dispatches real agents, and sends scored trajectories to a remote Atropos trainer. This requires:
- Atropos server on a GPU machine
- Network connectivity RPi5 → trainer
- Real Codeberg issues being processed
- Tolerance for slow iteration (~30 min per issue)
Offline (Batch Learning)
retrospective.py already collects telemetry + PR outcomes. Export this as a dataset and train offline:
- Export all telemetry JSON blobs from Codeberg issue comments
- Join with PR merge/reject outcomes
- Construct `ScoredDataGroup` entries
- Train the dispatch model on historical data
This is faster (no waiting for real issues) and lower risk (no real PRs created).
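Once exported, attempts can be regrouped per issue so that multiple attempts on the same issue form one GRPO-style group, mirroring the Atropos Group concept. A sketch, assuming records carry `repo` and `issue_id` under a `metadata` key as in the export script later in this document:

```python
import json
from collections import defaultdict

def load_groups(path):
    """Group exported JSONL records by (repo, issue_id) so every attempt
    on one issue lands in the same GRPO-style group."""
    groups = defaultdict(list)
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            key = (rec["metadata"]["repo"], rec["metadata"]["issue_id"])
            groups[key].append(rec)
    return dict(groups)
```

Grouping by issue is what lets the trainer compare attempts against each other rather than against an absolute baseline.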
Hybrid (Recommended Start)
- Phase 1: Collect telemetry for 50-100 issues (current system, no changes)
- Phase 2: Train offline on collected data, validate quality gate predictions
- Phase 3: Deploy trained model as the dispatch decision-maker
- Phase 4: Switch to online RL with Atropos for continuous improvement
Data Pipeline
Codeberg Issues
      │
      ▼
hermes-issue-worker.sh ──▶ telemetry.py ──▶ Codeberg comments (JSON)
      │                                 └──▶ Honcho sessions
      ▼  (daily)
retrospective.py ──▶ lessons ──▶ Honcho memory
      │          └──▶ digest ──▶ Codeberg tracking issue
      ▼  (export)
export_training_data.py ──▶ ScoredDataGroup JSONL
      │
      ▼
Atropos trainer ──▶ updated dispatch model
      │
      ▼
quality_gate.py (uses trained model for predictions)
Export Script
# export_training_data.py: extract training data from Codeberg telemetry

def export_scored_groups(repos, output_path):
    """Export telemetry + outcomes as Atropos-compatible JSONL."""
    for repo in repos:
        issues = get_all_issues_with_telemetry(repo)
        for issue in issues:
            telemetry_entries = parse_telemetry_comments(issue)
            pr = find_linked_pr(issue)
            for entry in telemetry_entries:
                prompt = build_dispatch_prompt(issue)
                immediate_reward = compute_reward_from_telemetry(entry)
                delayed_reward = compute_delayed_reward(entry, pr)
                scored_item = {
                    "prompt": prompt,
                    "response": entry,
                    "immediate_reward": immediate_reward,
                    "delayed_reward": delayed_reward,
                    "total_reward": immediate_reward + delayed_reward,
                    "metadata": {
                        "repo": repo,
                        "issue_id": issue["number"],
                        "attempt": entry["attempt"],
                        "outcome": entry["outcome"],
                    },
                }
                write_jsonl(output_path, scored_item)
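The script leaves `write_jsonl` undefined; a minimal sketch of the helper, appending one JSON object per line (the JSON Lines convention the downstream tooling reads):

```python
import json

def write_jsonl(path, record):
    """Append a single record as one line of JSON (JSON Lines format)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Append mode means repeated export runs accumulate records; a real pipeline would truncate or version the output file first.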
Infrastructure Requirements
| Component | Where | Resources |
|---|---|---|
| HermesIssueEnv | RPi5 or local machine | Minimal (API calls only) |
| Atropos trainer | GPU server | 1x GPU (A100/H100 for 7B model) |
| Dispatch model | RPi5 (inference) | ~4GB RAM for quantized 7B |
| Codeberg API | External | Rate-limited, use caching |
| Honcho | External (managed) | Included in plan |
Evaluation
async def evaluate(self):
    """Periodic evaluation: accuracy of dispatch decisions."""
    # Fetch recent outcomes from Codeberg (field names are illustrative)
    recent = get_recent_completed_issues(days=7)
    total = len(recent) or 1
    merged = [i for i in recent if i.get("merged")]
    metrics = {
        "success_rate": len(merged) / total,
        "first_attempt_rate": sum(1 for i in merged if i.get("attempt") == 1) / (len(merged) or 1),
        "escalation_rate": sum(1 for i in recent if i.get("escalated")) / total,
        "avg_attempts": sum(i.get("attempt", 1) for i in recent) / total,
        "avg_time_to_merge": sum(i.get("hours_to_merge", 0) for i in merged) / (len(merged) or 1),
    }
    self.wandb_log(metrics)
Implementation Phases
Phase 1: Data Collection (current, in progress)
- telemetry.py captures per-attempt data
- retrospective.py generates daily lessons
- Honcho stores cross-session context
- Accumulate 50+ issues of telemetry
Phase 2: Offline Analysis
- `export_training_data.py`: extract telemetry as JSONL dataset
- Analyze success/failure correlations (prompt length, issue labels, etc.)
- Train a simple classifier (logistic regression or small transformer)
- Deploy as `quality_gate.py` (#4)
Phase 3: Atropos Environment
- `hermes_issue_env.py`: BaseEnv subclass
- Reward function with immediate + delayed signals
- Dispatch model training on GPU server
- Evaluation pipeline
Phase 4: Online RL
- Deploy trained dispatch model on RPi5 (quantized)
- Replace the `--match` heuristic with model inference
- Continuous online training via Atropos
- A/B testing: model dispatch vs heuristic dispatch
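For the A/B test, a deterministic hash of the issue identity keeps every retry of the same issue in the same arm, so delayed rewards stay attributable to one strategy. A sketch (the 50/50 split is an assumption):

```python
import hashlib

def dispatch_arm(repo, issue_id, model_fraction=0.5):
    """Deterministically assign an issue to the 'model' or 'heuristic' arm.
    Hashing repo#issue keeps all retries of one issue in a single arm."""
    digest = hashlib.sha256(f"{repo}#{issue_id}".encode()).digest()
    return "model" if digest[0] / 256 < model_fraction else "heuristic"
```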
Open Questions
- Model size: Can a quantized 7B model run on the RPi5 for inference? It needs roughly 4GB of RAM, which does not fit the 512MB container limit; a separate inference service may be needed.
- Delayed reward attribution: When a PR is merged days later, how do we attribute the reward back to the specific trajectory? Atropos supports offline scoring, but the pipeline needs to be built.
- Exploration vs exploitation: Early on, the model should try different dispatch strategies (exploration). Later, it should converge on what works (exploitation). The temperature parameter and issue sampling strategy control this.
- Safety: The dispatch model decides whether to attempt an issue. A bad model could either attempt everything (wasting compute) or nothing (starving the pipeline). The 3-attempt escalation limit provides a safety floor.
- Cold start: Until enough data accumulates, the heuristic-based `--match` and hard-coded retry limit are fine. The RL environment enhances the existing system rather than replacing it.