Reinforcement Learning Environment for Hermes

Design document for an Atropos-based RL environment that trains a dispatch/prompting model from issue worker outcomes.

Motivation

The issue worker generates natural RL signal on every run: an issue (prompt) goes in, an agent produces code (trajectory), and the outcome is scored (PR merged, no commits, escalated). This data is already captured by telemetry.py and retrospective.py. An RL environment formalizes this feedback loop to train a model that improves dispatch decisions over time.

The Broader Shift: Token Economics

A16z argues (There Are Only Two Paths Left for Software) that software economics are reorganizing around AI agents that consume products via tokens rather than seats. Engineers will manage 20-30 agents simultaneously, spending ~$1000/month per engineer on token access.

This system is a concrete instance of that thesis. One human (goern) manages an autonomous agent (hermes) that dispatches coding agents (claude) to resolve issues across repos. The economics:

  • Seat cost: Zero. The Claude Max subscription is flat-rate, not per-seat.
  • Token cost: The dispatch model runs on cheap tokens (haiku for hermes gateway). The expensive tokens (Claude for coding) are covered by subscription.
  • Human cost: Proportional to escalation rate. As the RL improves the dispatch model, escalations decrease, and the human's time shifts from reviewing agent output to writing better issue descriptions.

The RL environment is the mechanism that drives this system from "human manages agents" toward "agents manage themselves, human sets direction." Each improvement in autonomous resolution rate is a direct reduction in per-issue human cost: the same dynamic a16z describes as "your customers' first and most obvious source of AI savings is labor efficiency."

The reward function encodes this: clean merges (high result score) reduce human review time; productive follow-on issues (high outcome score) mean the agent is generating compounding value, not just completing tasks.

What Gets Trained

Not Claude. We can't fine-tune the Claude Code CLI. Instead, the RL environment trains a small local dispatch model (e.g., Qwen 2.5 7B on a GPU server) that optimizes:

  1. Prompt construction: what context to include for each issue type
  2. Agent selection: which agent to dispatch (claude, researcher, reviewer)
  3. Retry vs escalate: optimal attempt budget per issue type
  4. Issue quality prediction: pre-dispatch success likelihood (quality gate)

The trained model replaces the current keyword-matching heuristic in run-agent.sh --match and the hard-coded 3-attempt limit.

Business-Level Impact

The Outputs → Results → Outcomes chain doesn't stop at the codebase. There is a fourth layer: the business outcome that the RL system ultimately serves.

Outputs  →  Results    →  Outcomes         →  Business Impact
(commits)   (PR merged)   (issue resolved)    (velocity, cost, reliability)

The RL environment improves the dispatch model, which improves agent success rates, which reduces three business-level costs:

  1. Human review time. Every PR that needs human edits costs reviewer hours. A model that learns to produce clean merges directly reduces the review burden. Measurable as: time between PR creation and merge, trending downward.

  2. Issue throughput. The current system processes one issue per 30-minute timer tick, with a 60% first-attempt success rate. Improving prompt construction and agent selection increases the number of issues resolved per day without adding compute. Measurable as: issues closed per week with the hermes-review label.

  3. Escalation cost. Every human-required escalation means the autonomous system failed and a human must context-switch to understand and resolve the issue. The quality gate (trained by RL) reduces wasted attempts by predicting failure before spending 20 minutes of compute. Measurable as: escalation rate trending toward zero.
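A back-of-envelope check on the throughput ceiling implied above, using the numbers quoted in the text (this is an upper bound; retries also consume timer ticks, so real throughput lands lower):

```python
# Upper bound on daily resolutions under the current 30-minute timer,
# at the 60% first-attempt success rate quoted above.
ticks_per_day = 24 * 60 // 30          # one dispatch opportunity per tick
first_attempt_rate = 0.6
max_resolved_per_day = round(ticks_per_day * first_attempt_rate, 1)  # 28.8
```

Improving the first-attempt rate is the only lever that raises this ceiling without adding compute, which is why the dispatch model targets it directly.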

The RL loop creates a flywheel: better dispatch → more clean merges → more outcome data → better reward signal → better dispatch. The business metric that captures this is autonomous resolution rate: the percentage of hermes-ready issues that reach hermes-review (PR created) without human intervention. The target is >80%.
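The metric itself is cheap to compute from per-issue dispatch records; a sketch, with illustrative field names rather than the real telemetry schema:

```python
def autonomous_resolution_rate(dispatched):
    """Fraction of hermes-ready issues that reached hermes-review (PR created)
    with no human intervention. Record fields are illustrative."""
    if not dispatched:
        return 0.0
    auto = sum(1 for d in dispatched if d["reached_review"] and not d["escalated"])
    return auto / len(dispatched)

issues = [
    {"reached_review": True,  "escalated": False},
    {"reached_review": True,  "escalated": False},
    {"reached_review": True,  "escalated": True},   # human had to step in
    {"reached_review": False, "escalated": True},
]
rate = autonomous_resolution_rate(issues)  # 0.5, well below the >80% target
```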

Mapping to Atropos Concepts

Atropos Concept        Hermes Equivalent
---------------        -----------------
Environment            HermesIssueEnv: fetches issues, dispatches agents, scores outcomes
Item (prompt)          Codeberg issue title + body + repo metadata
Trajectory (rollout)   Agent's response: code changes, commits, PR
Reward signal          Multi-signal: immediate (syntax, structure) + delayed (PR merge)
Group                  Multiple attempts on the same issue (GRPO-style)
Metadata               Telemetry JSON blob from telemetry.py

Environment Design

Config

from pydantic import Field
from atroposlib.envs import BaseEnv, BaseEnvConfig

class HermesIssueEnvConfig(BaseEnvConfig):
    codeberg_repos: str = Field(
        default="brenner-axiom/hermes-test-sandbox",
        description="Space-separated list of repos to scan",
    )
    codeberg_token: str = Field(default="", description="Codeberg API token")
    honcho_workspace: str = Field(default="hermes", description="Honcho workspace")
    max_issue_tokens: int = Field(default=2048, description="Max tokens for issue text")
    lookback_days: int = Field(default=7, description="Days to look back for delayed rewards")
    use_delayed_rewards: bool = Field(default=True, description="Include PR merge signal")

class HermesIssueEnv(BaseEnv):
    name = "hermes-issue-worker"
    env_config_cls = HermesIssueEnvConfig

Data Flow

┌──────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Codeberg   │     │  HermesIssueEnv  │     │ Atropos Trainer │
│   Issues     │────▶│  (RPi5 or local) │────▶│  (GPU server)   │
│              │     │                  │     │                 │
│  hermes-ready│     │  get_next_item() │     │  Receives:      │
│  label       │     │  score_response()│     │  - tokens       │
└──────────────┘     │  collect_traj()  │     │  - masked_tokens│
                     └──────────────────┘     │  - logprobs     │
                              ▲               │  - rewards      │
                              │               └────────┬────────┘
                     ┌────────┴─────────┐              │
                     │  Delayed Reward  │              │
                     │  (retrospective) │       ┌──────▼──────┐
                     │                  │       │  Trained    │
                     │  PR merged: +0.7 │       │  dispatch   │
                     │  PR rejected:-0.3│       │  model      │
                     │  Human edit: -0.3│       └─────────────┘
                     └──────────────────┘

get_next_item: Issue Fetcher

Fetches the oldest open issue with hermes-ready label from configured repos. Returns the issue as a structured item with title, body, labels, and repo metadata. Returns None when no issues are available (environment pauses).

async def get_next_item(self):
    for repo in self.config.codeberg_repos.split():
        issues = await self.codeberg_api(
            "GET",
            f"/repos/{repo}/issues"
            f"?labels=hermes-ready&state=open&sort=created&direction=asc&limit=1"
        )
        if issues:
            issue = issues[0]
            return {
                "repo": repo,
                "issue_id": issue["number"],
                "title": issue["title"],
                "body": issue["body"] or "",
                "labels": [l["name"] for l in issue.get("labels", [])],
                "repo_file_count": await self.get_repo_file_count(repo),
            }
    return None

collect_trajectory: Agent Dispatch + Scoring

Constructs a prompt from the issue, sends it to the model being trained (the dispatch model), and scores the output. The dispatch model generates a structured decision: which agent, what prompt enrichment, and what context to include.

async def collect_trajectory(self, item):
    # The dispatch model generates the agent invocation strategy
    dispatch_prompt = self.build_dispatch_prompt(item)

    async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
        completion = await managed.chat_completion(
            messages=[
                {"role": "system", "content": DISPATCH_SYSTEM_PROMPT},
                {"role": "user", "content": dispatch_prompt},
            ],
            n=1,
            max_tokens=2048,
            temperature=0.7,
        )

        state = managed.get_state()
        node = state["nodes"][0]
        decision = completion.choices[0].message.content

        # Execute the decision (actually run the agent)
        outcome = await self.execute_dispatch(item, decision)

        # Score based on outcome
        reward = self.compute_reward(item, decision, outcome)

        return ScoredDataItem(
            tokens=node.tokens,
            masked_tokens=node.masked_tokens,
            logprobs=node.logprobs,
            score=reward,
        ), []

Reward Function

The reward function maps to the Outputs → Results → Outcomes causal chain (reference). Each step moves further from agent control and closer to real-world impact:

Outputs → Results → Outcomes
(What the agent delivered) → (What it produced) → (What changed because of it)

Reward = Output Score + Result Score + Outcome Score
Layer     Timing       Agent Control   Examples
-----     ------       -------------   --------
Output    Immediate    Full            Commits, PR created, code compiles
Result    Hours        Partial         PR merged, tests pass in CI, no human edits needed
Outcome   Days-weeks   Indirect        Issue resolved, follow-on work unblocked, codebase improved

Every dispatch carries an implicit hypothesis:

If we deliver [code changes] (output), we expect [a clean PR merge] (result), which should drive [the issue being resolved and the codebase improving] (outcome).

A break anywhere in the chain signals failure: commits without a merge (output without result), or a merge that requires human fixes (result without clean outcome).

Output Signals (immediate, under agent control)

Signal                          Reward   Condition
------                          ------   ---------
Agent completed without error    +0.1    exit_code == 0
Commits were made                +0.2    commits > 0
PR was created                   +0.1    pr_url is not None
Reasonable time spent            +0.1    30s < elapsed < 600s
Code compiles/parses             +0.1    syntax check passes
Issue referenced in commit       +0.1    commit message contains #N
Agent was blocked                -0.2    blocked == true
Agent timed out                  -0.3    outcome == timed_out
No output produced               -0.2    outcome == no_commits and no findings

Result Signals (hours later, partially under agent control)

Results measure whether the output was adopted: did the PR merge cleanly? The agent can influence this by producing correct, well-tested code, but the human reviewer is the gatekeeper.

Signal                        Reward   Condition
------                        ------   ---------
PR merged without changes      +0.7    merged and not human_modified
PR merged with human edits     -0.3    merged but human had to fix it
PR closed (rejected)           -0.5    closed without merge
First-attempt success          +0.2    bonus: merged on attempt 1

Human edits are negative. If a human had to modify the PR before merging, the agent's output was incomplete or incorrect. The model should learn to produce PRs that merge without intervention. A merge with edits is an output that produced a result, but not a clean one.

Outcome Signals (days-weeks later, indirect agent influence)

Outcomes measure the meaningful change: was the issue actually resolved? Did the work improve the codebase? Did it unblock further progress? These are lagging indicators influenced by many factors beyond the agent's control.

Signal                           Reward   Condition
------                           ------   ---------
Issue closed (resolved)           +0.1    issue state == closed after PR merge
Issue still open after 7 days     -0.1    stale despite PR being merged
Spawned follow-on issues          +0.3    issues referencing this one exist
Follow-on issues merged easily    +0.2    bonus: follow-ons merged on attempt 1
Codebase regression               -0.4    follow-on issues are bug fixes for this PR

Follow-on issues are positive. Good PRs sometimes spawn follow-on work (tests, docs, refactoring). If those follow-on issues are resolved easily (first-attempt merge), the original PR set up the codebase well: the agent made good architectural decisions.

Regressions are strongly negative. If follow-on issues are bug fixes for code introduced by this PR, the agent introduced defects. The distinction between "spawned productive follow-on work" and "caused bugs that needed fixing" is the difference between an output that drove positive outcomes and one that drove negative ones.

def compute_output_reward(self, outcome):
    """Score the deliverable itself. Fully under agent control."""
    reward = 0.0

    if outcome["exit_code"] == 0:
        reward += 0.1
    if outcome["commits"] > 0:
        reward += 0.2
    if outcome.get("pr_url"):
        reward += 0.1
    if 30 < outcome["elapsed_seconds"] < 600:
        reward += 0.1
    if outcome.get("syntax_ok"):
        reward += 0.1  # code compiles/parses (field name illustrative)
    if outcome.get("issue_referenced"):
        reward += 0.1  # commit message contains #N (field name illustrative)
    if outcome["outcome"] == "blocked":
        reward -= 0.2
    if outcome["outcome"] == "timed_out":
        reward -= 0.3
    if outcome["outcome"] == "no_commits" and outcome["findings"] == 0:
        reward -= 0.2

    return max(min(reward, 1.0), -1.0)

def compute_result_reward(self, telemetry, pr_data):
    """Score whether the output was adopted. Partially under agent control."""
    reward = 0.0

    if pr_data and pr_data.get("merged"):
        if pr_data.get("human_modified"):
            # Output produced a result, but not a clean one
            reward -= 0.3
        else:
            # Clean adoption: the output -> result chain stayed intact
            reward += 0.7
        if telemetry["attempt"] == 1:
            reward += 0.2  # First-attempt bonus
    elif pr_data and pr_data.get("state") == "closed":
        # Output rejected: chain broken at the result layer
        reward -= 0.5

    return reward

def compute_outcome_reward(self, issue_data, follow_on_issues=None):
    """Score the meaningful change. Indirect agent influence."""
    reward = 0.0

    # Was the issue actually resolved?
    if issue_data.get("state") == "closed":
        reward += 0.1
    else:
        # Still open; this function is called after the lookback window,
        # so an open issue counts as stale despite the merge
        reward -= 0.1

    if follow_on_issues:
        # Classify follow-ons: productive work vs regressions
        bug_fixes = [
            f for f in follow_on_issues
            if any(l in f.get("labels", []) for l in ["bug", "fix", "regression"])
        ]
        productive = [f for f in follow_on_issues if f not in bug_fixes]

        if productive:
            reward += 0.3  # Spawned productive follow-on work
            easy_merges = sum(
                1 for f in productive
                if f.get("merged_on_attempt", 99) == 1
            )
            if easy_merges > 0:
                reward += 0.2  # Follow-ons merged easily (good architecture)

        if bug_fixes:
            reward -= 0.4  # Introduced regressions (negative outcome)

    return reward

def compute_total_reward(self, outcome, telemetry, pr_data,
                         issue_data, follow_on_issues=None):
    """Total reward across the Outputs β†’ Results β†’ Outcomes chain.

    Hypothesis: If we deliver [code changes] (output), we expect
    [a clean PR merge] (result), which should drive [the issue
    being resolved and the codebase improving] (outcome).
    """
    output_r = self.compute_output_reward(outcome)
    result_r = self.compute_result_reward(telemetry, pr_data)
    outcome_r = self.compute_outcome_reward(issue_data, follow_on_issues)

    return output_r + result_r + outcome_r

The three reward functions correspond to three questions:

  • Output: What did the agent deliver? (commits, PR, code quality)
  • Result: What did the output produce? (clean merge, or human had to fix it)
  • Outcome: What changed because of it? (issue resolved, codebase improved or regressed)
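The three layers can be made concrete with a worked example, using signal values from the tables above (plain scenario arithmetic, not the scoring code itself):

```python
# Scenario A: clean first-attempt merge that closes the issue.
output_a  = 0.1 + 0.2 + 0.1 + 0.1   # exit 0, commits, PR created, reasonable time
result_a  = 0.7 + 0.2               # merged without edits, first-attempt bonus
outcome_a = 0.1                     # issue closed after merge
total_a = output_a + result_a + outcome_a   # about 1.5

# Scenario B: same output, but a human had to fix the PR and a
# regression issue appeared later.
output_b  = 0.1 + 0.2 + 0.1 + 0.1
result_b  = -0.3 + 0.2              # merged with human edits, still attempt 1
outcome_b = 0.1 - 0.4               # issue closed, but follow-ons were bug fixes
total_b = output_b + result_b + outcome_b   # about 0.1
```

The clean trajectory outscores the messy one by 1.4 even though both ended in a merged PR; that spread is the gradient pushing the dispatch model toward strategies that merge cleanly.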

Dispatch Model Decision Format

The model being trained outputs structured JSON:

{
  "agent": "claude",
  "context_strategy": "include_file_listing",
  "prompt_enrichment": [
    "List existing files before making changes",
    "Run tests after modifying code"
  ],
  "estimated_difficulty": "medium",
  "should_attempt": true,
  "confidence": 0.75,
  "reasoning": "Issue asks for dependency migration, needs file context"
}

If should_attempt is false, the environment skips the dispatch and reports hermes-needs-clarification; this is the quality gate.
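One way the environment might consume this decision (a sketch; the JSON fields and label name come from the text above, while the function and return shape are illustrative):

```python
import json

def apply_quality_gate(decision_text):
    """Parse the dispatch model's JSON decision and gate the dispatch.
    Returns (action, detail); the shape is illustrative."""
    decision = json.loads(decision_text)
    if not decision.get("should_attempt", False):
        # Quality gate: skip the dispatch, flag the issue for a human rewrite
        return ("skip", "hermes-needs-clarification")
    return ("dispatch", decision["agent"])

action, detail = apply_quality_gate(
    '{"agent": "claude", "should_attempt": true, "confidence": 0.75}'
)
# action == "dispatch", detail == "claude"
```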

Training Modes

Online (Full Loop)

The environment runs on the RPi5, fetches real issues, dispatches real agents, and sends scored trajectories to a remote Atropos trainer. This requires:

  • Atropos server on a GPU machine
  • Network connectivity RPi5 ↔ trainer
  • Real Codeberg issues being processed
  • Tolerance for slow iteration (~30min per issue)

Offline (Batch Learning)

retrospective.py already collects telemetry + PR outcomes. Export this as a dataset and train offline:

  1. Export all telemetry JSON blobs from Codeberg issue comments
  2. Join with PR merge/reject outcomes
  3. Construct ScoredDataGroup entries
  4. Train the dispatch model on historical data

This is faster (no waiting for real issues) and lower risk (no real PRs created).

The two modes combine into a staged rollout:

  1. Phase 1: Collect telemetry for 50-100 issues (current system, no changes)
  2. Phase 2: Train offline on collected data, validate quality gate predictions
  3. Phase 3: Deploy trained model as the dispatch decision-maker
  4. Phase 4: Switch to online RL with Atropos for continuous improvement

Data Pipeline

Codeberg Issues
     │
     ▼
hermes-issue-worker.sh → telemetry.py → Codeberg comments (JSON)
                                      → Honcho sessions
     │
     ▼ (daily)
retrospective.py → lessons → Honcho memory
                 → digest  → Codeberg tracking issue
     │
     ▼ (export)
export_training_data.py → ScoredDataGroup JSONL
     │
     ▼
Atropos trainer → updated dispatch model
     │
     ▼
quality_gate.py (uses trained model for predictions)

Export Script

# export_training_data.py: extract training data from Codeberg telemetry
def export_scored_groups(repos, output_path):
    """Export telemetry + outcomes as Atropos-compatible JSONL."""
    for repo in repos:
        issues = get_all_issues_with_telemetry(repo)
        for issue in issues:
            telemetry_entries = parse_telemetry_comments(issue)
            pr = find_linked_pr(issue)

            for entry in telemetry_entries:
                prompt = build_dispatch_prompt(issue)
                immediate_reward = compute_reward_from_telemetry(entry)
                delayed_reward = compute_delayed_reward(entry, pr)

                scored_item = {
                    "prompt": prompt,
                    "response": entry,
                    "immediate_reward": immediate_reward,
                    "delayed_reward": delayed_reward,
                    "total_reward": immediate_reward + delayed_reward,
                    "metadata": {
                        "repo": repo,
                        "issue_id": issue["number"],
                        "attempt": entry["attempt"],
                        "outcome": entry["outcome"],
                    },
                }
                write_jsonl(output_path, scored_item)

Infrastructure Requirements

Component         Where                  Resources
---------         -----                  ---------
HermesIssueEnv    RPi5 or local machine  Minimal (API calls only)
Atropos trainer   GPU server             1x GPU (A100/H100 for 7B model)
Dispatch model    RPi5 (inference)       ~4GB RAM for quantized 7B
Codeberg API      External               Rate-limited, use caching
Honcho            External (managed)     Included in plan
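The ~4GB figure is a weights-only back-of-envelope for 4-bit quantization; KV cache and runtime overhead come on top:

```python
params = 7e9                 # 7B parameter model
bytes_per_param = 0.5        # 4-bit quantization: half a byte per weight
weights_gb = params * bytes_per_param / 1e9   # 3.5 GB for weights alone
```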

Evaluation

async def evaluate(self):
    """Periodic evaluation: accuracy of dispatch decisions."""
    # Fetch recent outcomes from Codeberg (telemetry joined with PR state);
    # the record fields below are illustrative
    recent = get_recent_completed_issues(days=7)

    total = len(recent)
    merged = [r for r in recent if r.get("merged")]
    metrics = {
        "success_rate": len(merged) / total,
        "first_attempt_rate": sum(1 for r in merged if r["attempt"] == 1) / max(len(merged), 1),
        "escalation_rate": sum(1 for r in recent if r.get("escalated")) / total,
        "avg_attempts": sum(r["attempt"] for r in recent) / total,
        "avg_time_to_merge": sum(r["merge_hours"] for r in merged) / max(len(merged), 1),
    }

    self.wandb_log(metrics)

Implementation Phases

Phase 1: Data Collection (current, in progress)

  • telemetry.py captures per-attempt data
  • retrospective.py generates daily lessons
  • Honcho stores cross-session context
  • Accumulate 50+ issues of telemetry

Phase 2: Offline Analysis

  • export_training_data.py: extract telemetry as JSONL dataset
  • Analyze success/failure correlations (prompt length, issue labels, etc.)
  • Train simple classifier (logistic regression or small transformer)
  • Deploy as quality_gate.py (#4)
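Phase 2's "simple classifier" can be sketched as a stdlib-only logistic regression; the feature names and toy data here are illustrative, not the real telemetry schema:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_quality_gate(rows, epochs=200, lr=0.1):
    """Plain SGD logistic regression. rows: [(features, merged_0_or_1), ...]"""
    n = len(rows[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in rows:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y                               # gradient of log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Probability the issue merges; gate dispatch on a threshold."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy rows: [prompt_length_kb, has_repro_steps, label_count] -> merged?
data = [
    ([0.5, 1, 2], 1),
    ([3.0, 0, 0], 0),
    ([0.8, 1, 1], 1),
    ([2.5, 0, 3], 0),
]
w, b = train_quality_gate(data)
```

In practice the features would come from the parsed telemetry comments, and a library implementation (e.g., scikit-learn) would replace the hand-rolled loop.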

Phase 3: Atropos Environment

  • hermes_issue_env.py: BaseEnv subclass
  • Reward function with immediate + delayed signals
  • Dispatch model training on GPU server
  • Evaluation pipeline

Phase 4: Online RL

  • Deploy trained dispatch model on RPi5 (quantized)
  • Replace --match heuristic with model inference
  • Continuous online training via Atropos
  • A/B testing: model dispatch vs heuristic dispatch

Open Questions

  1. Model size: Can a quantized 7B model run on RPi5 for inference? 4GB RAM is tight with the 512MB container limit. May need a separate inference service.

  2. Delayed reward attribution: When a PR is merged days later, how do we attribute the reward back to the specific trajectory? Atropos supports offline scoring, but the pipeline needs to be built.

  3. Exploration vs exploitation: Early on, the model should try different dispatch strategies (exploration). Later, it should converge on what works (exploitation). The temperature parameter and issue sampling strategy control this.

  4. Safety: The dispatch model decides whether to attempt an issue. A bad model could either attempt everything (wasting compute) or nothing (starving the pipeline). The 3-attempt escalation limit provides a safety floor.

  5. Cold start: Until enough data accumulates, the heuristic-based --match and hard-coded retry limit are fine. The RL environment enhances, not replaces, the existing system.
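On the exploration question (point 3 above), one minimal control is an annealed sampling temperature; the schedule shape and constants here are illustrative, not part of the current design:

```python
import math

def exploration_temperature(step, t_start=1.0, t_min=0.3, decay=1e-3):
    """Anneal sampling temperature from t_start toward t_min.
    Early training steps sample diverse dispatch strategies (exploration);
    later steps converge on what has worked (exploitation)."""
    return max(t_min, t_start * math.exp(-decay * step))
```

The floor t_min keeps a residual amount of exploration so the model can still adapt when issue distributions shift.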