Reinforcement Learning Environment for Hermes

Design document for an Atropos-based RL environment that trains a dispatch/prompting model from issue worker outcomes.

Motivation

The issue worker generates natural RL signal on every run: an issue (prompt) goes in, an agent produces code (trajectory), and the outcome is scored (PR merged, no commits, escalated). This data is already captured by telemetry.py and retrospective.py. An RL environment formalizes this feedback loop to train a model that improves dispatch decisions over time.

The Broader Shift: Token Economics

A16z argues (There Are Only Two Paths Left for Software) that software economics are reorganizing around AI agents that consume products via tokens rather than seats. Engineers will manage 20-30 agents simultaneously, spending ~$1000/month per engineer on token access.

This system is a concrete instance of that thesis. One human (goern) manages an autonomous agent (hermes) that dispatches coding agents (claude) to resolve issues across repos. The economics:

  • Seat cost: Zero. The Claude Max subscription is flat-rate, not per-seat.
  • Token cost: The dispatch model runs on cheap tokens (haiku for hermes gateway). The expensive tokens (Claude for coding) are covered by subscription.
  • Human cost: Proportional to escalation rate. As the RL improves the dispatch model, escalations decrease, and the human's time shifts from reviewing agent output to writing better issue descriptions.

The RL environment is the mechanism that drives this system from "human manages agents" toward "agents manage themselves, human sets direction." Each improvement in autonomous resolution rate is a direct reduction in per-issue human cost: the same dynamic a16z describes as "your customers' first and most obvious source of AI savings is labor efficiency."

The reward function encodes this: clean merges (high result score) reduce human review time; productive follow-on issues (high outcome score) mean the agent is generating compounding value, not just completing tasks.

What Gets Trained

Not Claude. We can't fine-tune the Claude Code CLI. Instead, the RL environment trains a small local dispatch model (e.g., Qwen 2.5 7B on a GPU server) that optimizes:

  1. Prompt construction: what context to include for each issue type
  2. Agent selection: which agent to dispatch (claude, researcher, reviewer)
  3. Retry vs escalate: optimal attempt budget per issue type
  4. Issue quality prediction: pre-dispatch success likelihood (quality gate)

The trained model replaces the current keyword-matching heuristic in run-agent.sh --match and the hard-coded 3-attempt limit.

Business-Level Impact

The Outputs → Results → Outcomes chain doesn't stop at the codebase. There is a fourth layer: the business outcome that the RL system ultimately serves.

Outputs  →  Results    →  Outcomes         →  Business Impact
(commits)   (PR merged)   (issue resolved)    (velocity, cost, reliability)

The RL environment improves the dispatch model, which improves agent success rates, which reduces three business-level costs:

  1. Human review time. Every PR that needs human edits costs reviewer hours. A model that learns to produce clean merges directly reduces the review burden. Measurable as: time between PR creation and merge, trending downward.

  2. Issue throughput. The current system processes one issue per 30-minute timer tick, with a 60% first-attempt success rate. Improving prompt construction and agent selection increases the number of issues resolved per day without adding compute. Measurable as: issues closed per week with the hermes-review label.

  3. Escalation cost. Every human-required escalation means the autonomous system failed and a human must context-switch to understand and resolve the issue. The quality gate (trained by RL) reduces wasted attempts by predicting failure before spending 20 minutes of compute. Measurable as: escalation rate trending toward zero.
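A back-of-envelope check on the throughput ceiling implied above, using the numbers quoted in the text (this is an upper bound; retries also consume timer ticks, so real throughput lands lower):

```python
# Upper bound on daily resolutions under the current 30-minute timer,
# at the 60% first-attempt success rate quoted above.
ticks_per_day = 24 * 60 // 30          # one dispatch opportunity per tick
first_attempt_rate = 0.6
max_resolved_per_day = round(ticks_per_day * first_attempt_rate, 1)  # 28.8
```

Improving the first-attempt rate is the only lever that raises this ceiling without adding compute, which is why the dispatch model targets it directly.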

The RL loop creates a flywheel: better dispatch → more clean merges → more outcome data → better reward signal → better dispatch. The business metric that captures this is autonomous resolution rate: the percentage of hermes-ready issues that reach hermes-review (PR created) without human intervention. The target is >80%.
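The metric itself is cheap to compute from per-issue dispatch records; a sketch, with illustrative field names rather than the real telemetry schema:

```python
def autonomous_resolution_rate(dispatched):
    """Fraction of hermes-ready issues that reached hermes-review (PR created)
    with no human intervention. Record fields are illustrative."""
    if not dispatched:
        return 0.0
    auto = sum(1 for d in dispatched if d["reached_review"] and not d["escalated"])
    return auto / len(dispatched)

issues = [
    {"reached_review": True,  "escalated": False},
    {"reached_review": True,  "escalated": False},
    {"reached_review": True,  "escalated": True},   # human had to step in
    {"reached_review": False, "escalated": True},
]
rate = autonomous_resolution_rate(issues)  # 0.5, well below the >80% target
```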

Mapping to Atropos Concepts

Atropos Concept        Hermes Equivalent
---------------        -----------------
Environment            HermesIssueEnv: fetches issues, dispatches agents, scores outcomes
Item (prompt)          Codeberg issue title + body + repo metadata
Trajectory (rollout)   Agent's response: code changes, commits, PR
Reward signal          Multi-signal: immediate (syntax, structure) + delayed (PR merge)
Group                  Multiple attempts on the same issue (GRPO-style)
Metadata               Telemetry JSON blob from telemetry.py

Environment Design

Config

from pydantic import Field
from atroposlib.envs import BaseEnv, BaseEnvConfig

class HermesIssueEnvConfig(BaseEnvConfig):
    codeberg_repos: str = Field(
        default="brenner-axiom/hermes-test-sandbox",
        description="Space-separated list of repos to scan",
    )
    codeberg_token: str = Field(default="", description="Codeberg API token")
    honcho_workspace: str = Field(default="hermes", description="Honcho workspace")
    max_issue_tokens: int = Field(default=2048, description="Max tokens for issue text")
    lookback_days: int = Field(default=7, description="Days to look back for delayed rewards")
    use_delayed_rewards: bool = Field(default=True, description="Include PR merge signal")

class HermesIssueEnv(BaseEnv):
    name = "hermes-issue-worker"
    env_config_cls = HermesIssueEnvConfig

Data Flow

┌──────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Codeberg   │     │  HermesIssueEnv  │     │ Atropos Trainer │
│   Issues     │────▶│  (RPi5 or local) │────▶│  (GPU server)   │
│              │     │                  │     │                 │
│  hermes-ready│     │  get_next_item() │     │  Receives:      │
│  label       │     │  score_response()│     │  - tokens       │
└──────────────┘     │  collect_traj()  │     │  - masked_tokens│
                     └──────────────────┘     │  - logprobs     │
                              ▲               │  - rewards      │
                              │               └────────┬────────┘
                     ┌────────┴─────────┐              │
                     │  Delayed Reward  │              │
                     │  (retrospective) │       ┌──────▼──────┐
                     │                  │       │  Trained    │
                     │  PR merged: +0.7 │       │  dispatch   │
                     │  PR rejected:-0.3│       │  model      │
                     │  Human edit: -0.3│       └─────────────┘
                     └──────────────────┘

get_next_item: Issue Fetcher

Fetches the oldest open issue with hermes-ready label from configured repos. Returns the issue as a structured item with title, body, labels, and repo metadata. Returns None when no issues are available (environment pauses).

async def get_next_item(self):
    for repo in self.config.codeberg_repos.split():
        issues = await self.codeberg_api(
            "GET",
            f"/repos/{repo}/issues"
            f"?labels=hermes-ready&state=open&sort=created&direction=asc&limit=1"
        )
        if issues:
            issue = issues[0]
            return {
                "repo": repo,
                "issue_id": issue["number"],
                "title": issue["title"],
                "body": issue["body"] or "",
                "labels": [l["name"] for l in issue.get("labels", [])],
                "repo_file_count": await self.get_repo_file_count(repo),
            }
    return None

collect_trajectory: Agent Dispatch + Scoring

Constructs a prompt from the issue, sends it to the model being trained (the dispatch model), and scores the output. The dispatch model generates a structured decision: which agent, what prompt enrichment, and what context to include.

async def collect_trajectory(self, item):
    # The dispatch model generates the agent invocation strategy
    dispatch_prompt = self.build_dispatch_prompt(item)

    async with self.server.managed_server(tokenizer=self.tokenizer) as managed:
        completion = await managed.chat_completion(
            messages=[
                {"role": "system", "content": DISPATCH_SYSTEM_PROMPT},
                {"role": "user", "content": dispatch_prompt},
            ],
            n=1,
            max_tokens=2048,
            temperature=0.7,
        )

        state = managed.get_state()
        node = state["nodes"][0]
        decision = completion.choices[0].message.content

        # Execute the decision (actually run the agent)
        outcome = await self.execute_dispatch(item, decision)

        # Score based on outcome
        reward = self.compute_reward(item, decision, outcome)

        return ScoredDataItem(
            tokens=node.tokens,
            masked_tokens=node.masked_tokens,
            logprobs=node.logprobs,
            score=reward,
        ), []

Reward Function

The reward function maps to the Outputs → Results → Outcomes causal chain (reference). Each step moves further from agent control and closer to real-world impact:

Outputs → Results → Outcomes
(What the agent delivered) → (What it produced) → (What changed because of it)

Reward = Output Score + Result Score + Outcome Score
Layer     Timing       Agent Control   Examples
-----     ------       -------------   --------
Output    Immediate    Full            Commits, PR created, code compiles
Result    Hours        Partial         PR merged, tests pass in CI, no human edits needed
Outcome   Days-weeks   Indirect        Issue resolved, follow-on work unblocked, codebase improved

Every dispatch carries an implicit hypothesis:

If we deliver [code changes] (output), we expect [a clean PR merge] (result), which should drive [the issue being resolved and the codebase improving] (outcome).

A break anywhere in the chain signals failure: commits without a merge (output without result), or a merge that requires human fixes (result without clean outcome).

Output Signals (immediate, under agent control)

Signal                          Reward   Condition
------                          ------   ---------
Agent completed without error    +0.1    exit_code == 0
Commits were made                +0.2    commits > 0
PR was created                   +0.1    pr_url is not None
Reasonable time spent            +0.1    30s < elapsed < 600s
Code compiles/parses             +0.1    syntax check passes
Issue referenced in commit       +0.1    commit message contains #N
Agent was blocked                -0.2    blocked == true
Agent timed out                  -0.3    outcome == timed_out
No output produced               -0.2    outcome == no_commits and no findings

Result Signals (hours later, partially under agent control)

Results measure whether the output was adopted: did the PR merge cleanly? The agent can influence this by producing correct, well-tested code, but the human reviewer is the gatekeeper.

Signal                        Reward   Condition
------                        ------   ---------
PR merged without changes      +0.7    merged and not human_modified
PR merged with human edits     -0.3    merged but human had to fix it
PR closed (rejected)           -0.5    closed without merge
First-attempt success          +0.2    bonus: merged on attempt 1

Human edits are negative. If a human had to modify the PR before merging, the agent's output was incomplete or incorrect. The model should learn to produce PRs that merge without intervention. A merge with edits is an output that produced a result, but not a clean one.

Outcome Signals (days-weeks later, indirect agent influence)

Outcomes measure the meaningful change: was the issue actually resolved? Did the work improve the codebase? Did it unblock further progress? These are lagging indicators influenced by many factors beyond the agent's control.

Signal                           Reward   Condition
------                           ------   ---------
Issue closed (resolved)           +0.1    issue state == closed after PR merge
Issue still open after 7 days     -0.1    stale despite PR being merged
Spawned follow-on issues          +0.3    issues referencing this one exist
Follow-on issues merged easily    +0.2    bonus: follow-ons merged on attempt 1
Codebase regression               -0.4    follow-on issues are bug fixes for this PR

Follow-on issues are positive. Good PRs sometimes spawn follow-on work (tests, docs, refactoring). If those follow-on issues are resolved easily (first-attempt merge), the original PR set up the codebase well: the agent made good architectural decisions.

Regressions are strongly negative. If follow-on issues are bug fixes for code introduced by this PR, the agent introduced defects. The distinction between "spawned productive follow-on work" and "caused bugs that needed fixing" is the difference between an output that drove positive outcomes and one that drove negative ones.

def compute_output_reward(self, outcome):
    """Score the deliverable itself. Fully under agent control."""
    reward = 0.0

    if outcome["exit_code"] == 0:
        reward += 0.1
    if outcome["commits"] > 0:
        reward += 0.2
    if outcome.get("pr_url"):
        reward += 0.1
    if 30 < outcome["elapsed_seconds"] < 600:
        reward += 0.1
    if outcome.get("syntax_ok"):
        reward += 0.1  # code compiles/parses (field name illustrative)
    if outcome.get("issue_referenced"):
        reward += 0.1  # commit message contains #N (field name illustrative)
    if outcome["outcome"] == "blocked":
        reward -= 0.2
    if outcome["outcome"] == "timed_out":
        reward -= 0.3
    if outcome["outcome"] == "no_commits" and outcome["findings"] == 0:
        reward -= 0.2

    return max(min(reward, 1.0), -1.0)

def compute_result_reward(self, telemetry, pr_data):
    """Score whether the output was adopted. Partially under agent control."""
    reward = 0.0

    if pr_data and pr_data.get("merged"):
        if pr_data.get("human_modified"):
            # Output produced a result, but not a clean one
            reward -= 0.3
        else:
            # Clean adoption: the output -> result chain stayed intact
            reward += 0.7
        if telemetry["attempt"] == 1:
            reward += 0.2  # First-attempt bonus
    elif pr_data and pr_data.get("state") == "closed":
        # Output rejected: chain broken at the result layer
        reward -= 0.5

    return reward

def compute_outcome_reward(self, issue_data, follow_on_issues=None):
    """Score the meaningful change. Indirect agent influence."""
    reward = 0.0

    # Was the issue actually resolved?
    if issue_data.get("state") == "closed":
        reward += 0.1
    else:
        # Still open; this function is called after the lookback window,
        # so an open issue counts as stale despite the merge
        reward -= 0.1

    if follow_on_issues:
        # Classify follow-ons: productive work vs regressions
        bug_fixes = [
            f for f in follow_on_issues
            if any(l in f.get("labels", []) for l in ["bug", "fix", "regression"])
        ]
        productive = [f for f in follow_on_issues if f not in bug_fixes]

        if productive:
            reward += 0.3  # Spawned productive follow-on work
            easy_merges = sum(
                1 for f in productive
                if f.get("merged_on_attempt", 99) == 1
            )
            if easy_merges > 0:
                reward += 0.2  # Follow-ons merged easily (good architecture)

        if bug_fixes:
            reward -= 0.4  # Introduced regressions (negative outcome)

    return reward

def compute_total_reward(self, outcome, telemetry, pr_data,
                         issue_data, follow_on_issues=None):
    """Total reward across the Outputs β†’ Results β†’ Outcomes chain.

    Hypothesis: If we deliver [code changes] (output), we expect
    [a clean PR merge] (result), which should drive [the issue
    being resolved and the codebase improving] (outcome).
    """
    output_r = self.compute_output_reward(outcome)
    result_r = self.compute_result_reward(telemetry, pr_data)
    outcome_r = self.compute_outcome_reward(issue_data, follow_on_issues)

    return output_r + result_r + outcome_r

The three reward functions correspond to three questions:

  • Output: What did the agent deliver? (commits, PR, code quality)
  • Result: What did the output produce? (clean merge, or human had to fix it)
  • Outcome: What changed because of it? (issue resolved, codebase improved or regressed)
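The three layers can be made concrete with a worked example, using signal values from the tables above (plain scenario arithmetic, not the scoring code itself):

```python
# Scenario A: clean first-attempt merge that closes the issue.
output_a  = 0.1 + 0.2 + 0.1 + 0.1   # exit 0, commits, PR created, reasonable time
result_a  = 0.7 + 0.2               # merged without edits, first-attempt bonus
outcome_a = 0.1                     # issue closed after merge
total_a = output_a + result_a + outcome_a   # about 1.5

# Scenario B: same output, but a human had to fix the PR and a
# regression issue appeared later.
output_b  = 0.1 + 0.2 + 0.1 + 0.1
result_b  = -0.3 + 0.2              # merged with human edits, still attempt 1
outcome_b = 0.1 - 0.4               # issue closed, but follow-ons were bug fixes
total_b = output_b + result_b + outcome_b   # about 0.1
```

The clean trajectory outscores the messy one by 1.4 even though both ended in a merged PR; that spread is the gradient pushing the dispatch model toward strategies that merge cleanly.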

Dispatch Model Decision Format

The model being trained outputs structured JSON:

{
  "agent": "claude",
  "context_strategy": "include_file_listing",
  "prompt_enrichment": [
    "List existing files before making changes",
    "Run tests after modifying code"
  ],
  "estimated_difficulty": "medium",
  "should_attempt": true,
  "confidence": 0.75,
  "reasoning": "Issue asks for dependency migration, needs file context"
}

If should_attempt is false, the environment skips the dispatch and reports hermes-needs-clarification; this is the quality gate.
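One way the environment might consume this decision (a sketch; the JSON fields and label name come from the text above, while the function and return shape are illustrative):

```python
import json

def apply_quality_gate(decision_text):
    """Parse the dispatch model's JSON decision and gate the dispatch.
    Returns (action, detail); the shape is illustrative."""
    decision = json.loads(decision_text)
    if not decision.get("should_attempt", False):
        # Quality gate: skip the dispatch, flag the issue for a human rewrite
        return ("skip", "hermes-needs-clarification")
    return ("dispatch", decision["agent"])

action, detail = apply_quality_gate(
    '{"agent": "claude", "should_attempt": true, "confidence": 0.75}'
)
# action == "dispatch", detail == "claude"
```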

Training Modes

Online (Full Loop)

The environment runs on the RPi5, fetches real issues, dispatches real agents, and sends scored trajectories to a remote Atropos trainer. This requires:

  • Atropos server on a GPU machine
  • Network connectivity RPi5 ↔ trainer
  • Real Codeberg issues being processed
  • Tolerance for slow iteration (~30min per issue)

Offline (Batch Learning)

retrospective.py already collects telemetry + PR outcomes. Export this as a dataset and train offline:

  1. Export all telemetry JSON blobs from Codeberg issue comments
  2. Join with PR merge/reject outcomes
  3. Construct ScoredDataGroup entries
  4. Train the dispatch model on historical data

This is faster (no waiting for real issues) and lower risk (no real PRs created).

The two modes combine into a staged rollout:

  1. Phase 1: Collect telemetry for 50-100 issues (current system, no changes)
  2. Phase 2: Train offline on collected data, validate quality gate predictions
  3. Phase 3: Deploy trained model as the dispatch decision-maker
  4. Phase 4: Switch to online RL with Atropos for continuous improvement

Data Pipeline

Codeberg Issues
     │
     ▼
hermes-issue-worker.sh → telemetry.py → Codeberg comments (JSON)
                                      → Honcho sessions
     │
     ▼ (daily)
retrospective.py → lessons → Honcho memory
                 → digest  → Codeberg tracking issue
     │
     ▼ (export)
export_training_data.py → ScoredDataGroup JSONL
     │
     ▼
Atropos trainer → updated dispatch model
     │
     ▼
quality_gate.py (uses trained model for predictions)

Export Script

# export_training_data.py: extract training data from Codeberg telemetry
def export_scored_groups(repos, output_path):
    """Export telemetry + outcomes as Atropos-compatible JSONL."""
    for repo in repos:
        issues = get_all_issues_with_telemetry(repo)
        for issue in issues:
            telemetry_entries = parse_telemetry_comments(issue)
            pr = find_linked_pr(issue)

            for entry in telemetry_entries:
                prompt = build_dispatch_prompt(issue)
                immediate_reward = compute_reward_from_telemetry(entry)
                delayed_reward = compute_delayed_reward(entry, pr)

                scored_item = {
                    "prompt": prompt,
                    "response": entry,
                    "immediate_reward": immediate_reward,
                    "delayed_reward": delayed_reward,
                    "total_reward": immediate_reward + delayed_reward,
                    "metadata": {
                        "repo": repo,
                        "issue_id": issue["number"],
                        "attempt": entry["attempt"],
                        "outcome": entry["outcome"],
                    },
                }
                write_jsonl(output_path, scored_item)

Infrastructure Requirements

Component         Where                  Resources
---------         -----                  ---------
HermesIssueEnv    RPi5 or local machine  Minimal (API calls only)
Atropos trainer   GPU server             1x GPU (A100/H100 for 7B model)
Dispatch model    RPi5 (inference)       ~4GB RAM for quantized 7B
Codeberg API      External               Rate-limited, use caching
Honcho            External (managed)     Included in plan
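The ~4GB figure is a weights-only back-of-envelope for 4-bit quantization; KV cache and runtime overhead come on top:

```python
params = 7e9                 # 7B parameter model
bytes_per_param = 0.5        # 4-bit quantization: half a byte per weight
weights_gb = params * bytes_per_param / 1e9   # 3.5 GB for weights alone
```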

Evaluation

async def evaluate(self):
    """Periodic evaluation: accuracy of dispatch decisions."""
    # Fetch recent outcomes from Codeberg (telemetry joined with PR state);
    # the record fields below are illustrative
    recent = get_recent_completed_issues(days=7)

    total = len(recent)
    merged = [r for r in recent if r.get("merged")]
    metrics = {
        "success_rate": len(merged) / total,
        "first_attempt_rate": sum(1 for r in merged if r["attempt"] == 1) / max(len(merged), 1),
        "escalation_rate": sum(1 for r in recent if r.get("escalated")) / total,
        "avg_attempts": sum(r["attempt"] for r in recent) / total,
        "avg_time_to_merge": sum(r["merge_hours"] for r in merged) / max(len(merged), 1),
    }

    self.wandb_log(metrics)

Implementation Phases

Phase 1: Data Collection (current, in progress)

  • telemetry.py captures per-attempt data
  • retrospective.py generates daily lessons
  • Honcho stores cross-session context
  • Accumulate 50+ issues of telemetry

Phase 2: Offline Analysis

  • export_training_data.py: extract telemetry as JSONL dataset
  • Analyze success/failure correlations (prompt length, issue labels, etc.)
  • Train simple classifier (logistic regression or small transformer)
  • Deploy as quality_gate.py (#4)
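Phase 2's "simple classifier" can be sketched as a stdlib-only logistic regression; the feature names and toy data here are illustrative, not the real telemetry schema:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_quality_gate(rows, epochs=200, lr=0.1):
    """Plain SGD logistic regression. rows: [(features, merged_0_or_1), ...]"""
    n = len(rows[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in rows:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y                               # gradient of log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Probability the issue merges; gate dispatch on a threshold."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy rows: [prompt_length_kb, has_repro_steps, label_count] -> merged?
data = [
    ([0.5, 1, 2], 1),
    ([3.0, 0, 0], 0),
    ([0.8, 1, 1], 1),
    ([2.5, 0, 3], 0),
]
w, b = train_quality_gate(data)
```

In practice the features would come from the parsed telemetry comments, and a library implementation (e.g., scikit-learn) would replace the hand-rolled loop.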

Phase 3: Atropos Environment

  • hermes_issue_env.py: BaseEnv subclass
  • Reward function with immediate + delayed signals
  • Dispatch model training on GPU server
  • Evaluation pipeline

Phase 4: Online RL

  • Deploy trained dispatch model on RPi5 (quantized)
  • Replace --match heuristic with model inference
  • Continuous online training via Atropos
  • A/B testing: model dispatch vs heuristic dispatch

Open Questions

  1. Model size: Can a quantized 7B model run on RPi5 for inference? 4GB RAM is tight with the 512MB container limit. May need a separate inference service.

  2. Delayed reward attribution: When a PR is merged days later, how do we attribute the reward back to the specific trajectory? Atropos supports offline scoring, but the pipeline needs to be built.

  3. Exploration vs exploitation: Early on, the model should try different dispatch strategies (exploration). Later, it should converge on what works (exploitation). The temperature parameter and issue sampling strategy control this.

  4. Safety: The dispatch model decides whether to attempt an issue. A bad model could either attempt everything (wasting compute) or nothing (starving the pipeline). The 3-attempt escalation limit provides a safety floor.

  5. Cold start: Until enough data accumulates, the heuristic-based --match and hard-coded retry limit are fine. The RL environment enhances, not replaces, the existing system.
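On the exploration question (point 3 above), one minimal control is an annealed sampling temperature; the schedule shape and constants here are illustrative, not part of the current design:

```python
import math

def exploration_temperature(step, t_start=1.0, t_min=0.3, decay=1e-3):
    """Anneal sampling temperature from t_start toward t_min.
    Early training steps sample diverse dispatch strategies (exploration);
    later steps converge on what has worked (exploitation)."""
    return max(t_min, t_start * math.exp(-decay * step))
```

The floor t_min keeps a residual amount of exploration so the model can still adapt when issue distributions shift.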