

What does an end-to-end stack for terminal agents look like when you combine structured toolkits, synthetic RL environments, and benchmark-aligned evaluation? A team of researchers from CAMEL AI, Eigent AI and other collaborators have released SETA, a toolkit and environment stack that focuses on reinforcement learning for terminal agents. The project targets agents that operate inside a Unix-style shell and must complete verifiable tasks under a benchmark harness such as Terminal Bench.

Three main contributions:

  • A state-of-the-art terminal agent on Terminal Bench: They achieve state-of-the-art performance with a Claude Sonnet 4.5 based agent on Terminal Bench 2.0 and with a GPT 4.1 based agent on Terminal Bench 1.0. The comparison is restricted to agents that use the same base model.
  • Scalable RL training with synthetic terminal environments: The research team releases an initial synthetic dataset with 400 terminal tasks that cover a range of difficulty levels. Out of these, 260 tasks are used for RLVR fine-tuning of a Qwen3-8B model.
  • A clean agent design that generalizes across training and evaluation frameworks: The same agent implementation is used for both local task runs and the official Terminal Bench evaluation harness.

Terminal Toolkit and log structure

The SETA code repository showcases a Terminal Toolkit that turns a language model into an executable terminal agent. For each task run, the framework creates a structured log directory under evaluation/terminal_bench_run. The README shows a concrete layout for a task called play-zork.

Key files include:

  • chatagent.log, which records the full history of agent messages and tool calls along with test results.
  • A sessions directory with session_logs that capture terminal interactions from the toolkit.
  • Inside session_logs, files such as blocking_commands.log, session_run_zork_1_correct_path.log, session_zork-1.log, and session_zork_start.log store command output for different sessions and modes.
  • tests.log and tests.log.strip, which record the test run output, with the latter removing terminal control characters.

This structure gives a concrete way to debug an agent. You can trace from high-level chat decisions in chatagent.log down to individual shell commands in the session logs, and confirm success or failure from the test logs.
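To make that debugging loop concrete, here is a minimal sketch of a log summary script, assuming the directory layout described above; the traversal logic and the sessions/session_logs nesting are illustrative assumptions, not part of the SETA codebase:

```python
from pathlib import Path

# Hypothetical helper that summarizes the log directory SETA writes for one
# task run. File names follow the layout described above; the traversal logic
# and the sessions/session_logs nesting are assumptions, not SETA code.
RUN_DIR = Path("evaluation/terminal_bench_run/play-zork")  # assumed run path

def summarize_run(run_dir: Path) -> None:
    chat_log = run_dir / "chatagent.log"
    if chat_log.exists():
        n_lines = len(chat_log.read_text(errors="replace").splitlines())
        print(f"chatagent.log: {n_lines} lines of agent messages and tool calls")

    # Each session log stores raw command output for one terminal session.
    for session_log in sorted(run_dir.glob("sessions/session_logs/*.log")):
        print(f"  session: {session_log.name} ({session_log.stat().st_size / 1024:.1f} KiB)")

    # tests.log.strip is the test output with control characters removed,
    # so it is the easiest file to search for pass/fail markers.
    test_log = run_dir / "tests.log.strip"
    if test_log.exists():
        verdict = "likely passed" if "passed" in test_log.read_text(errors="replace").lower() else "inspect manually"
        print(f"tests.log.strip: {verdict}")

if __name__ == "__main__":
    summarize_run(RUN_DIR)
```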

For official Terminal Bench evaluation, the GitHub repository provides a separate entry point under evaluation/terminal_bench_eval. A developer moves into that directory and runs run_eval.sh for Terminal Bench 1.0 or run_tb2.sh for Terminal Bench 2.0.

Results are written into evaluation/terminal_bench_eval/run/{run_id}/results.json. Task-specific session logs are placed under evaluation/terminal_bench_eval/logs/camel_logs/{task_id}. The agent class that binds the CAMEL agent to the benchmark is implemented in tbench_camel_agent.py.
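As a rough illustration of how this fits into an automated workflow, the following Python sketch drives the evaluation script and reads back the newest results.json. The script name and output paths come from the repository description; the driver logic itself is assumed:

```python
import json
import subprocess
from pathlib import Path

# Illustrative driver for the official evaluation entry point. run_tb2.sh and
# the run/{run_id}/results.json layout come from the repository description;
# everything else here is an assumption for demonstration purposes.
eval_dir = Path("evaluation/terminal_bench_eval")
subprocess.run(["bash", "run_tb2.sh"], cwd=eval_dir, check=True)  # Terminal Bench 2.0

# Pick the most recently modified run directory and inspect its results.json.
run_dirs = sorted(
    (p for p in (eval_dir / "run").iterdir() if p.is_dir()),
    key=lambda p: p.stat().st_mtime,
)
results = json.loads((run_dirs[-1] / "results.json").read_text())
print(json.dumps(results, indent=2)[:500])  # preview only; the exact schema is not documented here
```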

Note Taking Toolkit as persistent memory

The research team also introduces a Note Taking Toolkit, described as persistent memory for long-horizon tasks. They show example note-taking tool calls where the agent writes and reads notes in a structured way while solving terminal tasks. The current public material focuses on the existence of this toolkit and examples of its use. It does not yet describe a full training objective for note usage.

The important point is that the agent has an explicit channel where it can externalize intermediate results and hints, separate from the raw terminal buffer.
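Since the public material does not spell out the toolkit's API, the sketch below only illustrates the general shape of such a channel; the NoteStore class and the write_note / read_notes names are hypothetical:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a note channel kept outside the raw terminal buffer.
# The CAMEL Note Taking Toolkit's real API is not shown in the public
# material; NoteStore, write_note, and read_notes are invented names.
@dataclass
class NoteStore:
    notes: dict[str, str] = field(default_factory=dict)

    def write_note(self, key: str, content: str) -> str:
        # The agent externalizes an intermediate result under a stable key.
        self.notes[key] = content
        return f"saved note '{key}'"

    def read_notes(self) -> str:
        # Later turns re-read notes instead of re-deriving them from terminal scrollback.
        return "\n".join(f"[{k}] {v}" for k, v in self.notes.items())

store = NoteStore()
store.write_note("zork_path", "west -> north -> take lamp -> down")
print(store.read_notes())
```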

Understanding the performance

SETA's agent harness achieves leading results on Terminal Bench. With Claude Sonnet 4.5 as the backbone, the CAMEL terminal agent reaches 46.5% accuracy on Terminal Bench 2.0 across 89 real-world tasks, ranking first and outperforming the second system by 3 percentage points, with especially strong results in git workflows, DevOps automation, and code security tasks. On Terminal Bench 1.0, a GPT 4.1 based agent attains 35% accuracy, which is 4.7 percentage points above the next entry, again within the same model family. In comparison, a supervised Qwen3 8B baseline attains 3.4% on Terminal Bench 2.0, and the Qwen3 8B terminal agent trained with the SETA RL pipeline improves over this baseline on the curated synthetic environments.

Key Takeaways

  • SETA is a joint community project that provides both agent toolkits and synthetic RL environments specifically for terminal agents, aligned with the Terminal Bench evaluation format.
  • The framework reports state-of-the-art performance for CAMEL terminal agents on Terminal Bench 1.0 and 2.0 when using Claude Sonnet 4.5 and GPT 4.1 as the base models, evaluated against agents built on the same model families.
  • The SETA RL dataset on Hugging Face contains 400 synthetic terminal tasks, each packaged as task.yaml, Dockerfile, and run-tests.sh, with 260 tasks used for RLVR fine-tuning of a Qwen3-8B based agent (see the sketch after this list).
  • The open-source SETA codebase exposes a Terminal Toolkit with structured logging and a Note Taking Toolkit for long-horizon memory, and integrates directly with Terminal Bench evaluation scripts and logging paths in the seta GitHub repository.
  • The overall design demonstrates a clean path from synthetic RL environments to benchmark-verified agents, giving developers a reproducible stack to train, debug, and evaluate terminal agents rather than relying on ad hoc tool-calling examples.
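Following up on the dataset takeaway above, here is a small sanity-check sketch over a locally downloaded copy of the task set; only the three per-task files named above are assumed, and the tasks/ directory name is a placeholder:

```python
from pathlib import Path

# Sanity check over a local copy of the synthetic task set. Only the three
# per-task files named above are assumed; tasks/ is a placeholder directory.
REQUIRED = ("task.yaml", "Dockerfile", "run-tests.sh")

def check_tasks(root: Path) -> None:
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = [name for name in REQUIRED if not (task_dir / name).exists()]
        print(f"{task_dir.name}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")

check_tasks(Path("tasks"))
```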

Check out the Blog, Technical details, GitHub Repo and Weights.



Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
