AI evaluation goes explicit: learned rubrics cut text-to-image preference labeling to under 0.01%
AutoRubric-T2I teaches a vision‑language judge to grade images with learned checklists, outperforming prior reward models while using under 0.01% of human preference data. New papers also push execution‑grounded coding agents and steadier long‑context attention.
One-Line Summary
Evaluation is turning into code: learned rubrics replace black-box rewards, execution-grounded tests guide agents, and linear attention cleans up long-context memory.
Research Papers
AutoRubric-T2I: learned rubrics cut human labels to under 0.01%
AutoRubric-T2I teaches a vision-language model (VLM) judge to grade text-to-image (T2I) outputs with explicit, learned checklists rather than opaque scalar scores, so models can be evaluated and trained with less than 0.01% of the usual human preference labels. 1
Instead of training a Bradley–Terry (BT) preference model on huge datasets, the system synthesizes reasoning traces from image preference pairs into candidate rubrics, asks the VLM judge to score images under each rubric, and then learns preferences from the pairwise score gaps; an L1‑regularized logistic refiner picks the Top‑N most discriminative rules. 1
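The refiner step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the rubric score gaps are synthetic random numbers rather than VLM judge outputs, and the function names, the regularization strength `lam`, and the plain proximal-gradient solver are all hypothetical choices made for the example.

```python
import numpy as np

def fit_l1_logistic(gaps, labels, lam=0.05, lr=0.1, steps=2000):
    """Fit an L1-regularized logistic model with proximal gradient descent.

    gaps:   (n_pairs, n_rubrics) array of score(A) - score(B) under each rubric.
    labels: (n_pairs,) array, 1.0 if annotators preferred image A, else 0.0.
    """
    n_pairs, n_rubrics = gaps.shape
    w = np.zeros(n_rubrics)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(gaps @ w)))   # predicted P(A preferred)
        grad = gaps.T @ (p - labels) / n_pairs  # gradient of the mean log-loss
        w = w - lr * grad
        # Soft-thresholding is the proximal step for the L1 penalty;
        # it pushes uninformative rubric weights to exactly zero.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

def top_n_rubrics(weights, n):
    """Indices of the n rubrics with the largest nonzero absolute weight."""
    order = np.argsort(-np.abs(weights))
    return [int(i) for i in order[:n] if weights[i] != 0.0]

# Synthetic stand-in for judge scores (assumption): 6 candidate rubrics,
# of which only rubrics 0 and 3 actually drive the preference labels.
rng = np.random.default_rng(0)
n_pairs, n_rubrics = 400, 6
gaps = rng.normal(size=(n_pairs, n_rubrics))
signal = 2.0 * gaps[:, 0] + 1.5 * gaps[:, 3]
labels = (signal + 0.1 * rng.normal(size=n_pairs) > 0).astype(float)

weights = fit_l1_logistic(gaps, labels)
selected = top_n_rubrics(weights, n=2)
print("rubric weights:", np.round(weights, 2))
print("selected rubrics:", sorted(selected))
```

The point of the L1 penalty in a setup like this is that weak or redundant rubrics drop out of the model entirely, so the surviving Top-N list doubles as a readable explanation of what the judge rewards.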
On image‑reward benchmarks such as MMRB2, the learned‑rubric judge outperforms strong reward model baselines, and when used as a reinforcement learning (RL) signal it improves downstream T2I tasks like TIIF and UniGenBench++ using the Flow‑GRPO (Group Relative Policy Optimization) pipeline on diffusion models. 1
The practical win is interpretability: the rules explain why an image "passes," which makes reward shaping and error analysis simpler, and adapting a rubric is cheaper than retraining a black-box reward model. Watch for ports of rubric learning to video and image editing, where fine-grained criteria matter. 1
Rule2DRC: execution-grounded benchmark for chip design rules
Rule2DRC is a benchmark that tests whether large language model (LLM) agents can translate natural-language chip design rules into executable design rule checking (DRC) scripts. It scores them by execution outcomes on layouts, not code similarity, and includes 1,000 rule-to-script tasks and 13,921 evaluation layouts. 2
It also introduces SplitTester, a tester agent that uses execution feedback to generate discriminative test cases that separate previously indistinguishable candidate scripts, which substantially improves Best-of-N selection. The authors release code and an evaluation pipeline that never feeds the evaluation layouts to the agent, so scores reflect functional correctness rather than leakage. 2
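To make the idea of discriminative tests concrete, here is a toy sketch. Everything in it is an assumption made for illustration: real DRC scripts and layouts are replaced by Python predicates over lists of wire widths, the rule ("every width must be at least 3") is invented, and the labeled verdicts used for selection are written by hand. It is not the SplitTester algorithm, only the execution-fingerprint intuition behind it.

```python
from itertools import combinations

def signature(script, layouts):
    """Execution fingerprint: the verdict a script returns on each layout."""
    return tuple(script(layout) for layout in layouts)

def indistinguishable_pairs(scripts, layouts):
    """Pairs of candidate scripts that behave identically on every layout."""
    sigs = {name: signature(fn, layouts) for name, fn in scripts.items()}
    return [(a, b) for a, b in combinations(sorted(scripts), 2) if sigs[a] == sigs[b]]

def splitting_layouts(scripts, pool):
    """Keep only the layouts on which at least two candidates disagree."""
    return [layout for layout in pool
            if len({fn(layout) for fn in scripts.values()}) > 1]

def best_of_n(scripts, labeled_tests):
    """Pick the candidate that agrees with the most labeled verdicts.
    Ties go to the first name in sorted order."""
    def score(name):
        return sum(scripts[name](layout) == verdict for layout, verdict in labeled_tests)
    return max(sorted(scripts), key=score)

# Hypothetical candidates for the rule "every width must be at least 3".
scripts = {
    "cand_0": lambda widths: all(w > 3 for w in widths),   # off-by-one bug
    "cand_1": lambda widths: all(w >= 3 for w in widths),  # intended behavior
    "cand_2": lambda widths: widths[0] >= 3,               # checks first shape only
}

seed = [[5, 6], [1, 2]]          # easy layouts: all candidates agree
pool = [[3, 4], [5, 1], [7, 8]]  # proposed extra layouts
split = splitting_layouts(scripts, pool)

seed_labels = [([5, 6], True), ([1, 2], False)]
full_labels = seed_labels + [([3, 4], True), ([5, 1], False)]

print("tied on seed tests:", indistinguishable_pairs(scripts, seed))
print("splitting layouts:", split)
print("pick with seed tests only:", best_of_n(scripts, seed_labels))
print("pick with splitting tests:", best_of_n(scripts, full_labels))
```

On the seed layouts all three candidates look identical, so selection is effectively a coin flip. The boundary case `[3, 4]` and the second-shape violation `[5, 1]` are what pull them apart.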
MOSS: agents that fix themselves by rewriting source code
MOSS lets autonomous agent systems improve themselves by rewriting their own source code in response to production failures, rather than only tweaking prompts, skills, or workflow graphs. 3
Each evolution runs through a deterministic multi‑stage pipeline: failure evidence is curated, code edits are delegated to a pluggable external coding‑agent command‑line interface (CLI), candidates are validated by replaying the batch in ephemeral trial workers, and promotion requires user consent with health‑probe‑gated rollback. 3
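The control flow of such a pipeline can be sketched as a single function with explicit gates. This is a hypothetical outline, not MOSS's code: the agent's "source" is modeled as a set of task names it can handle, the external coding-agent CLI is replaced by a stub callable, and `evolve`, `Outcome`, and every callback name are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    promoted: bool
    reason: str

def evolve(current, failures, propose_edit, replay, user_consents, health_probe):
    """Run one evolution cycle; return (active_source, Outcome)."""
    # 1. Curate failure evidence: deduplicate and fix the order.
    evidence = sorted(set(failures))
    if not evidence:
        return current, Outcome(False, "no failure evidence")
    # 2. Delegate the code edit (stub standing in for an external coding agent).
    candidate = propose_edit(current, evidence)
    # 3. Validate by replaying the failing batch against both versions.
    if replay(candidate, evidence) <= replay(current, evidence):
        return current, Outcome(False, "candidate did not improve replay score")
    # 4. Promotion requires explicit user consent.
    if not user_consents(candidate):
        return current, Outcome(False, "user declined promotion")
    # 5. Health probe gates the rollout; on failure, fall back to the old source.
    if not health_probe(candidate):
        return current, Outcome(False, "rolled back after failed health probe")
    return candidate, Outcome(True, "promoted")

# Toy stand-ins (assumptions): source = set of tasks the agent handles.
current = frozenset({"summarize"})
failures = ["translate", "schedule", "translate"]

def propose_edit(source, evidence):
    return source | frozenset(evidence)

def replay(source, evidence):
    return sum(task in source for task in evidence) / len(evidence)

promoted_src, ok = evolve(current, failures, propose_edit, replay,
                          user_consents=lambda c: True, health_probe=lambda c: True)
kept_src, rolled_back = evolve(current, failures, propose_edit, replay,
                               user_consents=lambda c: True, health_probe=lambda c: False)
print(ok, sorted(promoted_src))
print(rolled_back, sorted(kept_src))
```

Writing each gate as an early return keeps the failure paths visible: the current source stays active unless every stage passes.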
On OpenClaw, a production agent substrate, one self-rewrite cycle lifts a four-task mean grader score from 0.25 to 0.61 without human intervention, a concrete sign that source-level adaptation can close recurring gaps between intent and behavior. 3
Gated DeltaNet-2: cleaner memory edits for linear attention
Gated DeltaNet‑2 refines linear attention by giving the model separate channel‑wise gates for erasing old content and writing new content in its fixed‑size recurrent memory, reducing interference compared with a single scalar gate that controls both. 4
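A rough sketch of what decoupled erase and write gates could look like on a fixed-size memory follows. The update rule here is an assumption chosen for illustration, not the paper's exact parameterization: it takes the standard delta-rule update and replaces its single scalar step size with two per-channel gate vectors, `erase` and `write`, so that tying both gates to one scalar recovers the ordinary delta rule.

```python
import numpy as np

def memory_step(S, k, v, erase, write):
    """One recurrent update of a (d_k, d_v) associative memory.

    k: (d_k,) key, v: (d_v,) value,
    erase / write: (d_k,) gates in [0, 1] (hypothetical channel-wise gates).
    """
    recalled = k @ S                       # what the memory currently returns for k
    S = S - np.outer(erase * k, recalled)  # remove stale content, per key channel
    S = S + np.outer(write * k, v)         # store new content, per key channel
    return S

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                     # unit-norm key keeps the algebra clean
v = rng.normal(size=d_v)
ones = np.ones(d_k)

# Full erase + full write: reading with k now returns exactly v.
overwritten = memory_step(S, k, v, erase=ones, write=ones)
# No erase + full write: the new value is added on top of the old one.
accumulated = memory_step(S, k, v, erase=np.zeros(d_k), write=ones)
# Tying both gates to one scalar gives the usual delta-rule update.
beta = 0.3
tied = memory_step(S, k, v, erase=beta * ones, write=beta * ones)
delta_rule = (np.eye(d_k) - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)

print(np.allclose(k @ overwritten, v))
print(np.allclose(k @ accumulated, k @ S + v))
print(np.allclose(tied, delta_rule))
```

With a single shared gate, the model cannot write strongly without also erasing strongly. Separating the two lets it, for example, add a new association while leaving most of the old content in place, which is the kind of interference control the paragraph above describes.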
Trained at 1.3B parameters on 100B FineWeb-Edu tokens, it posts the strongest overall results against Mamba-2, Gated DeltaNet, Kimi Delta Attention, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval, with a notable lead on long-context RULER "needle-in-a-haystack" multi-key retrieval. The paper also details a gate-aware backward pass for efficient parallel training. 4
Open Source & Repos
Dust: custom agent platform ships dsbx CLI v0.1.16
Dust is a platform for building custom AI agents for work, and the project ships a dsbx command-line interface (CLI). The latest tagged version is v0.1.16 (commit 03aea9a3f1d5c8a6b13ebafaa2f5417bd8624403), with a Linux x86_64 binary listed. 5
The repository highlights user guides and a developer platform, signaling active iteration on agent tooling; teams exploring workplace agents can start from the docs and track CLI releases to see platform capabilities expand. 5
Why It Matters
Across these results, evaluation becomes explicit and executable: learned rubrics make reward signals interpretable and data‑efficient, execution‑grounded testing separates scripts that "look right" from those that actually pass, and self‑rewriting agents turn failure evidence into fixes without waiting for a human release cycle. 1
On the modeling side, cleaner memory edits in linear attention point toward steadier behavior on long documents and retrieval-heavy tasks, a practical complement to agentic advances and better reward modeling. 4
This Week, Try It
- Read AutoRubric‑T2I method & rubric examples: skim the paper figures and method section on arXiv. https://arxiv.org/abs/2605.17602
- Install Dust’s dsbx CLI (Linux): visit the repo and follow the release notes to download v0.1.16. https://github.com/dust-tt/dust