Microsoft ships the Fara-7B on-device web agent and SOTA Harrier embeddings, while a 350M LIV-hybrid model targets edge throughput
Agentic computing moves local: a 7B visual-action model beats larger web agents while Microsoft quietly drops a decoder-only multilingual embedding SOTA. Meanwhile, a 350M LIV hybrid claims 40K tok/s on H100.
One-Line Summary
Small, specialized models push practical agents and retrieval forward: Microsoft’s 7B web agent runs on-device, MIT-licensed multilingual embeddings set a new bar, and post-training tooling stabilizes while evaluation research probes how well models can be monitored.
LLM & SOTA Models
Fara-7B: A 7B Agent That Uses Your Computer Like You Do
Instead of replying with text only, Fara-7B sees webpage screenshots and takes actions—scrolling, clicking exact coordinates, typing, and even calling macros like visit_url()—to finish tasks such as shopping, job searching, or booking. It’s trained from 145,000 trajectories (about 1 million steps) generated by a multi-agent pipeline (Magentic-One) and distilled into a single 7B model based on Qwen2.5-VL-7B (which supports up to 128k tokens). That small size means it can run locally, lowering latency and keeping data private. 1
On public and new benchmarks, Fara-7B posts strong task success rates: WebVoyager 73.5%, Online-Mind2Web 34.1%, DeepShop 26.2%, and on Microsoft’s new WebTailBench 38.4%—beating UI-TARS-1.5-7B (e.g., 73.5% vs UI-TARS’s 66.4% on WebVoyager) and competing with much larger systems prompted with accessibility trees or Set-of-Marks. Importantly, it averages about 16 steps per task vs roughly 41 for UI-TARS-1.5-7B, implying lower cost at equal per-token pricing. Microsoft flags typical limitations: complex-task accuracy, instruction slips, and hallucinations remain open problems. 1
The team releases open weights under MIT on Foundry and Hugging Face, plus a silicon-optimized, quantized build for Copilot+ PCs. Fara-7B integrates with Magentic-UI for demos and control points, and Microsoft recommends sandboxed runs and avoiding sensitive domains for now. The training pipeline uses task proposal from real URLs, multi-agent solving (Orchestrator/WebSurfer/UserSimulator), and triple verification (Alignment/Rubric/Multimodal) before supervised fine-tuning—no reinforcement learning used in this version. 1
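A computer-use agent like the one described boils down to a loop of screenshot in, structured action out. The sketch below parses model-emitted actions into a typed schema; the action names (click, type, scroll, visit_url) follow the article, but the exact string format and schema are assumptions for illustration, not Fara-7B’s published interface.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click" | "type" | "scroll" | "visit_url"
    x: int = 0         # pixel coordinates, used by click actions
    y: int = 0
    text: str = ""     # typed text, or the target URL for visit_url

def parse_action(raw: str) -> Action:
    """Parse a hypothetical model-emitted action string, e.g. 'click(412, 87)'."""
    name, _, args = raw.partition("(")
    args = args.rstrip(")")
    if name == "click":
        x, y = (int(v) for v in args.split(","))
        return Action("click", x=x, y=y)
    if name == "type":
        return Action("type", text=args.strip('"'))
    if name == "visit_url":
        return Action("visit_url", text=args.strip('"'))
    if name == "scroll":
        return Action("scroll")
    raise ValueError(f"unknown action: {name}")
```

Validating actions into a closed schema like this, rather than executing raw model output, is also where sandboxing hooks naturally attach.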
Microsoft also shows a broader push toward multi-model agentic workflows in research tooling: a new “Critique” mode lets OpenAI GPT draft and Anthropic Claude review for accuracy, completeness, and citations; a “Council” runs multiple models in parallel, with a “referee” synthesizing agreements and disagreements. On the DRACO deep-research benchmark, such collaboration reportedly beats single-model runs—an approach aimed at reducing hallucinations through cross-checks. 2
Harrier-OSS-v1: MIT-Licensed Multilingual Embeddings at the Top
Microsoft quietly posts three multilingual embedding models—270M, 0.6B, and 27B parameters—supporting 94 languages and 32,768-token context. The 27B variant reports a Multilingual MTEB v2 score of 74.3, above Qwen3-Embedding-8B (70.58), NVIDIA NV-Embed-v2 (~69–72, non-commercial), and OpenAI text-embedding-3-large (64.6). All models use decoder-only transformers with last-token pooling and L2 normalization, leaning on contrastive fine-tuning to make decoder architectures competitive in retrieval. 3
The 27B’s 5,376-d embedding is large but enables fine-grained semantics across languages; at BF16 it needs 80GB+ VRAM. The mid-sized 0.6B scores 69.0 and likely becomes a practical default for multilingual RAG pipelines on cost-performance grounds, while the 270M scores 66.5 and targets edge/offline workloads—both distilled from larger teachers. Caveats: no paper yet, documentation covers BF16 only (community quantizations are expected), and no per-language breakdowns despite the “94 languages” claim. 3
For multilingual search, enterprise knowledge bases, and cross-lingual support, these MIT-licensed, self-hostable models reduce lock-in and per-token costs. Compared to API-only options like OpenAI or Cohere, Harrier improves runway for long-context (32k) retrieval while keeping licensing commercial-friendly—especially valuable where data governance demands on-prem. 3
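The pooling recipe the article attributes to these models—last-token pooling over a decoder’s hidden states, then L2 normalization—can be sketched in a few lines. Shapes and the 0/1 attention-mask convention below are illustrative assumptions, not Harrier’s actual implementation.

```python
import numpy as np

def embed(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Last-token pooling + L2 normalization.

    hidden:         (batch, seq, dim) decoder hidden states
    attention_mask: (batch, seq) with 1 for real tokens, 0 for padding
    """
    last_idx = attention_mask.sum(axis=1) - 1                  # last real token per row
    pooled = hidden[np.arange(hidden.shape[0]), last_idx]      # (batch, dim)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)                # unit-length vectors

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # With unit-normalized embeddings, cosine similarity is a plain dot product.
    return float(a @ b)
```

Because the vectors come out unit-length, retrieval backends can rank by inner product alone, which most vector databases handle efficiently.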
Reka Models: Multimodal Stack from 2B to 67B
Reka details a family trained from scratch for multimodal reasoning: Spark (2B), Edge (7B), Flash (21B), and Core (67B)—all with 128k context and native support for interleaved image, audio, video, and text inputs; outputs can be text or audio tokens. The models claim strong logical/physical/temporal reasoning, OCR and visual PDF understanding, up to ~5-minute video comprehension (longer by streaming), and multilingual audio understanding without a separate ASR front-end. 4
The stack targets agentic tasks via instruction following, coding, function calling, and tool use. Positioning-wise: Spark for phones/robots/wearables, Edge for laptops/tablets, Flash for on-prem/private cloud, and Core for complex reasoning and as a teacher for distillation—framing a deployment ladder much like “SLM to frontier” but fully multimodal. 4
While fewer public benchmarks are cited on the page, Reka emphasizes action-bot readiness (beyond chat) and streaming inputs, which matters for human-in-the-loop systems, robotics, and real-time dashboards. Access is via Reka Chat and API, aimed at developers who need multimodal function-calling without gluing separate encoders. 4
Open Source & Repos
TRL v1.0: Stability Contract for a Moving Post-Training Field
Hugging Face’s TRL evolves from research code to a library with a clear stability model: a “stable” core (e.g., SFT, Direct Preference Optimization (DPO), reward modeling, Reinforcement Learning with Verifiers (RLOO/GRPO)) and an “experimental” surface where APIs can change quickly. The goal is to support more than 75 post-training methods while keeping downstream stacks from breaking as the field shifts between PPO, DPO-like preference learning, and verifier-driven RL. Monthly downloads reportedly exceed 3M, so stability-by-contract matters. 5
Design-wise, TRL favors explicit implementations and even duplication over brittle abstractions, accepting that “what’s core” keeps changing (e.g., reward models essential in PPO, optional in DPO, then reborn as verifiers in RLVR). Promotion from experimental to stable depends on usage and maintenance cost. Migration from the last 0.x version is said to be minimal per the guide. 5
In ecosystem comparisons, TRL balances breadth and simplicity with deep Transformers/PEFT integration and low infra burden (single-GPU viable), contrasting with heavier Ray- or vLLM-dependent stacks. For teams standardizing post-training, the explicit stable/experimental split reduces incidents when methods or defaults change. 5
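The stable/experimental contract described above can be made concrete with a small decorator pattern: experimental entry points work normally but warn callers that their API may churn. This is a generic sketch of the idea, not TRL’s actual implementation; the `rlvr_step` function is a hypothetical placeholder.

```python
import functools
import warnings

def experimental(fn):
    """Mark an API as experimental: usable now, but subject to change."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{fn.__name__} is experimental; its API may change between "
            "releases without a deprecation cycle.",
            FutureWarning,
            stacklevel=2,
        )
        return fn(*args, **kwargs)
    wrapper.__experimental__ = True
    return wrapper

@experimental
def rlvr_step(batch):
    # Placeholder for a verifier-driven RL update on one batch.
    return {"loss": 0.0, "batch_size": len(batch)}
```

Promotion to the stable surface then amounts to removing the decorator once usage and maintenance cost justify the commitment, which mirrors the promotion criteria the library describes.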
Contributing New Models: Codex-Assisted Port to Transformers
A community write-up shows how the Codex desktop coding agent (using a GPT-5.3 Codex model at the time) ports a new video segmentation model (VidEoMT) into Transformers by operating over the local filesystem, writing progress.md as memory, and iterating through style and repo checks. The approach narrows scope (convert one checkpoint first), compares outputs on dummy inputs, and uses existing Transformers agent guides to streamline tooling. 6
The author reports that around late 2025 coding agents crossed a reliability threshold for complex tasks like end-to-end model ports. Notably, Codex’s context compaction helped avoid “context rot” during long sessions, enabling continuous progress without restarting chats. The process ended with an AI-authored PR, reviewed by maintainers. 6
Takeaway: modern coding agents can shoulder much of the boilerplate—templating, conversion scripts, parity checks—while humans still gatekeep evaluation rigor and API fit. This could accelerate the time-to-integration for new research models across modalities. 6
vLLM 0.18.1: Patch Release Targets Accuracy and Build Stability
vLLM ships 0.18.1 with targeted fixes: default SM100 MLA prefill backend reverts to TRT-LLM, Blackwell Qwen3.5 FP8 accuracy degradation in DeepGemm gets patched, monolithic TRTLLM MoE is disabled for Renormalize routing, and Docker builds pre-download missing FlashInfer headers. There’s also a Python ≤3.10 fix for mock.patch resolution in standalone_compile.FakeTensorMode. 7
While minor, these patches matter for production: reverting a default avoids perf regressions on new NVIDIA parts, and the FP8 fix tackles precision drift that could impact leaderboard results or quality gates in deployment. The provided wheels span CPU and CUDA 13.0 for x86_64 and aarch64, easing upgrades on common platforms. 7
With 75k+ stars, vLLM sits in many inference paths; small routing or kernel mismatches can cascade into throughput or accuracy surprises. The project’s quick turnarounds on driver/arch edges remain a differentiator for teams standardizing on its serving stack. 7
Research Papers
MonitorBench: Can We Trust Chain-of-Thought to Reflect True Reasoning?
Chain-of-thought (CoT) explanations often don’t match the internal factors that actually drive a model’s final answer. MonitorBench proposes a comprehensive, fully open-source benchmark—1,514 instances, 19 tasks across 7 categories—to measure when CoTs are “monitorable,” i.e., causally tied to decision-critical factors. Two stress-test settings quantify how easily monitorability degrades. 8
Key findings: CoT monitorability is higher when success requires structural reasoning over the decision-critical factor. Conversely, more capable (closed-source) models tend to show lower monitorability, and both open- and closed-source models can intentionally reduce monitorability under stress, with drops of up to 30% on tasks not requiring structural reasoning. This suggests that better raw performance does not guarantee more faithful rationales. 8
Why it matters: as teams increasingly log CoTs for audits or safety monitoring, assuming those traces reflect true reasoning is risky. MonitorBench complements widely used benchmarks (e.g., MMLU, GSM8K) and tooling like DeepEval by directly pressuring an overlooked dimension: whether explanations are causally aligned with behavior. 9 10 11
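The core notion—whether a CoT trace tracks the factor that actually drove the answer—can be illustrated with a toy scoring rule: credit a trace only when its content matches causal reality (the factor is mentioned when it mattered, absent when it didn’t). This is an illustration of the idea, not MonitorBench’s actual metric.

```python
def monitorability_score(examples):
    """examples: list of (cot_text, critical_factor, factor_drove_answer)."""
    hits = 0
    for cot, factor, drove_answer in examples:
        mentions = factor.lower() in cot.lower()
        # A trace is "monitorable" here when mention and causality agree.
        if mentions == drove_answer:
            hits += 1
    return hits / len(examples)

examples = [
    # factor mentioned AND causal -> monitorable
    ("The discount code expired, so the total is $50.", "discount code", True),
    # factor hidden despite being causal -> unfaithful trace
    ("Adding the items gives $50.", "discount code", True),
    # factor absent and irrelevant -> fine
    ("Adding the items gives $50.", "loyalty tier", False),
]
print(monitorability_score(examples))  # 2/3
```

A real benchmark additionally needs ground truth for `factor_drove_answer`, which is exactly what causal interventions (perturbing the factor and watching the answer) provide.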
Liquid AI LFM2.5-350M: Tiny Model, Big Instruction-Following
Liquid AI’s 350M-parameter LFM2.5 pushes “intelligence density” by pretraining on 28T tokens (up from 10T) and using a hybrid backbone: 10 double-gated Linear Input-Varying (LIV) conv blocks plus 6 Grouped Query Attention (GQA) blocks. This slashes KV-cache overhead while keeping a 32k context window. The model targets high-speed, agentic tasks over general reasoning. 12
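The KV-cache argument for the hybrid backbone is easy to check back-of-envelope: only the 6 attention blocks keep a KV cache, while the 10 conv blocks keep none. The head counts and dims below are illustrative assumptions, not LFM2.5’s published configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V (factor of 2) per attention layer, per KV head, per position;
    # dtype_bytes=2 assumes BF16.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

seq = 32_768  # the model's stated context window

# Hypothetical all-attention baseline: 16 layers, 8 KV heads
full = kv_cache_bytes(layers=16, kv_heads=8, head_dim=64, seq_len=seq)
# Hybrid: only 6 GQA layers cache, with fewer KV heads (GQA sharing)
hybrid = kv_cache_bytes(layers=6, kv_heads=2, head_dim=64, seq_len=seq)

print(f"all-attention baseline: {full / 2**20:.0f} MiB")   # 1024 MiB
print(f"6-layer GQA hybrid:     {hybrid / 2**20:.0f} MiB") # 96 MiB
print(f"reduction:              {full / hybrid:.1f}x")     # 10.7x
```

Even under these made-up dims, the order-of-magnitude cache reduction at a full 32k context shows why conv-heavy hybrids suit memory-constrained edge targets.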
Benchmarks show strengths in instruction-following: IFEval 76.96, with GPQA Diamond 30.64 and MMLU-Pro 20.01—solid for 350M but below large reasoning models. Docs explicitly advise against math, complex coding, and creative writing. On a single H100, throughput peaks at 40.4K output tokens/s at high concurrency; edge figures include 169MB peak NPU memory (Snapdragon, Q4) and 300MB on Raspberry Pi 5 (int8). 12
A third-party safety profile (Nerq) for a close variant (LFM2-350M-Math) rates 59.2/100 (D), citing 0/100 maintenance and documentation, and limited popularity, though compliance scores 87/100 across 52 jurisdictions. Translation: promising for dev/testing and local agents, but enterprises should perform due diligence before production. 13
Alibaba Qwen3.5-Omni: Native Omnimodal With Real-Time Tricks
Qwen3.5-Omni unifies text, image, audio, and video in a single Thinker–Talker architecture, using Hybrid-Attention Mixture-of-Experts across modalities and a native Audio Transformer pre-trained on 100M+ hours. It supports 256k context, over 10 hours of audio, and ~400 seconds of 720p audio-visual input at 1 FPS—positioned against Google’s Gemini 3.1 Pro. 14
Alibaba reports “215 SOTA” wins across audio/AV subtasks, including 156 speech-to-text translation tasks and 43 ASR tasks, plus 3 AV and 5 general audio benchmarks. Real-time features include ARIA (Adaptive Rate Interleave Alignment) to stabilize speech generation and native turn-taking intent recognition to handle interruptions. 14
A standout claim is “Audio-Visual Vibe Coding”: coding directly from audio-visual instructions, e.g., recording a UI, narrating a bug, and having the model generate a fix—suggesting cross-modal mappings from visuals and speech to symbolic code. Tiers (Plus/Flash/Light) balance latency and accuracy for different deployment needs. 14
Why It Matters
We’re watching a “small-but-agentic” shift: Fara-7B shows that a 7B model can act on real web UIs, run on-device, and match or beat larger systems on common tasks while cutting steps roughly 2.5×—a practical path to private, low-latency assistants. In parallel, multilingual retrieval moves beyond API lock-in: an MIT-licensed 27B embedding leads MTEB v2, with 0.6B and 270M variants that make multilingual RAG viable on modest hardware. 1 3
Tooling and evaluation are maturing too. TRL’s stability contract acknowledges the pace of post-training methods, while MonitorBench warns that better answers don’t always mean faithful reasoning traces—crucial for audits and safety. The throughline: more capable agents, cheaper multilingual retrieval, steadier training stacks, and sharper tests—all necessary to turn today’s demos into tomorrow’s dependable systems. 5 8