Measuring Human-Level Intelligence 🧠💡

ARC-AGI-3 tests real-time intelligence in games, Nemotron leads in distilled reasoning, and Manus shows how context engineering boosts agent performance without new models.

From redefining what it means to measure intelligence to building reasoning-first models and engineering context like code—this week’s breakthroughs mark a shift toward deeper, more human-like AI. ARC-AGI-3 is setting a bold new standard for testing adaptive intelligence in dynamic environments. NVIDIA’s OpenReasoning-Nemotron models are pushing the boundaries of distilled reasoning power. And context engineering is emerging as the secret weapon for building high-performing agents without retraining models.

Let’s dive into what’s shaping the next generation of truly intelligent systems. 👇

ARC-AGI-3: Benchmarking True Intelligence

ARC-AGI-3 is an upcoming Interactive Reasoning Benchmark that aims to measure human-like intelligence in AI by evaluating systems on their skill-acquisition efficiency in novel, hand-crafted environments without prior instructions. The benchmark features around 100 unique game-based environments in which agents must perceive, decide, and act across multiple steps—a test of real-world generalization and learning efficiency. ARC-AGI-3 focuses on interactive reasoning capabilities such as exploration, memory, planning, and goal acquisition, emphasizing that intelligence unfolds over time through experience and adaptation. Unlike historical benchmarks such as Atari games, ARC-AGI-3 eliminates language, trivia, and cultural dependencies to test core knowledge priors. Currently in preview with six sample games, ARC-AGI-3 invites the public to play or build agents, with open calls for innovative game ideas and competitions to help refine the benchmark ahead of its full 2026 launch.
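The perceive-decide-act loop the benchmark describes can be sketched as a minimal agent loop. The environment, its API, and the greedy policy below are hypothetical stand-ins for illustration only—ARC-AGI-3's real agent interface is not public yet.

```python
import random


class ToyGridEnvironment:
    """Hypothetical stand-in for an ARC-AGI-3 game: the agent sees a grid
    and must discover, with no prior instructions, that it should reach
    the goal marker."""

    def __init__(self, size=5, seed=0):
        self.size = size
        self.rng = random.Random(seed)
        self.agent = (0, 0)
        self.goal = (size - 1, size - 1)

    def observe(self):
        # The agent perceives raw state only; the marker's meaning is unstated.
        return {"agent": self.agent, "goal_marker": self.goal}

    def act(self, action):
        # Actions: "up", "down", "left", "right"; movement is clamped to the grid.
        dr, dc = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[action]
        r, c = self.agent
        self.agent = (max(0, min(self.size - 1, r + dr)),
                      max(0, min(self.size - 1, c + dc)))
        return self.agent == self.goal  # True once the goal is reached


def greedy_agent(obs):
    # A trivial policy: move toward the goal marker one axis at a time.
    (r, c), (gr, gc) = obs["agent"], obs["goal_marker"]
    if r < gr:
        return "down"
    if c < gc:
        return "right"
    return "up"


env = ToyGridEnvironment()
done, steps = False, 0
while not done and steps < 50:
    done = env.act(greedy_agent(env.observe()))
    steps += 1
print(steps, done)  # reaches the goal in 8 steps on a 5x5 grid
```

A real ARC-AGI-3 agent would, of course, have to infer the goal itself from interaction rather than read it from the observation—that inference gap is exactly what the benchmark measures.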

OpenReasoning-Nemotron: Distilled Reasoning Models

OpenReasoning-Nemotron is a newly released suite of high-performance reasoning-focused language models by NVIDIA, built through extensive distillation from the DeepSeek R1 0528 671B model. Available in four sizes—1.5B, 7B, 14B, and 32B—these models excel across reasoning benchmarks in math, science, and code, achieving state-of-the-art results for their size classes. Trained using Supervised Fine-Tuning (SFT) on over 5 million high-quality reasoning solutions, the models are part of NVIDIA's OpenReasoning initiative, with tools, datasets, and training code available through the NeMo-Skills framework. Notably, these models can operate in a "GenSelect" inference mode by combining multiple reasoning traces to select the best solution, significantly boosting performance in complex tasks like AIME and HMMT math challenges. Designed as research foundations rather than general-purpose assistants, they offer strong baselines for reinforcement learning experimentation and efficiency studies, and can be particularly valuable in developing specialized reasoning agents. The dataset used for this distillation is slated for future release, while models are openly accessible on Hugging Face for further exploration and fine-tuning.
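GenSelect proper asks the model itself to judge several generated reasoning traces and pick the strongest one. The sketch below simplifies that to majority voting over extracted final answers (self-consistency style) just to show the best-of-n selection shape; the trace data is hypothetical.

```python
from collections import Counter


def select_answer(traces):
    """Pick a final answer from multiple sampled reasoning traces.

    Simplified stand-in for GenSelect: instead of having the model judge
    the candidate traces, vote over their extracted final answers.
    """
    answers = [t["final_answer"] for t in traces]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


# Hypothetical traces, e.g. from sampling one model 5 times on a math problem.
traces = [
    {"reasoning": "...", "final_answer": "204"},
    {"reasoning": "...", "final_answer": "204"},
    {"reasoning": "...", "final_answer": "112"},
    {"reasoning": "...", "final_answer": "204"},
    {"reasoning": "...", "final_answer": "204"},
]
answer, agreement = select_answer(traces)
print(answer, agreement)  # "204" with 0.8 agreement
```

Selection over multiple traces is why GenSelect helps most on hard problems like AIME and HMMT: individual samples often fail, but agreement across samples is a strong signal.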

Context Engineering for Smarter AI Agents

Context engineering for AI agents, as practiced in the building of Manus, involves structuring and managing the information fed to an AI agent so it performs complex, multi-step tasks efficiently. Instead of training custom models, Manus focused on optimizing the agent's use of existing frontier models by carefully designing how context is presented at each step. This includes maximizing KV-cache hit rates by keeping prompt prefixes stable, using append-only contexts, and masking actions rather than dynamically removing tools. Manus also uses the file system as external memory, allowing the model to read and write information without exceeding context limits. Additionally, goals are recited in the context to hold the model's attention, and previous errors are retained to help the model learn and adapt. The key insight is that strong agent behavior comes from shaping context well—not just from stronger models.
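Several of these practices can be captured in a small sketch of an agent-context object: a fixed prefix for KV-cache reuse, an append-only event log that retains errors, and tool masking instead of tool removal. The class and method names are illustrative assumptions, not Manus's actual implementation.

```python
class AgentContext:
    """Sketch of an append-only agent context with a stable prefix,
    following the context-engineering ideas described above (names
    are hypothetical, not Manus's real code)."""

    def __init__(self, system_prompt, tools):
        # The system prompt and the FULL tool list form a fixed prefix,
        # so the serving stack can reuse the KV cache at every step.
        self.prefix = [system_prompt] + [f"tool: {name}" for name in tools]
        self.events = []     # append-only: past entries are never edited
        self.masked = set()  # tools are masked, never removed from the prefix

    def append(self, role, content):
        # Errors and failed actions are appended too, so the model can adapt.
        self.events.append(f"{role}: {content}")

    def mask_tool(self, name):
        # Removing a tool would change the prefix bytes and invalidate the
        # KV cache; real systems instead block the tool at decoding time.
        self.masked.add(name)

    def render(self):
        body = self.prefix + self.events
        if self.masked:
            body.append(f"(masked tools: {sorted(self.masked)})")
        return "\n".join(body)


ctx = AgentContext("You are a helpful agent.", ["browser", "shell", "files"])
ctx.append("user", "Summarize report.txt")
ctx.mask_tool("browser")
ctx.append("assistant", "Reading report.txt with the files tool...")
ctx.append("tool", "error: file not found")  # retained, not hidden
print(len(ctx.events))  # 3 events, prefix unchanged throughout
```

The design choice to appreciate here is that nothing before the current step is ever rewritten: stable bytes up front are what make prefix caching pay off across a long multi-step task.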

Hand-Picked Video

Turn your videos into professional content in just one click with our AI-powered background remover. Easily erase or replace any background with custom images, solid colors, or video clips—right in your browser. Perfect for creators, marketers, and businesses. Export in high-quality formats and preview in real-time. Try it free and transform your videos today!

Top AI Products from this week

  • Jeeva AI - From prospecting to enrichment to personalized outreach, Jeeva automates the grunt work so you can focus on building relationships and closing deals. With 2.0, Jeeva now sorts your inbox, preps you instantly for every call, manages your follow-ups, and even takes notes for you.

  • Jupitrr AI - Jupitrr AI makes personal branding videos effortless for coaches, consultants, creators and thought leaders. Upload your talking‑head video and watch it add stock footage, captions, hook text, web images and more tailored to your message. Want edits? Ask Levio, your video‑editing AI agent, to tweak anything you need.

  • Trae 2.0 - Trae 2.0 brings SOLO to everyone: an all-in-one Context Engineer that doesn’t just assist with code but thinks, plans, builds, and ships complete features end-to-end, with the right information and tools.

  • Stakpak.dev - Stakpak.dev is an open-source DevOps agent written in Rust that helps you secure, deploy, and maintain production-ready infrastructure.

  • Brainfork - Brainfork is a personal knowledge and decision MCP server that lets you regain control of your knowledge and unlock the full potential of a personal RAG service.

  • Saidar - Saidar is an intelligent personal assistant that can automate your admin tasks across 25+ of your software tools. You can set automations, generate reports, and manage your email and operations, all through simple natural-language text—e.g., ask it to "Email me a daily stock report at 8 AM."

This week in AI

  • OpenAI IMO Breakthrough - OpenAI's new experimental model solved 5 of 6 IMO 2025 problems, achieving gold-level performance—a first for AI in this elite math competition.

  • DuckDuckGo AI Image Filter - DuckDuckGo now lets users hide AI-generated images in search results via a new “AI images” filter, using curated blocklists to reduce low-quality AI content.

  • Math Shortcuts in Language Models - MIT researchers found language models use math shortcuts, not step-by-step tracking, to predict changes. Guiding these patterns could boost AI reasoning in dynamic tasks like code or weather.

  • Baby Grok Announced - xAI is creating Baby Grok, a kid-friendly AI app designed for safe and appropriate content, as announced by Elon Musk.

  • AI Persuasion Vulnerability - MIT researchers found classic human persuasion techniques double LLMs’ compliance with objectionable requests, revealing "parahuman" responses and key safety challenges.

Paper of The Day

This paper demonstrates that Google's Gemini 2.5 Pro achieved near gold-medal performance on the International Mathematical Olympiad (IMO) 2025 by solving 5 out of 6 problems correctly. The researchers developed a sophisticated pipeline involving iterative solution improvement and verification, where the model generates initial solutions, reviews and refines them, and uses a custom verifier to identify errors and gaps for correction. The successfully solved problems spanned combinatorics, geometry, number theory, sequences, and game theory, requiring deep mathematical insight rather than pattern matching. This represents a significant breakthrough in AI mathematical reasoning, showing that current LLMs can tackle olympiad-level mathematics when equipped with proper prompting strategies and verification systems.

To read the whole paper, click here.