[AINews] AI Engineer World's Fair — Autoresearch, Memory, World Models, Tokenmaxxing, Agentic Commerce, and Vertical AI Call for Speakers

Latent Space 2 Information level 2 Published: 2026-05-02T07:21 Crawled: 2026-05-03 12:54

Industry AI News

Summary

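The AI Engineer World's Fair has opened Wave 2 of its Call for Speakers for the summer 2026 conference, adding new tracks including Autoresearch, Tokenmaxxing, Memory, World Models, and Agentic Commerce. The event will be held at Moscone West and is doubling in size. AIE viewership is trending to at least double its 2025 peak, reaching over a million unique AI engineers a month.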

Key facts

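Wave 2 of the AIE World's Fair 2026 Call for Speakers is open, with new dedicated tracks
This year is the first at Moscone West, with the event doubling in size for the third year in a row
AIE reaches over a million AI engineers a month, with viewership trending to at least double the 2025 peak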

AI Engineer World's Fair Moscone West San Francisco

Original text

TL;DR: we are announcing Wave 2 Call for Speakers for AIE World’s Fair this summer - apply here: https://sessionize.com/aiewf2026/ ESPECIALLY if you have projects relevant to our new tracks in Autoresearch, Memory, World Models, Tokenmaxxing, Agentic Commerce, and Vertical AI in Law, Healthcare, GTM and Finance!
In January we laid out plans for Scaling without Slop and despite some content exhaustion risk, your reception has been positive, with AIE viewership now trending to at least double 2025’s peak, serving over a million unique AI engineers a month.
This year is our first in Moscone West, doubling for the 3rd year in a row in our mission to bring all of the AI Engineering world to San Francisco to showcase the must-know research and product engineering work of the year, as well as to hire, fundraise, and close business deals. Sales are going well, but traditionally we do one callout a year for the World’s Fair to widen our net for people who might not traditionally think to submit a talk (because they didn’t know we were interested!).
This year we are adding an entire day’s worth of talks to the schedule, so on top of all the evergreen themes we covered in 2025 and in Europe, we’re adding a few more new ones that I am specifically soliciting applications (and sponsors!) to cover:
Autoresearch: recursive self improvement loops in harnesses and model training!
Tasteful Tokenmaxxing: as a company leader, how do you make your AI Eng teams 10x more AI-Native/scale AI adoption, BUT without Goodharting waste?
Memory: how are your agents/models improving as your users use them?
World Models: how are you solving spatial intelligence and adversarial reasoning?
Agentic Commerce: how are agents paying for data, APIs, and other agents?
Vertical AI in Law, Healthcare, GTM and Finance: how are you applying AI in these specific domains? We are also open to submissions for AI in Government and AI in Education, though generally these seem less fast-moving.
Robotics: last year, Physical Intelligence, Waymo, Tesla, Nvidia, K-Scale (RIP) and others presented their approaches to autonomy; this year WE ARE ALLOCATING FREE EXPO FLOOR SPACE FOR GOOD ROBOTICS DEMOS. (contact hello@ai.engineer to set up your demo area! Humanoids must be accompanied.)
Founders: a new Startup Battlefield event will be added where you can pitch your pre-series A company to our panel of top VCs and guest judges.
There are other new tracks, which you can find in the full application form (don’t constrain yourself to tracks, just submit your best work and we’ll find a place for you).
If you already applied and were accepted in Wave 1, you should receive an email in your inbox informing you so - if not, don’t fret, you’ll still be considered in Wave 2, no further action needed.
This is for everyone else who wasn’t aware we are soliciting applications for the biggest technical AI event of the year - especially if you know someone who would be PERFECT to talk about some of these topics we are calling out, then we need your help to reach them.
Apply here - and book your ticket/travel asap (because things are filling up fast for the World Cup also taking place in SF that week). We will refund successful applicants. (Also contact hello@ai.engineer if you need an invitation letter for an international visa.)

AI News for 4/30/2026-5/1/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Grok 4.3’s Release, Benchmark Deltas, and the Open-vs-Closed Frontier
xAI shipped Grok 4.3 with materially better cost/performance, but mixed eval reception: Early chatter from @scaling01 flagged an imminent API launch, followed by a detailed benchmark breakdown from Artificial Analysis. On their Intelligence Index, Grok 4.3 scores 53, up 4 points over Grok 4.20, with roughly 40% lower input and 60% lower output pricing. The biggest gain was on GDPval-AA, up 321 Elo to 1500, suggesting stronger real-world agentic task performance. It also hit 98% on τ²-Bench Telecom and held 81% on IFBench. The tradeoff: AA-Omniscience accuracy rose while non-hallucination dropped by 8 points, leaving concerns about reliability despite stronger capability. Arena has already added it across text, vision, document, and code modes, per @arena.
Community reaction was split between “meaningful iteration” and “still behind top open models”: Several posts argued Grok is improving faster than critics admit, including @teortaxesTex, who noted token-efficiency gains as well, while others were more skeptical. @scaling01 claimed “Grok-4.3 still behind chinese open-source”, and Andon Labs reported a major regression on Vending-Bench 2, where Grok allegedly preferred to “sleep” rather than act. The more structural critique came from pricing and infra economics: @teortaxesTex argued Grok’s low prices may be subsidized by poor hardware utilization and that cache economics, not only model quality, increasingly determine agentic TCO.
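To make the cache-economics point concrete, here is a minimal sketch of how the cost of one agentic task might vary with prompt-cache hit rate. Every number below (prices, token counts, hit rates) is a hypothetical placeholder and not any provider's actual pricing; `agent_task_cost` is a name made up for this illustration.

```python
def agent_task_cost(input_tokens, output_tokens, cache_hit_rate,
                    price_in, price_cached_in, price_out):
    """Dollar cost of one agent task; all prices are per million tokens."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1]")
    cached = input_tokens * cache_hit_rate      # tokens billed at the cached rate
    uncached = input_tokens - cached            # tokens billed at the full input rate
    total = (uncached * price_in
             + cached * price_cached_in
             + output_tokens * price_out)
    return total / 1_000_000

# Hypothetical workload: 2M input tokens re-read across turns, 50k output tokens.
PRICES = dict(price_in=1.00, price_cached_in=0.10, price_out=4.00)
cold = agent_task_cost(2_000_000, 50_000, 0.0, **PRICES)   # no cache hits
warm = agent_task_cost(2_000_000, 50_000, 0.9, **PRICES)   # 90% cache hits
print(f"cold: ${cold:.2f}  warm: ${warm:.2f}")
```

Under these made-up numbers the same task costs about $2.20 cold and $0.58 warm, which is the sense in which cache behavior, and not only list price or model quality, can dominate the bill for long multi-turn agent runs.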
DeepSeek V4 Pro, Vision/Spatial Reasoning, and Open-Weights Closing the Gap
DeepSeek V4 Pro appears to be the most credible open-weight coding/agent model in this batch: The strongest hands-on report came from @omarsar0, who tested DeepSeek-V4-Pro inside the Pi coding agent and described it as the first open-weight model that genuinely feels comparable to Codex or Claude Code for multi-turn agentic coding. Key systems details included 1M context, a hybrid CSA/HCA attention design, KV cache reduced to 10%, and nearly 4x lower inference FLOPs at long context. The report also emphasized practical harness fit: no custom setup, stable traces, and viable multi-step research/coding loops on Fireworks inference.
The broader benchmark picture confirms open weights are now much closer, though still behind on hardest tasks: Artificial Analysis noted that the three leading open-weight models released last week—Kimi K2.6, MiMo V2.5 Pro, and DeepSeek V4 Pro—now score 52–54 on the Intelligence Index, versus 57 for Gemini 3.1 Pro Preview and Claude Opus 4.7, and 60 for GPT-5.5. These top open models are all trillion-plus MoE systems with permissive licenses: Kimi at 1T/32B active, MiMo at 1T/42B active, and DeepSeek V4 Pro at 1.6T/49B active. The remaining gap is concentrated in HLE, CritPt, TerminalBench Hard, and hallucination-heavy Omniscience.
DeepSeek’s multimodal direction seems centered on explicit spatial grounding: Speculation about DeepSeek-Vision outperforming V4-Pro on ARC-AGI-2 because of actual spatial reasoning came from @teortaxesTex. A later summary of a briefly posted-and-deleted tech report from ZhihuFrontier described a multimodal CoT system that can “point while thinking” using boxes and points embedded directly into reasoning traces to reduce the “reference gap” in counting, maze solving, and path tracing. The stack reportedly uses DeepSeek-ViT, CSA compression, and V4-Flash (284B total / 13B active). Even if early tests still show weaknesses, it is a notable architectural bet: turning visual reasoning into explicit grounded computation rather than plain text description.
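As a quick sanity check on the MoE sizes quoted in this section, the sketch below computes what share of each model's weights is active per token. It assumes "1T" means roughly 1,000B parameters; the figures are the ones cited above, not independently verified.

```python
# (total parameters, active parameters per token), in billions, as quoted above.
# Assumption: "1T" is treated as 1000B and "1.6T" as 1600B.
MODELS = {
    "Kimi K2.6": (1000, 32),
    "MiMo V2.5 Pro": (1000, 42),
    "DeepSeek V4 Pro": (1600, 49),
    "DeepSeek V4-Flash": (284, 13),
}

def active_fraction(total_b, active_b):
    """Fraction of the model's parameters used for a single token."""
    return active_b / total_b

fractions = {name: active_fraction(*sizes) for name, sizes in MODELS.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of weights active per token")
```

On these numbers all four models activate only about 3% to 5% of their weights per token, which helps explain how trillion-scale open models can still be served at competitive cost.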
Codex’s Rapid Product Expansion vs Claude Code, Devin, and Other Agent Runtimes
Codex is winning on product velocity and UX polish, not just base model quality: A major theme across tweets was how quickly the Codex app is improving. High-engagement praise came from @gdb, @theo, and others comparing its feel favorably to alternatives. OpenAI added a device toolbar for responsive testing and improved browser-use speed by ~30% in “vibe testing,” per @JamesZmSun. It also added CI status in chat via @reach_vb, migration/import tooling for settings/plugins/agents via OpenAI, and a surprisingly viral pets system in Codex via @OpenAIDevs. While whimsical, the repeated point from users was that OpenAI is shipping a cohesive environment, not just a model endpoint.
Codex vs Claude Code is increasingly framed as UX + speed + taste tradeoffs: @theo summarized the current frontier coding vibe: GPT-5.5 is “smarter and can unblock you,” while Opus 4.7 has better intent/taste but can wander. In a second post, he argued Claude Code feels much slower on TTFT/TPS and requires more tool calls, while GPT/Codex feels more direct and economical for “fast mode” style use. Still, public benchmark comparisons are mixed: @scaling01 said GPT-5.5 did not beat Opus 4.7 on PostTrainBench in the Claude Code harness, highlighting how much results remain harness-dependent.
Other agent runtimes are converging on similar primitives: Devin launched “inside your shell” hotkey access via @cognition. Hermes added a /goal loop with a supervisor model forcing the agent to continue until completion, via @Teknium. Flue, introduced by @FredKSchott, positions itself as a TypeScript framework for headless autonomous agents, “like Claude Code but programmable.” The common pattern across these launches is that the competitive surface is moving from raw model IQ to agent harness design: subagents, browser-use, durable state, compaction, skills, and feedback loops.
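The supervisor-driven goal loop mentioned above can be sketched generically as follows. This is not Hermes's implementation: `run_until_goal`, `agent_step`, and `supervisor_done` are hypothetical names, and the two toy functions stand in for real model calls.

```python
from typing import Callable, List

def run_until_goal(goal: str,
                   agent_step: Callable[[str, List[str]], str],
                   supervisor_done: Callable[[str, List[str]], bool],
                   max_steps: int = 20) -> List[str]:
    """Call the agent repeatedly until the supervisor accepts or the budget runs out."""
    transcript: List[str] = []
    for _ in range(max_steps):
        transcript.append(agent_step(goal, transcript))
        if supervisor_done(goal, transcript):
            break  # supervisor judged the goal complete
    return transcript

# Toy, deterministic stand-ins for the worker and supervisor models.
def toy_agent(goal, transcript):
    return f"step {len(transcript) + 1} toward: {goal}"

def toy_supervisor(goal, transcript):
    return len(transcript) >= 3  # pretend the goal is met after three steps

result = run_until_goal("fix failing tests", toy_agent, toy_supervisor)
print(result)
```

The `max_steps` budget is a design choice worth keeping in any real version: a supervisor that never says "done" would otherwise keep the agent running indefinitely.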
Agent Infrastructure: Retrieval, Memory, HITL, and Durable Execution
The strongest research signal was that agent systems are bottlenecked by runtime design, not just model quality: Two especially useful papers were highlighted. First, ReaLM-Retrieve, summarized by @omarsar0, argues that reasoning models need retrieval during inference rather than only before it. It reports +10.1% absolute F1 over standard RAG and 47% fewer retrieval calls than fixed-interval IRCoT, with 3.2x lower per-retrieval overhead. Second, OCR-Memory, shared by @dair_ai, stores long-horizon trajectories as images with indexed anchors, retrieving exact prior content instead of lossy text summaries; it reports SOTA on Mind2Web and AppWorld under strict context limits.
LangChain/LangGraph pushed hard on production primitives for multi-user and human-in-the-loop agents: @sydneyrunkle outlined three concrete multi-user deployment concerns (data isolation, delegated credentials, and operator RBAC) and mapped each to LangSmith Agent Server features. Later posts covered a new HITL mode where a human reply can be returned directly as a tool result and durable pause/resume semantics for consequential actions or unresolved judgment calls. This is a good snapshot of where real deployment complexity is moving: auth boundaries, persistent state, and explicit intervention points.
Durable execution is becoming a first-class runtime feature across stacks: Cloudflare announced Dynamic Workflows for adding durable execution to agent plans via @celso. LangChain positioned create_agent as the low-level primitive beneath Deep Agents, with extensibility for filesystems, bash, compaction, hooks, and subagents via @Vtrivedy10. The meta-point is consistent with one linked technical blog: the agent runtime itself—sandboxing, replay, checkpointing, orchestration—has become hidden technical debt and a major source of differentiation.
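To illustrate what durable execution with a human-in-the-loop pause can look like, here is a minimal sketch that checkpoints each completed step to a JSON file and resumes from it on the next run. It does not use the LangGraph or Cloudflare APIs; `run_workflow`, `NeedsHuman`, and the three toy steps are all hypothetical.

```python
import json
import os
import tempfile

class NeedsHuman(Exception):
    """Raised by a step that cannot proceed without a human reply."""

def run_workflow(steps, path, human_reply=None):
    """Run (name, fn) steps in order, persisting results so a rerun resumes."""
    state = {"done": {}}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)          # reload the last checkpoint
    for name, fn in steps:
        if name in state["done"]:
            continue                      # finished in an earlier run; skip it
        try:
            result = fn(state["done"], human_reply)
        except NeedsHuman:
            return {"status": "paused", "at": name}
        state["done"][name] = result
        with open(path, "w") as f:        # checkpoint after every completed step
            json.dump(state, f)
    return {"status": "complete", "results": state["done"]}

calls = []

def draft(done, reply):
    calls.append("draft")                 # track how often this step really runs
    return "refund $40"

def approve(done, reply):
    if reply is None:
        raise NeedsHuman()
    return reply                          # the human reply becomes the step result

def send(done, reply):
    return f"sent: {done['draft']} ({done['approve']})"

STEPS = [("draft", draft), ("approve", approve), ("send", send)]
path = os.path.join(tempfile.mkdtemp(), "state.json")
first = run_workflow(STEPS, path)               # pauses at "approve"
second = run_workflow(STEPS, path, "approved")  # resumes without rerunning "draft"
print(first, second)
```

The second call picks up at the approval step because the draft result was already persisted. A production runtime would also need atomic writes, retries, and idempotent side effects, which this sketch leaves out.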
Research and Systems Papers Worth Bookmarking
Recursive / latent-space multi-agent coordination is emerging as a serious alternative to text-only agent chatter: @omarsar0 summarized Recursive Multi-Agent Systems, where agents communicate through shared latent recursive computation instead of full natural-language exchanges. Reported gains: 8.3% average accuracy improvement, 1.2x–2.4x end-to-end speedup, and 34.6%–75.6% token reduction across nine benchmarks. If agent-to-agent communication cost becomes dominant, this line of work matters.
Meta FAIR’s “self-improving pretraining” idea may be one of the more consequential training-time papers in the batch: @omarsar0 highlighted a method where a strong post-trained model rewrites pretraining suffixes toward safer, higher-quality continuations and then judges model rollouts during RL-style pretraining. Reported improvements include 36.2% relative gain in factuality, 18.5% in safety, and up to 86.3% win rate in generation quality over standard pretraining.
Microsoft’s synthetic long-horizon computer-use worlds look like a credible data recipe: @dair_ai described a system that creates 1,000 synthetic computers with realistic files and documents, then runs 8-hour agent simulations averaging 2,000+ turns. The thesis is straightforward and important: for computer-use agents, the bottleneck is no longer only model capability but scalable, realistic experiential data.
Top tweets (by engagement)
OpenAI/Codex momentum: OpenAI says GPT-5.5 is its strongest launch yet, with API revenue growing 2x faster than prior releases and Codex doubling revenue in under seven days.
Defense/government adoption: The U.S. “Department of War” CTO announced agreements with seven frontier AI and infrastructure companies to deploy capabilities on classified networks.
OpenAI messaging pivot on labor: Sam Altman: “we want to build tools to augment and elevate people, not entities to replace them”, with follow-up comments on jobs and future work.
Codex adoption and delight: “codex app becoming incredible” from @gdb, plus Codex pets unexpectedly becoming one of the day’s biggest product-engagement hits.
Model benchmarking reality check: ARC Prize reports GPT-5.5 at 0.43% and Opus 4.7 at 0.18% on ARC-AGI-3, with analysis of failure modes.
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Qwen Model Developments and Benchmarks
PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 (Activity: 339): The post introduces PFlash, a speculative prefill technique for long-context decoding on quantized 27B targets using C++/CUDA, achieving a 10x speedup over vanilla llama.cpp on an RTX 3090. This method leverages a small drafter model to score token importance, allowing the main model to focus only on significant spans, thus reducing prefill time significantly. The implementation combines insights from recent papers on speculative prefill and block-sparse attention, and is executed entirely in C++/CUDA without Python or PyTorch, making it efficient for consumer-grade GPUs like the RTX 3090. The repository is available on GitHub. Some commenters express skepticism about the claimed 10x speedup, with one noting the approach as potentially ‘super lossy’ due to its compression method. Another user reports out-of-memory issues on a 4090, indicating potential challenges in replicating the results.
randomfoo2 highlights a novel approach in PFlash that involves using a smaller Qwen3-0.6B drafter to process the full 64K/128K prompt with FlashPrefill/BSA-style sparse attention, which reduces the computational cost. The drafter evaluates token/span importance, retaining only a crucial subset for the 27B target model to prefill, followed by speculative decoding using DFlash+DDTree on the compressed target KV. This method is noted for being ‘super lossy,’ indicating potential trade-offs in accuracy for speed.
qwen_next_gguf_when raises concerns about the practicality of the PFlash method, noting that the DFlash component tends to run out of memory (OOM) on an RTX 4090. This suggests potential limitations in hardware compatibility or efficiency, which could impact the method’s replicability and scalability across different systems.
Obvious-Ad-2454 expresses skepticism about the claimed 10x speedup, suggesting it might be too optimistic without independent verification. This comment underscores the importance of replication studies to validate performance claims in machine learning, especially when such significant improvements are reported.
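The span-selection idea behind speculative prefill, as described in the thread, can be sketched in a few lines. This is only an illustration under stated assumptions: the importance scores are made up, scoring spans by mean importance and keeping a fixed ratio are choices made here, and none of it reflects PFlash's actual C++/CUDA code.

```python
def select_spans(scores, span_len, keep_ratio):
    """Return indices of tokens in the highest-scoring spans, in original order."""
    # Cut the prompt into fixed-length spans (the last one may be shorter).
    spans = [range(i, min(i + span_len, len(scores)))
             for i in range(0, len(scores), span_len)]
    n_keep = max(1, round(len(spans) * keep_ratio))
    # Rank spans by mean token importance (an assumption for this sketch).
    ranked = sorted(spans,
                    key=lambda s: sum(scores[j] for j in s) / len(s),
                    reverse=True)
    # Restore prompt order so the target model sees tokens in sequence.
    kept = sorted(ranked[:n_keep], key=lambda s: s.start)
    return [j for s in kept for j in s]

# Hypothetical per-token importance scores from a small drafter model.
scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1, 0.7, 0.6]
kept = select_spans(scores, span_len=2, keep_ratio=0.5)
print(kept)  # token positions the large model would prefill
```

With half the spans dropped, the target model prefills 4 of 8 tokens here. The 'super lossy' concern raised by commenters maps directly onto `keep_ratio`: anything in a discarded span is invisible to the large model.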
Qwen 3.6 27B vs Gemma 4 31B - making Packman game! (Activity: 994): In a local LLM gamedev contest, Gemma 4 31B outperformed Qwen 3.6 27B in creating a Pac-Man style game on a MacBook Pro M5 Max with 64GB RAM. Gemma processed 27 tokens/sec and completed the task in 3m 51s with 6,209 tokens, while Qwen processed 32 tokens/sec over 18m 04s with 33,946 tokens. Despite Qwen’s more creative and visually styled output, Gemma’s solution was shorter, clearer, and more logical, excelling in game logic, interaction handling, and performance stability. The task required generating a complete HTML-based game with procedural graphics and no external libraries, focusing on smooth gameplay and stable performance using requestAnimationFrame and delta time for animations. Commenters noted the humor in the prompt’s demand for ‘no bugs’ and questioned the utility of vague prompts, suggesting they primarily test a model’s pre-existing knowledge rather than its problem-solving ability.
Qwen 3.6 27B was tasked with creating a Pacman clone using a single HTML page and any libraries or graphics sources it deemed necessary. Interestingly, the model did not perform any external downloads or research, instead relying on its pre-existing knowledge to code the game. This highlights the model’s ability to generate functional code from minimal prompts, though it raises questions about the depth of its understanding and adaptability to new resources.
A user pointed out that the ghost enemy movement in the Gemma 4 31B version of the Pacman game appears to be malfunctioning. This suggests potential issues with the model’s ability to accurately implement game logic, particularly in handling dynamic elements like enemy AI, which is crucial for a game like Pacman.
The discussion raises concerns about the utility of using vague prompts for testing AI models, as noted by a commenter who described such prompts as “benchmaxxing tests.” This implies that the tests may not effectively evaluate the model’s problem-solving capabilities or its ability to adapt to new tasks, but rather assess its pre-existing knowledge base.
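The throughput figures in the post can be cross-checked from the reported token counts and wall-clock times. The sketch below does only that arithmetic; treating the reported durations as pure generation time is an assumption.

```python
def tokens_per_second(tokens, minutes, seconds):
    """Average throughput implied by a token count and a wall-clock duration."""
    return tokens / (minutes * 60 + seconds)

gemma = tokens_per_second(6_209, 3, 51)    # 231 s total, about 26.9 tokens/sec
qwen = tokens_per_second(33_946, 18, 4)    # 1084 s total, about 31.3 tokens/sec

token_ratio = 33_946 / 6_209               # Qwen emitted about 5.5x the tokens
time_ratio = (18 * 60 + 4) / (3 * 60 + 51) # and took about 4.7x as long

print(f"Gemma: {gemma:.1f} tok/s, Qwen: {qwen:.1f} tok/s")
print(f"Qwen used {token_ratio:.1f}x the tokens in {time_ratio:.1f}x the time")
```

The implied rates line up with the reported 27 and 32 tokens/sec to within about one token per second. The comparison shows that the gap in completion time came almost entirely from Qwen generating far more tokens, not from slower decoding.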
Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models (Activity: 437): The Qwen Team has released Qwen-Scope, a set of Sparse Autoencoders (SAEs) for the Qwen 3.5 models, ranging from 2B to 35B MoE. This tool maps internal features across all layers, functioning as a dictionary of the model’s internal concepts, allowing for precise manipulation of features such as ‘legal talk’ or ‘Python code’. Key functionalities include Surgical Abliteration to suppress specific features, Feature Steering to activate desired concepts, Model Debugging to identify token-triggered directions, and Dataset Analysis to verify feature activation. The tool is released under the Apache 2.0 license but with a caution against removing safety filters. A practical example includes diagnosing unexpected language switches using a heatmap to identify over-activated features. More details can be found in the Qwen-Scope paper and the Hugging Face Space. Commenters highlight the significance of this release, noting it as potentially the largest open-source interpretability tool for dense models, surpassing Google’s GemmaScope in scale. There is anticipation for future iterations, such as Qwen 3.6, to incorporate similar tools.
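For readers new to SAEs, here is a toy sketch of the encode, edit, decode pattern that feature ablation and feature steering rely on. It is pure Python with invented weights and does not use the Qwen-Scope release or its API; in a real setup the edit would be applied to a layer's activations inside the model's forward pass.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae_encode(x, w_enc, b_enc):
    """Sparse feature activations: ReLU(W_enc @ x + b_enc)."""
    return relu([a + b for a, b in zip(matvec(w_enc, x), b_enc)])

def sae_decode(f, w_dec):
    """Reconstruction: sum of decoder rows weighted by feature activations."""
    dim = len(w_dec[0])
    return [sum(f[k] * w_dec[k][d] for k in range(len(f))) for d in range(dim)]

def edit_feature(x, w_enc, b_enc, w_dec, k, value):
    """Set feature k to `value` (0.0 ablates, a larger value steers), then decode."""
    f = sae_encode(x, w_enc, b_enc)
    f[k] = value
    return sae_decode(f, w_dec)

# Toy dictionary: 3 features over a 2-d activation (all weights hypothetical).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.5]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

x = [1.0, 1.0]
features = sae_encode(x, W_ENC, B_ENC)
ablated = edit_feature(x, W_ENC, B_ENC, W_DEC, k=1, value=0.0)
steered = edit_feature(x, W_ENC, B_ENC, W_DEC, k=1, value=3.0)
print(features, ablated, steered)
```

Ablating feature 1 removes its direction from the reconstructed activation, while steering amplifies it. This is the general mechanism behind the "suppress" and "activate" operations described in the post, shown here at toy scale.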
NandaVegg highlights the significance of the release of Sparse Autoencoders (SAEs) for the dense 27B Qwen model, noting it as potentially the largest open-source interpretability tool to date. This is in contrast to previous tools like GemmaScope, which only supported smaller models such as 9B and 2B, indicating a substantial advancement in model interpretability capabilities.
robert896r1 expresses anticipation for the release of Qwen 3.6 or community-driven adaptations of the current tools for newer iterations. This reflects a common trend in the AI community where tools and models are rapidly iterated upon, and there is a need for compatibility with the latest versions to maintain relevance and utility.
oxygen_addiction speculates on the use of feature steering in large AI models, such as ChatGPT5, suggesting that advanced routing mechanisms could be employed to select the most appropriate model for a given prompt. This points to a potential future where AI systems dynamically optimize their responses by leve