This Week in AI: Seed Dance 2.0 Breaks Reality, Gemini Does Theoretical Physics, and Open Models Catch Up

There's a benchmark in AI that nobody planned for, but everyone recognizes instantly. It's not ARC AGI. It's not SWE-bench. It's the Will Smith spaghetti test. Back in 2023, early AI video models produced this cursed clip of Will Smith struggling to eat pasta—fingers melting into noodles, physics giving up entirely, reality collapsing frame by frame. It became the poster child for "AI can't do video yet."
This week, ByteDance released Seed Dance 2.0, its updated video model. Will Smith eats spaghetti. Normally. The fork works. The physics work. The lips sync. It's possibly the weirdest benchmark out there—but it tells you everything you need to know about where video generation just landed. And that's just one story from a week that also brought us a Google model doing theoretical physics, coding agents running 20x faster, and open-source models that are nearly indistinguishable from the frontier labs' best. If you thought the gap between "impressive demo" and "breaks reality" was years away, this week proved otherwise.
Video Generation Crossed the Uncanny Valley
This wasn't an incremental week for AI video. This was the week video generation stopped being a party trick and started feeling dangerous.
Seed Dance 2.0 Changes Everything
ByteDance—the company behind TikTok—dropped Seed Dance 2.0, and it's the most capable video model available right now. Not "impressive for AI." Just impressive.
The model supports four input modalities: text, image, audio, and video. No other video model on the market gives you all four. It generates 15-second high-quality multi-shot clips with dual-channel audio and lip-syncing that's better than anything we've seen before. The realism isn't just good—it's crossing into territory where you have to pause and ask yourself if what you're watching is real.
Here's what people are generating with it: UGC product videos that look like they were shot on an iPhone. Motion graphics that look professionally animated. Celebrity deepfakes with spot-on voice cloning. Trademarked IP like Lord of the Rings, SpongeBob, and One Piece—all rendered with zero friction because ByteDance, being a Chinese company, apparently doesn't care about US copyright law the way American companies do.
One user built an OpenClaw agent that crawled a product page, extracted photos and specs, then fed everything into Seed Dance 2.0 to auto-generate a product ad. Another created a 15-second Lord of the Rings scene where Frodo suggests taking the eagles to Mount Doom. Another made Luffy from One Piece throw an Apple laptop overboard. The quality isn't "good for AI." It's just good.
And that's the problem—or the opportunity, depending on where you sit. Sora, Veo, and other US-based video models are locked down with copyright protections. They won't let you generate Mickey Mouse or Marvel characters. ByteDance? They don't seem to care. Which means their models can do things American models legally can't. And that gap is going to get wider, not smaller.
The model isn't available in the US yet—supposedly rolling out February 24th—but people have been finding workarounds, most of which ByteDance has since closed. If you're outside the US, you can access it now. If you're in the US, you're waiting.
Kling 3.0 Gets Accessible
Last week, we talked about Kling 3.0—the ultra-realistic video model that was taking hours to generate a single clip. This week, it showed up inside Leonardo AI, and suddenly it's actually usable.
One tester generated a video with Kling 3.0 on Krea.ai and waited over an hour. The same prompt in Leonardo? A couple of minutes. The model's the same. The infrastructure's just better. And now you can actually use it without burning half your day waiting for a render.
The output quality is strong—15-second clips, realistic motion, solid character consistency. It's not quite Seed Dance 2.0 levels of "wait, is this real?" but it's close. And unlike Seed Dance, it's accessible in the US right now.
Workflows for Power Users
Both Runway and Krea AI rolled out new workflow features this week—node-based visual systems that let you build multi-step video generation pipelines.
Runway's Story Panels lets you create a catalog of shots with consistent characters, locations, and styles. You feed it images, connect them in a workflow, and it maintains consistency across scenes. Krea's Prompt to Workflow does something similar—takes your prompt and breaks it into a ComfyUI-style node graph where you can fine-tune every step.
These aren't for casual users. If you just want to type a prompt and get a video, the standard interfaces still work fine. But if you're a power user who needs precise control over characters, camera angles, and scene transitions, these node-based systems give you that control. Think of it as the difference between auto mode on a camera and shooting full manual.
The Speed Wars
Speed became a competitive advantage this week. Not just "faster inference." We're talking about models that generate results so fast it feels like real-time interaction.
GPT 5.3 Codex Spark Is Absurdly Fast
OpenAI dropped GPT 5.3 Codex Spark this week, and its defining feature isn't that it's smarter. It's that it's faster. Way faster.
The model uses Cerebras chips, which specialize in ultra-fast inference. When you give it a coding prompt, the response comes back at over 1,000 tokens per second. That's 20x faster than the standard GPT 5.3 Codex model.
OpenAI released a side-by-side demo. Same prompt: "Build a simple HTML snake game." Standard Codex takes 45 seconds. Codex Spark? Six seconds. Done. Tested. Playable.
One tester gave it a prompt to build a Vampire Survivors clone—a full game with XP mechanics, upgrades, enemies, and auto-firing weapons. Time to completion? 50 seconds. The game worked. You could play it. Level up. Pick upgrades. The logic was there. No sound effects, minimal graphics, but the core loop? Functional.
This isn't just impressive—it changes how you interact with coding agents. You're not waiting 30 seconds between iterations anymore. You're prompting, getting a result, testing, prompting again—all in the time it used to take for a single response. It feels like pair programming with someone who types faster than you can think.
The model is only available to ChatGPT Pro users ($200/month) and works inside the Codex app, CLI, and VS Code extension. It's not as accurate as the full GPT 5.3 Codex model—speed comes with tradeoffs—but for rapid prototyping and iterative vibe coding, it's a different experience entirely.
What This Means for Vibe Coding
We've been talking about vibe coding for months—the idea that you describe what you want and an AI builds it. But the friction has always been the wait time. You prompt, you wait, you test, you prompt again, you wait again. Each cycle takes 30–60 seconds, and that adds up fast.
Codex Spark collapses that loop. You're not waiting anymore. You're iterating in near real-time. And that fundamentally changes what you can build in a single session. A game that would've taken an hour of back-and-forth? You're doing it in 10 minutes now. A landing page with multiple sections and interactions? Five minutes, maybe less.
The limiting factor isn't the model anymore. It's how fast you can think of the next prompt.
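Here's the back-of-envelope version, using the demo numbers above and assuming a flat minute of human time per cycle for reading the output, testing it, and typing the next prompt (that minute is an assumption, not a measurement):

```python
# Rough estimate: how many prompt -> generate -> test cycles fit in one hour.
# Generation times come from the snake-game demo; the 60 seconds of human
# review/typing per cycle is an assumed constant for illustration only.
HUMAN_SECONDS_PER_CYCLE = 60      # assumed: read output, test, write the next prompt
STANDARD_CODEX_SECONDS = 45       # standard GPT 5.3 Codex, from the demo
SPARK_SECONDS = 6                 # GPT 5.3 Codex Spark, from the demo

def cycles_per_hour(generation_seconds: float) -> float:
    return 3600 / (generation_seconds + HUMAN_SECONDS_PER_CYCLE)

print(f"Standard Codex: {cycles_per_hour(STANDARD_CODEX_SECONDS):.0f} cycles/hour")  # ~34
print(f"Codex Spark:    {cycles_per_hour(SPARK_SECONDS):.0f} cycles/hour")           # ~55
```

Once generation drops to single-digit seconds, almost all of each cycle is you, not the model.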
Google Goes Full Physics Nerd
While everyone else was optimizing for speed or realism, Google decided to build a model that solves theoretical physics problems. Because of course they did.
Gemini 3 Deep Think Dominates Benchmarks
Gemini 3 Deep Think is a specialized model designed for advanced science and research. And when we say "advanced," we mean it. This thing scored gold medals on the 2025 International Physics Olympiad and Chemistry Olympiad. It achieved a 50.5% score on the Condensed Matter Theory benchmark, which tests proficiency in advanced theoretical physics.
On ARC AGI 2—a reasoning benchmark that tests whether AI can learn new patterns it's never seen before—Gemini 3 Deep Think crushed everything. It didn't just beat Claude Opus 4.6 and GPT 5.2. It destroyed them, and not by a small margin. It's dominant.
ARC AGI tests visual reasoning by showing an AI a pattern, then asking it to apply that pattern to a new problem. For humans, it's relatively easy. For AI models, it's historically been nearly impossible because models can't "learn" after training—they're frozen. But Gemini 3 Deep Think has some kind of emergent ability to absorb new information and apply it in ways other models can't.
On Humanity's Last Exam—a benchmark filled with obscure scientific questions like translating Roman inscriptions or knowing how many paired tendons a specific hummingbird bone supports—Gemini 3 Deep Think again leads the pack. These aren't questions most humans could answer. But Gemini 3 Deep Think answers them consistently.
On Codeforces, a competitive coding benchmark, it hit an Elo rating of 3,455. That puts it among the top 10 competitive programmers in the world. Not "good for AI." Top 10 globally.
The $250/Month Problem
Here's the catch: Gemini 3 Deep Think is only available to Google AI Ultra subscribers—the $250/month tier. Most people don't need a model that solves condensed matter theory problems. But if you're working in scientific research, advanced engineering, or competitive coding, this is the most capable model available right now.
The challenge is that most users won't have use cases that justify the cost. Unless you're asking theoretical physics questions or solving Olympiad-level chemistry problems, you're paying for capabilities you'll never use. But for the people who do need it? There's nothing else close.
Open-Source Nearly Catches Up
For years, the narrative has been that open-source models are "pretty good" but always a generation behind frontier labs. This week made that narrative a lot harder to defend.
GLM5 Builds a Game Boy Emulator in 24 Hours
GLM5 from Zhipu AI is an open-source model that's benchmarking at near-frontier levels. On Humanity's Last Exam, it scored 50.4 with tools—beating Claude Opus 4.5, Gemini 3 Pro, and GPT 5.2 with tools. On SWE-bench Verified and Browse Comp (testing agentic browsing), it's on par with the leading closed models.
But benchmarks are one thing. Real-world demos are another.
A research team gave GLM5 a goal: build a working Game Boy Advance emulator with a 3D graphical interface. They gave it a system prompt, hardware documentation, and 24 hours. Then they stepped back.
The model built the emulator. Autonomously. It created a testing loop—coded, tested, logged results, identified bugs, fixed them, repeated. It didn't wait for human input. It didn't ask for help. It just executed a 24-hour loop of iteration until the emulator worked.
The final result? A fully functional Game Boy emulator with a movable 3D interface, controller support, and the ability to load ROMs and play games. Built entirely by an AI agent working autonomously for 24 hours.
This is the shift everyone's been talking about: you don't give the model a task anymore. You give it a goal. The model makes a plan, executes, tests, adjusts, and keeps going until it hits the goal. No human in the loop. Just autonomous iteration.
And this is an open-source model. You can run it yourself if you've got the hardware (two M3 Ultra Mac Studios with 512GB RAM each, about $20K total). That's not consumer-grade, but it's also not a locked-down API you're renting access to.
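If you do run it behind a local, OpenAI-compatible server (llama.cpp, vLLM, and LM Studio all expose one), pointing your tooling at it takes a few lines. A minimal sketch; the port and model name here are placeholders, not official GLM5 values:

```python
from openai import OpenAI

# Assumes a local server exposing the OpenAI-compatible /v1 API
# (llama.cpp, vLLM, LM Studio, etc.). URL and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="glm-5",  # whatever name your local server registers for the weights
    messages=[
        {"role": "system", "content": "You are a coding agent. Plan first, then act."},
        {"role": "user", "content": "Write a minimal Game Boy ROM header parser in Python."},
    ],
)
print(response.choices[0].message.content)
```

The point isn't the snippet; it's that any agent framework speaking the OpenAI API can swap a rented frontier model for weights you control.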
Minimax M2.5: $1/Hour Continuous Run
Minimax M2.5 is another open-source model punching at frontier level. On SWE-bench, it matches Claude Opus 4.6. On Browse Comp, it comes in second. And it's optimized to be absurdly cheap.
Run the model continuously for an hour at 100 tokens per second? One dollar. Run it slower? 30 cents. Want to run four instances continuously for an entire year? $10,000.
Compare that to Claude Opus 4.6, which costs roughly 120x more. Minimax isn't just competitive—it's a fraction of the cost.
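The yearly figure lines up as a back-of-envelope calculation if you assume the slower rate for sustained runs (the exact billing model is our assumption, not confirmed pricing):

```python
# Back-of-envelope for Minimax M2.5 running costs, using the rates quoted above.
HOURS_PER_YEAR = 24 * 365             # 8,760 hours
FAST_RATE = 1.00                      # $/hour at ~100 tokens per second
SLOW_RATE = 0.30                      # $/hour at the slower rate

one_instance_fast = FAST_RATE * HOURS_PER_YEAR          # ~$8,760 per year
four_instances_slow = 4 * SLOW_RATE * HOURS_PER_YEAR    # ~$10,512 per year, the ~$10K figure

print(f"1 instance, fast rate:  ${one_instance_fast:,.0f}/year")
print(f"4 instances, slow rate: ${four_instances_slow:,.0f}/year")
```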
The model is designed for complex agentic workflows—office work, coding, data analysis, multi-step tool use. You can upload a zip file of 100 invoices and ask it to organize everything into a spreadsheet. It handles it. You can ask it to create a 15-slide PowerPoint presentation on coffee chain financials. It builds it—properly formatted, branded, designed.
This is what cost-efficient frontier performance looks like. You're not sacrificing much capability. You're just paying 1% of what you'd pay for a closed model.
The "Goal Not Task" Shift
Both GLM5 and Minimax M2.5 represent a shift in how we interact with AI. You're not breaking tasks into discrete steps anymore. You're setting a goal and letting the agent figure out the path.
Want a landing page? Don't prompt it step-by-step. Give it the goal, provide some context, and let it plan the copy, design, images, and code autonomously. Want a financial analysis? Upload the data, describe what you need, walk away. The agent handles the rest.
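In code, that shift is a loop, not a script. Here's a minimal sketch of the pattern; the generate and evaluate helpers are hypothetical stand-ins for model calls and a real test runner, not GLM5's or Minimax's actual harness:

```python
import time

# Hypothetical stand-ins: a real harness would have the model draft code here
# and actually run a test suite, feeding the logs back into the next attempt.
def generate_attempt(goal: str, feedback: str) -> str:
    return f"candidate solution for: {goal} (informed by: {feedback or 'nothing yet'})"

def evaluate(candidate: str) -> tuple[bool, str]:
    return ("emulator" in candidate, "placeholder test log")

def run_until_goal(goal: str, time_budget_hours: float = 24.0) -> str | None:
    """Goal-driven loop: generate, test, feed failures back, repeat until success or timeout."""
    deadline = time.time() + time_budget_hours * 3600
    feedback = ""
    while time.time() < deadline:
        candidate = generate_attempt(goal, feedback)
        passed, log = evaluate(candidate)
        if passed:
            return candidate   # goal met, with no human in the loop at any point
        feedback = log         # adjust the next attempt using the failure log
    return None                # time budget exhausted

print(run_until_goal("working Game Boy Advance emulator with a 3D interface"))
```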
This is where open-source models are starting to close the gap. They're not just "good enough" anymore. They're legitimately competitive with the best closed models—and in some cases, they're better at sustained autonomous work.
Audio Finally Works
Audio generation has been lagging behind image and video for a while. This week, it caught up.
SoulX Singer Clones Singing Voices
SoulX Singer is a voice cloning model designed specifically for singing. You give it a few seconds of someone's voice and a reference song, and it makes them sing.
The demos are wild. Obama singing "Everybody Loves My Baby." A deep male voice singing "Happy Birthday." Anyone's voice singing any melody. The quality is good enough that it's not immediately obvious it's AI-generated.
The model is flexible—you can hum a melody, input lyrics, and have a specific voice sing your composition. It's small (under 3GB), runs on low-end GPUs or even a CPU, and it's fully open-source. Everything's on GitHub with instructions to download and run locally.
If you're doing cover songs, original music, or just want to mess around with AI-generated singing, this is the best tool available right now.
MOSS TTS and MOTTS
We also got two new state-of-the-art text-to-speech models this week: MOSS TTS and MOTTS.
MOSS TTS supports multiple languages and is exceptionally good at voice cloning. A few seconds of reference audio and it can replicate tone, cadence, and style with high accuracy. It beats Qwen Audio on similarity scores and supports expressive multi-speaker dialogues—perfect for podcasts or audiobooks.
MOTTS is more specialized—English and Japanese only—but it's ultra-lightweight (244MB) and optimized for natural, expressive speech. It sounds less robotic than most TTS models, and because it's so small, you can run it on a CPU with no issues.
Both models are open-source and available now. If you need TTS for production work, you've got two solid new options.
Just Dub It: Auto-Dubbing with Lip Sync
Just Dub It takes a video and dubs it into another language while applying lip-sync so it looks like the person is actually speaking the new language.
French, Portuguese, German—whatever language you need, it translates the audio and adjusts the lip movements to match. The quality is good enough for social media content, educational videos, or localized marketing.
It's built on LTX2, an open-source video model with native audio support, which means it's fast and can run on consumer GPUs. The GitHub repo is live with full instructions. If you're doing multilingual content, this is a tool worth testing.
The Stuff That Actually Matters
Not everything this week was flashy. Some updates were just useful.
Pico Claw is an ultra-efficient alternative to OpenClaw—the framework that lets you run AI agents 24/7 on a server and control them via Telegram or WhatsApp. OpenClaw worked, but it was bulky and painful to install. Pico Claw requires 99% less memory (just 10MB), boots in 1 second (400x faster), and costs 98% less to run than a Mac Mini. It does everything OpenClaw does—just leaner and faster. The GitHub repo picked up 6,000 stars within a few days of release. If you want a persistent AI agent without the overhead, this is it.
FreeFuse solves a specific but annoying problem in image generation. When you use multiple LoRAs (fine-tunes for specific styles, characters, or effects), they often conflict—faces bleed together, styles overwrite each other, the output gets distorted. FreeFuse uses adaptive token-level routing to prevent that. You can stack multiple character LoRAs, props, and styles in a single generation without errors. It supports ComfyUI and Ideogram Turbo, with Flux support coming soon. If you work with LoRAs regularly, this fixes a major pain point.
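For context, plain LoRA stacking, the setup where those conflicts show up, looks like this in a standard diffusers pipeline; the base model ID and LoRA paths are placeholders, and FreeFuse's routing itself isn't part of this snippet:

```python
import torch
from diffusers import AutoPipelineForText2Image

# Placeholder base model and LoRA paths; this is plain adapter stacking, not FreeFuse.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("path/to/character_lora", adapter_name="character")
pipe.load_lora_weights("path/to/style_lora", adapter_name="style")

# Both adapters active at once; with overlapping concepts this is where faces
# and styles start bleeding into each other.
pipe.set_adapters(["character", "style"], adapter_weights=[0.9, 0.6])

image = pipe("portrait of the character, illustrated style", num_inference_steps=30).images[0]
image.save("stacked_loras.png")
```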
NanBeige 4.13B is a tiny 3-billion-parameter model that punches way above its weight. It scores 12.6 on Humanity's Last Exam without tools—impressive for something so small. It can handle 500+ rounds of tool invocations, meaning you can set it on an autonomous task and let it work for hours. It's under 8GB, so it fits comfortably on most consumer GPUs. If you need a capable local model that doesn't require enterprise hardware, this is one of the best options available.
Nvidia's DuoGen is a sequential image model designed for step-by-step tutorials. You ask it to create a cooking tutorial or a DIY guide, and it generates a series of images with accompanying text—each step visually consistent with the last. It can also do standard image editing (swapping clothes, changing backgrounds), but its real value is in generating coherent multi-step sequences. Useful for instructional content, training data for robots, or any workflow that needs visual consistency across multiple frames. It's currently just a research paper—no public release yet—but worth watching.
Qwen Image 2.0 launched this week with 2K native resolution, better typography, and faster inference. It's particularly strong at generating infographics, diagrams, and multi-element compositions. The model can take a complex prompt specifying exact text, layouts, data tables, and images, then generate everything accurately in one shot. It's available now at chat.qwen.ai, though the interface is a bit clunky. If you need precise control over text-heavy image generation, it's worth testing.
DeepGen 1.0 is another new image model that beats top competitors on text rendering and editing benchmarks. It can generate images, edit existing ones, solve mazes, predict what happens next in a scene, or generate alternate views. It's on par with Ideogram Turbo and even beats Ideogram on some editing tasks. The model is open-source, but it's 72GB—too big for most consumer setups. If you've got the hardware, it's a strong alternative to closed models like Midjourney or DALL-E.
What Agencies Do Next
Here's what actually matters if you're running an agency or building products.
- Wait for Seed Dance 2.0 US access on February 24th, then test it immediately. This is the first video model that's good enough to use in client campaigns without heavy disclaimers. Test it for social content, ad concepts, and UGC-style videos. See where it fits in your production pipeline. You're not replacing a full video crew yet, but you're also not limited to stock footage anymore.
- Run speed tests with GPT 5.3 Codex Spark. If you're already using Codex or other AI coding tools, test Spark for rapid prototyping workflows. The 20x speed boost changes how fast you can iterate on ideas. It's $200/month (ChatGPT Pro), but if you're building tools or prototypes regularly, the time savings might justify the cost.
- Evaluate open-source models for cost savings. GLM5 and Minimax M2.5 are performing at near-frontier levels for a fraction of the cost. If you're running agents that burn through API calls, switching to an open model could cut your costs by 90%. Test them on real workflows—data analysis, document generation, coding tasks—and compare output quality to what you're paying for now.
- Test Pico Claw for persistent agents. If you've been wanting to set up an always-on AI assistant that you can message via Telegram or WhatsApp, Pico Claw makes it accessible. It's cheaper, faster, and easier to deploy than OpenClaw. Use it for personal productivity, client notifications, or workflow automation.
- Ignore the hype around Gemini 3 Deep Think unless you need it. It's an incredible model, but unless you're solving theoretical physics problems or competing in coding Olympiads, you're paying $250/month for capabilities you won't use. Stick with the standard Gemini 3 or Opus 4.6 unless you have a specific scientific use case.
- Don't sleep on audio tools. SoulX Singer, MOSS TTS, and Just Dub It are all production-ready. If you're doing multilingual content, voiceovers, or localized marketing, these tools are cheaper and faster than hiring voice actors or dubbing studios. Test them on real client work and see where they fit.
Bangkok8 AI: We'll show you which models break reality—not just benchmarks.
