Laravel Just Benchmarked 6 AI Models. Here Is What the Data Says.
Laravel recently published their Boost Benchmark results, testing six AI models against real Laravel tasks verified by Pest tests. The benchmark was built by Pushpak Chheda and the Laravel team to measure how well AI coding agents handle production-level Laravel work.
At Cafali, we run Claude Sonnet for routine development and Opus for complex architecture decisions. That split has worked well for over a year. But seeing Laravel's benchmark data made us rethink some assumptions. This post breaks down their findings, adds our perspective from using these models daily, and shares practical recommendations for development teams.
What Laravel Tested
Most AI benchmarks test isolated code generation: give the model a function signature, see if the output compiles. Laravel's Boost Benchmarks framework does something different. It runs AI coding agents against real Laravel applications and verifies results with two criteria:
- Functionality: Does the implementation work? Verified by real HTTP requests and Pest test assertions against a running application. Not "does it compile" but "does it actually serve the correct response."
- Architecture: Does the code follow Laravel conventions? No debug artifacts like dd() left in controllers, correct class inheritance, proper use of facades, no obvious production mistakes.
Each evaluation starts with an input/ directory containing a Laravel application skeleton. The runner copies it into a temporary workspace, points the AI agent at a prompt, lets it work, then merges a suite/ directory with Pest tests and runs them. Results get written as JSON with test outcomes, token usage, execution time, and total cost.
This is closer to how developers actually work: you have an existing codebase, a task description, and you need to produce code that passes a test suite a senior engineer wrote. If the application doesn't boot, every test fails. Just like production.
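The flow described above can be sketched as a small runner. This is an illustrative reconstruction, not Laravel's actual implementation: the directory names (input/, suite/) come from the post, but the function, its parameters, and the JSON fields are hypothetical.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def run_evaluation(eval_dir: Path, agent_cmd: list,
                   test_cmd: list = ["./vendor/bin/pest"]) -> dict:
    """Copy the app skeleton, let the agent work, merge the test suite, record results."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        # 1. Copy input/ (the Laravel application skeleton) into a fresh workspace.
        shutil.copytree(eval_dir / "input", workspace)
        start = time.time()
        # 2. Point the AI agent at the workspace and let it work on the prompt.
        subprocess.run(agent_cmd, cwd=workspace)
        # 3. Merge suite/ (the Pest tests) over the agent's output and run it.
        shutil.copytree(eval_dir / "suite", workspace, dirs_exist_ok=True)
        tests = subprocess.run(test_cmd, cwd=workspace)
        # 4. Emit a JSON-serializable result record (fields are illustrative).
        return {
            "passed": tests.returncode == 0,
            "seconds": round(time.time() - start, 1),
        }
```

A real harness would also capture per-test outcomes, token usage, and cost, as the benchmark's JSON output does; the sketch keeps only the pass/fail and timing skeleton.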
They tested six models: Haiku 4.5, Sonnet 4.6, and Opus 4.6 from Anthropic; Kimi K2.5 from Moonshot AI; and GPT-5.3 Codex and GPT-5.4 from OpenAI.
The Results
| Model | Tests Passed (with Boost) | Tests Passed (without Boost) | Delta | Test Accuracy | Avg Time (Boost) | Avg Time (No Boost) |
|---|---|---|---|---|---|---|
| Haiku 4.5 | 267/315 | 231/315 | +36 | 84.8% | 145s | 164s |
| Sonnet 4.6 | 297/315 | 296/315 | +1 | 94.3% | 179s | 208s |
| Kimi K2.5 | 298/315 | 270/315 | +28 | 94.6% | 108s | 122s |
| GPT-5.3 Codex | 313/315 | 298/315 | +15 | 99.4% | 191s | 175s |
| GPT-5.4 | 313/315 | 299/315 | +13 | 99.4% | 210s | 202s |
| Opus 4.6 | 301/315 | 275/315 | +26 | 95.6% | 217s | 275s |
Source: Laravel Boost Benchmarks
Here is how we read these results based on our daily experience with these models.
The Claude Lineup: Haiku, Sonnet, and Opus
The Anthropic models reveal a clear value curve that maps directly to how we use them at Cafali.
Haiku 4.5 (84.8% accuracy) handles simple tasks but breaks on complex evaluations. It passed 267 out of 315 tests with Boost enabled, and only 231 without it. That +36 delta is the largest improvement from Boost of any model tested, which tells you Haiku relies heavily on external context to compensate for its smaller parameter count.
We use Haiku for code explanations, documentation generation, and simple refactors where correctness is easy to verify visually. We don't let it scaffold features.
Sonnet 4.6 (94.3% accuracy) is our daily driver. It reached 297/315 tests passed and showed the smallest Boost delta (+1), meaning it already knows Laravel well enough that additional MCP context barely moves the needle on correctness. What Boost did improve was speed: average time dropped from 208s to 179s, a 14% speedup.
For teams that need one model for Laravel work, Sonnet is the answer based on both the benchmark data and our experience. The accuracy-to-cost ratio is unmatched.
Opus 4.6 (95.6% accuracy) reached 301/315 with the largest absolute test count among Claude models. Its Boost delta (+26) was significant, meaning complex tasks genuinely benefit from the additional framework context.
That 1.3-point gap between Sonnet and Opus looks small on paper. In our experience, it shows up on specific task types: service provider wiring, complex Inertia shared data configurations, multi-step Livewire integrations, Pennant feature flag setups. These are the architectural decisions where a wrong choice cascades through the entire application.
We use Opus at Cafali when the cost of getting it wrong exceeds the cost of the model. Database schema design, authentication flows, multi-tenant architecture. For those decisions, the accuracy gap represents hours of debugging saved.
The Surprise: Kimi K2.5
Kimi K2.5 from Moonshot AI scored 94.6% accuracy, essentially matching Sonnet and nearly matching Opus, with an average run time of just 108 seconds. That is almost half the time of Opus (217s) and significantly faster than GPT-5.3 Codex (191s).
The Boost delta was +28, meaning Kimi benefits substantially from MCP context. Without Boost, it drops to 270/315 (85.7%), which puts it closer to Haiku territory. With Boost, it competes with Sonnet and Opus.
For teams that process high volumes of AI-assisted tasks and optimize for throughput, Kimi deserves evaluation. The tradeoff is ecosystem maturity: Anthropic and OpenAI have larger communities, more tooling, and more established enterprise support. But on raw Laravel performance, Kimi is competitive at a fraction of the time.
GPT-5.3 Codex: The Accuracy Leader
GPT-5.3 Codex and GPT-5.4 both hit 99.4% accuracy, passing 313 out of 315 tests. Two tests failed across the entire benchmark suite. GPT-5.3 was faster at 191 seconds versus 210 for GPT-5.4, making it the clear winner between the two: same accuracy, less time, lower cost.
This result caught our attention. We have been running Claude for all Laravel development for over a year. GPT-5.3 Codex hit 99.4% on the same real-world tasks where Opus scored 95.6%, a gap of 3.8 percentage points. On a 315-test suite, that is 12 additional tests passed.
We are planning to run GPT-5.3 Codex on a real client project to see if the benchmark results translate to our actual development workflow. Benchmarks measure capability; production measures reliability under real conditions.
Why Boost (MCP Context) Changes Everything
The most actionable finding from Laravel's benchmark is not which model won. It is that Laravel Boost, an MCP server that provides framework-specific context to AI coding agents, improved every single model tested.
When an AI agent encounters a Laravel-specific task, Boost provides relevant documentation, code patterns, and framework conventions through the Model Context Protocol. The agent does not need to rely on training data alone; it gets live, accurate context about how Laravel expects things to work.
The impact by model according to Laravel's data:
- Haiku: +36 tests (the biggest jump, from 231 to 267)
- Kimi K2.5: +28 tests (from 270 to 298)
- Opus: +26 tests (from 275 to 301)
- GPT-5.3 Codex: +15 tests (from 298 to 313)
- GPT-5.4: +13 tests (from 299 to 313, matching Codex)
- Sonnet: +1 test but 29 seconds faster (already knew Laravel well)
The improvements were most dramatic on complex tasks that normally send developers to the documentation: building custom MCP servers, configuring the Laravel AI SDK agent loops, setting up Pennant feature flags, wiring Inertia shared data, and integrating Folio routing.
The overhead is minimal. Across all runs, enabling Boost added roughly $0.05 to $0.20 per evaluation in token costs. At subscription pricing levels, this is negligible.
If you are writing Laravel with AI assistance and not using Boost, you are leaving measurable performance on the table regardless of which model you choose.
The Non-Determinism Problem
One finding from the benchmark that does not get enough attention: LLM behavior is non-deterministic, and the variance is significant.
Laravel observed Haiku 4.5 improving from 7/19 to 17/19 on the Livewire evaluation with Boost enabled, just by running the same evaluation again with identical inputs. GPT-5.3 Codex flipped from fail to pass on four evaluations in a single rerun. Same model, same prompt, same setup, different results.
This has practical implications for development teams:
- A single test run is not definitive. If you are evaluating AI models for your team, run each test multiple times and take the majority result.
- CI/CD with AI-generated code needs retry logic. A failing AI-assisted task might pass on a second attempt, which means your pipeline should accommodate retries before escalating to a human.
- Code review remains essential. Non-deterministic output means you cannot assume consistent quality. Every pull request, whether human or AI-generated, needs the same review rigor.
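The first two points above can be sketched as two small helpers: a majority vote over repeated evaluation runs, and CI-style retry logic that escalates to a human after exhausting attempts. Both are illustrative patterns, not part of Laravel's benchmark tooling; the function names and defaults are our own.

```python
from collections import Counter


def majority_result(run_once, attempts: int = 3) -> bool:
    """Run a non-deterministic check several times and take the majority verdict.

    Use an odd number of attempts to avoid ties; a tie counts as a failure here.
    """
    votes = Counter(bool(run_once()) for _ in range(attempts))
    return votes[True] > votes[False]


def run_with_retries(task, max_retries: int = 2) -> bool:
    """CI-style retry: give a flaky AI-assisted task a few attempts before escalating."""
    for _ in range(max_retries + 1):
        if task():
            return True
    return False  # exhausted retries: escalate to human review
```

In practice, `run_once` and `task` would wrap a full evaluation run (agent plus test suite), which makes reruns expensive; the majority-vote approach is worth the cost when you are choosing a model for your team, while plain retries fit routine pipeline runs.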
At Cafali, we treat AI-generated code the same way we treat code from a new team member: trust but verify. The benchmark data validates that approach.
The Failure That Teaches the Most
One result from the benchmark stood out: Haiku 4.5 scored 0 out of 13 on the MCP Server evaluation without Boost. Zero. Not a low score. Zero.
The model misconfigured a service provider. The Laravel application never booted. Every single test failed, not because the logic was wrong, but because the app could not start.
The generated code looked correct in isolation. The controller logic was sound. The routes made sense. But a single binding error in the service provider's register() method meant the IoC container could not resolve the dependencies, and the entire application crashed on boot.
This pattern is the most common failure mode in AI-assisted Laravel development in our experience. Common examples:
- A service provider that binds an interface to a class that does not exist yet
- A migration that references a column type incompatible with the configured database driver
- An Inertia page component registered with a path that does not match the Vue/React file structure
- A Livewire component with a mount() method that type-hints a model the route does not inject
None of these errors show up in static analysis. They only surface when the application boots and tries to resolve the dependency graph. This is why we run every AI-generated feature through a full test suite before it touches a staging environment.
Practical Recommendations
Based on Laravel's benchmark data combined with our daily experience, here is how we think about model selection:
| Model | Avg Time | Accuracy | Best For |
|---|---|---|---|
| Kimi K2.5 | 108s | 94.6% | High-volume routine tasks, prototyping |
| Haiku 4.5 | 145s | 84.8% | Documentation, explanations, simple refactors |
| Sonnet 4.6 | 179s | 94.3% | Daily development (best value) |
| GPT-5.3 Codex | 191s | 99.4% | Maximum accuracy on critical features |
| GPT-5.4 | 210s | 99.4% | Same as 5.3 but slower (skip this one) |
| Opus 4.6 | 217s | 95.6% | Architecture decisions, complex integrations |
Our recommendation: use the fastest model that meets your accuracy threshold for the task at hand. Do not use Opus for a route definition. Do not use Haiku for a database migration.
What We Are Adjusting at Cafali
Based on the benchmark results, here is what we are changing:
- Keeping Sonnet 4.6 as our daily driver. The benchmark confirms it is the best value in the Claude lineup.
- Keeping Opus 4.6 for architecture decisions. The accuracy gap is small on benchmarks, but architectural context is where Opus justifies its premium in practice.
- Testing GPT-5.3 Codex on a real project. 99.4% accuracy on Laravel-specific tasks deserves a real evaluation.
- Making Boost mandatory. The data is clear: context improves every model.
- Watching Kimi K2.5. If speed matters more than the last 5% of accuracy, it is a legitimate option.
Three Questions for Your Team
If you lead a development team using AI coding assistants, the benchmark gives you a framework for making model decisions based on data:
- What is your accuracy threshold? If you need 99%+, GPT-5.3 Codex leads. If 94-95% is acceptable with human review, Sonnet or Kimi save time and money.
- Are you providing framework context? Boost improved every model. If your team is not using MCP-based context servers, you are getting worse results than you should be.
- Do you have a review process for AI-generated code? Non-deterministic output means consistent quality is not guaranteed. Without review, a failure like Haiku's 0/13 boot crash is a matter of when, not if.
The Bottom Line
The best AI model for Laravel depends on what you are optimizing for. GPT-5.3 Codex leads on raw accuracy in this benchmark. Kimi K2.5 leads on speed. Sonnet 4.6 offers the best balance for daily work.
But the single biggest improvement comes from giving any model better context. A capable model with relevant framework context outperforms a more powerful model working from training data alone. Invest in your tooling before you upgrade your model.
At Cafali, we evaluate our tools with the same rigor we bring to client projects. Building with Laravel and want a team that stays current? Let's talk.
All benchmark data in this post comes from Laravel's Boost Benchmarks by @pushpak1300 and the Laravel team. The evaluation framework is being open-sourced. Our analysis and recommendations reflect our experience using these models at Cafali and should not be taken as independent benchmark results.