End to End Testing... A tricky area for current gen LLMs?

I spent some time this weekend trying to resolve E2E test issues. I’m using Playwright, but given my experience I suspect I’d have run into the same issues with Cypress. I started out using gpt-5 (thinking) for this and it had a LOT of trouble, then tried claude-4-sonnet, which did better but still wasn’t doing great. So I finally switched to claude-4-sonnet (thinking), which handled ALL forms of automated testing quite well.

At least, initially. With unit testing it was bang-on and produced excellent unit tests. (For the most part, so does gpt-5.) However, as my E2E test suite gained complexity, even c4s :brain: started having problems. At first I thought maybe there was too much context (I had one chat go pretty long), so I started creating new ones with clean context. The issues persisted no matter what I had the agent do: flaky tests, and tests that simply couldn’t get past certain stages (mostly just logging in, but a couple couldn’t get past some later stages of the UI workflows).
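To make concrete what I wanted the suites to converge on for login (a minimal sketch, not my actual code - the file path, URL, and selectors are placeholders): Playwright lets you log in once in a setup step, save the storage state, and have every test start already authenticated instead of re-running the login UI each time.

```ts
// auth.setup.ts - minimal sketch; file path, routes, and selectors are placeholders.
import { test as setup } from '@playwright/test';

const authFile = 'playwright/.auth/user.json';

setup('authenticate', async ({ page }) => {
  // Run the login UI flow once...
  await page.goto('/login');
  await page.getByLabel('Username').fill(process.env.E2E_USER ?? 'test-user');
  await page.getByLabel('Password').fill(process.env.E2E_PASS ?? 'test-pass');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await page.waitForURL('**/dashboard');

  // ...then save the session so other tests can reuse it via storageState in the config.
  await page.context().storageState({ path: authFile });
});
```

In playwright.config.ts the other projects then point storageState at that file and declare a dependency on this setup project, so the “can’t get past logging in” step only happens in one place.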

I started checking out the code myself, and while at a syntactical level it was clean, from a test design standpoint it was an absolute disaster. No two test suites were designed the same way, they each made different assumptions, some used this architecture of setting up fake auth data first, some actually tried to log in, every single test case was using different route mocks, etc. When I tried to have the agent share some of the route mocks, that fixed some problems, but then a whole slew of other problems cropped up and different tests started failing.
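Something like this is what I was nudging it toward (a rough sketch; the endpoint and payload are made up): define the route mocks once in a shared fixture and have every suite import the same test object, instead of each spec rolling its own mocks.

```ts
// fixtures.ts - hypothetical shared fixture so every suite mocks the same routes the same way.
import { test as base, expect } from '@playwright/test';

export const test = base.extend({
  page: async ({ page }, use) => {
    // One canonical mock for the current-user endpoint, shared by all suites.
    await page.route('**/api/users/me', route =>
      route.fulfill({
        status: 200,
        contentType: 'application/json',
        body: JSON.stringify({ id: 1, name: 'Test Admin', role: 'admin' }),
      }),
    );
    await use(page);
  },
});

export { expect };
```

Keeping the agent pointed at one shared file like this across many suites is exactly the part it kept failing at.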

I spent a while trying to sort this out, several hours. I even went with a clean slate, put together E2E testing rules, and started creating isolated test suites for each aspect of the web site to test, but once multiple suites were in play, the agent just couldn’t seem to handle it all.

I decided against continuing and tabled the entire E2E test suite for now. However, the experience made me wonder: are LLMs currently at some kind of disadvantage when it comes to E2E testing? The sheer chaos in the DESIGN of the test cases was what really caught me. GPT or Claude, neither seemed to really grasp the concept well enough to avoid problems running the full suite (running individual test cases, and sometimes individual suites, would work).

I wonder if this is a…context or scope problem? Can an agent actually grasp the “WHOLE” here, well enough to produce a comprehensive set of E2E test suites that can run together without issues? Can an LLM even grasp one portion of a complex web app (internal administrative sites, in this case) well enough to understand what it takes to verify an entire workflow, from logging in, through navigating to the desired functionality, using that functionality, and then handling both success and failure cases, all in one go? Or is this just a level of complexity that is beyond current LLMs?

Testing in general is terrible with LLMs. They are obsessed with writing tests to match the code - whatever the code does. Which is the dumbest possible thing.

I’ve had very bad luck with even writing tests before the logic, where I then try to have the LLM implement code to meet the tests - it will almost invariably try to change the test to match the code it wrote wrong. Or you specify the test, and it immediately writes the logic and then the test to match it. Even if you tell it to use test-driven development, it just doesn’t want to do it. It’s obsessed with writing tests to match the code regardless of what the code is supposed to do.

Yeah, I’ve had this problem with more than just LLMs. :winking_face_with_tongue: Juniors do that, too. :smiley:

However, I understand what you are describing. It’s not unexpected either. Without EXTREMELY good specifications, it’s tough to have anyone generate tests before code in any context. I have been getting better results lately, having implemented a number of rules around unit testing. It’s not perfect, and it is an ongoing journey, but I’ve made some key headway in the last few weeks.

A big factor was explaining, and codifying, the meaning of “pass CORRECTLY.” I have long used this term with juniors I mentor, and now I’m using it with the agent+LLMs. There is passing, and there is passing CORRECTLY. The latter is what you want. It is entirely possible for tests to pass incorrectly, or to pass for unexpected reasons. I’ve been using this term consistently since I first started using Cursor, and I think the agent is starting to pick up on its real meaning and actually accounting for this fundamental concept as it generates tests. I often still have to nudge it to factor it in, but I have noticed several times now that it’s actually mentioned passing correctly, and then iterated back over tests it originally implemented to redesign them according to specific story requirements.
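A toy illustration of the distinction (hypothetical applyDiscount function, Jest/Vitest-style syntax):

```ts
import { describe, it, expect } from 'vitest';
import { applyDiscount } from './pricing'; // hypothetical module under test

describe('applyDiscount', () => {
  // Passes, but not CORRECTLY: it just mirrors whatever the implementation returns,
  // so it would still pass if the discount math were completely wrong.
  it('returns whatever the code returns', () => {
    expect(applyDiscount(100, 0.2)).toBe(applyDiscount(100, 0.2));
  });

  // Passes CORRECTLY: pinned to the story's requirement, not to the implementation.
  it('applies a 20% discount to a $100 order', () => {
    expect(applyDiscount(100, 0.2)).toBe(80);
  });
});
```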

It wasn’t something that happened overnight, though…this has been a journey of a couple of months to actually see the agent+model itself mention passing correctly on its own. I don’t think this is because the model itself has incorporated any of this…I think it’s because Cursor maintains long-term memory of all your chats, and somehow that is factored into future behavior (there is, at the very least, a very extensive database of prior chat content called memory in the local data Cursor collects).

It ain’t just “juniors”, I’ve seen it from “seniors” from toptal… who are just juniors playing make believe.

Sadly, unit testing is not a widespread dev skill, top to bottom. Throughout my career, I’ve met FAR more devs, senior on down, who hated testing…than those who liked it, and barely a handful of devs who really understood it, loved it, loved TDD, etc. It’s not a common trait, IME, and if more devs than not hate testing, they won’t really have skill in it.

To Pass CORRECTLY!

Should be a freakin book! :stuck_out_tongue:

The problem with testing is that you get some Jira Jockey project manager in the way, and someone demanding 100% code coverage. So you end up writing tests to satisfy both of them - get stuff done fast and make it report 100% code coverage. Which does nothing. Then devs never see the actual value in testing.

It sure is nice when you change that one thing that seems innocuous - and then see tests breaking all over because you had no idea it would cause that to break.

But it is entirely possible to write useless tests. It also requires more thinking and planning to write testable code. Even if I am not writing comprehensive tests on a project, I’m now at the point where I’m writing testable code by default - and unsurprisingly it also makes for better code that’s more easily maintained. Now if only LLMs understood that too!
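A tiny sketch of what I mean by testable by default (the names and endpoint are made up): pass collaborators in instead of reaching for globals, and a unit test can hand in fakes.

```ts
// Hard to unit test: the function grabs the real clock and the network itself.
export async function untestableReport(): Promise<string> {
  const today = new Date().toISOString().slice(0, 10);
  const res = await fetch(`https://api.example.com/sales?date=${today}`);
  const { total } = await res.json();
  return `Sales for ${today}: ${total}`;
}

// Testable by default: collaborators are injected, so a test can pass in a fake
// fetchSales and a fixed clock, then assert on the result directly.
export async function salesReport(
  fetchSales: (date: string) => Promise<{ total: number }>,
  now: () => Date = () => new Date(),
): Promise<string> {
  const today = now().toISOString().slice(0, 10);
  const { total } = await fetchSales(today);
  return `Sales for ${today}: ${total}`;
}
```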

Such is the goal of TDD. When you write the tests first, the code has to be testable. :stuck_out_tongue:

Thankfully no one on the project I’m working on now is a jockey. Most are vibe coders, some are more like me, agentically assisted. :slight_smile: So we don’t have the pressure to generate 100% coverage or any of that. In fact, I think most of the vibers just test manually and don’t have much in the way of automated tests. I like unit tests, as they do catch issues…not as many as if I were indeed following TDD, but enough.

Sadly, I haven’t really figured out a way to integrate a true tests-first approach with an agent. The agent+model don’t really seem to grasp that concept well. They are grasping more of the “pass CORRECTLY” idea, but the journey isn’t over yet.

Some of it is planning. The stories I’m creating today are good enough to keep the agent on the rails, but I don’t think they capture enough of the requirements or acceptance criteria to actually support a real test-first TDD approach.

As such, the tests do sometimes get biased by the implementations. We are moving too fast for me to be able to resolve that any time soon, though.

Not sure if this counts as E2E testing, but I’m currently writing, as a private project, a data handler for an Android smartphone that receives data over the mini-jack. There’s an old algorithm in Java; a Python version came together pretty quickly, but writing the Kotlin implementation has been dragging on for a long time and it’s not clear what’s actually blocking it.

For debugging I wrote a PowerShell script — it runs linters, builds the app, launches it over USB, waits for a timeout, kills the app, then pulls test files and logs from the device and analyzes them. Sometimes the Agent reads the logs itself (there’s a lot of info there and the output can change, so I didn’t bother making the script auto-process them). This allows the Agent to work on the task autonomously for some time.

All the code was written through Agent Chat. But I came up with the concept of this system — it wasn’t the Agent that decided to do it that way.

I haven’t read the tests that were written, but for another Python project I also made a deep parameterized integration test suite of 1,800 tests. The code and fixtures there were written by an AI (either Claude or Gemini), but again I invented the concept, not the Agent.

So I think it’s a matter of approach.

Also, you didn’t mention Grok 4 and Gemini 2.5 Pro, which handle tasks more comprehensively than the others. Back to the Kotlin project — I created a Python sandbox in the repo to debug the algorithm, and only Grok 4 on its own decided to check the fresh data coming from the device in the sandbox without me mentioning the sandbox in the prompt.


I had no problems using TDD. On the contrary, I used it successfully when I couldn’t solve a problem quickly. But I don’t think the methodology is always the optimal choice, unless you write the tests yourself and tell the Agent to run them. Still, I think you have to be a pretty good programmer for manually writing tests to make sense instead of just prompting the model to write them.

By the way, I couldn’t write tests at all before working with the Agent (and I could hardly write code either…). Luckily I learned on that very project, where the test suite eventually grew to 1,800 items.

Also, high-coverage test suites (which the Agent can write itself) help avoid regressions when the AI accidentally(?) tries to break something that already works.

I love E2E testing with LLMs.
I brought in the best practices, and now they are creating, updating, and running the e2e tests.
They use this to validate their own work!
The results are impeccable.

I had all these same issues → I ended up building an MCP Server to solve these things:

  • Getting the AI coding agent to log in: with Playwright MCP this is a nightmare. You waste so many tokens just trying to get it to understand the auth file path (and base URL) as it greps around.

  • Generating locators: Playwright MCP (by design) uses only the a11y tree and doesn’t use the DOM. This means unless your entire application is written in clean semantic HTML, you won’t be able to generate locators that work without lots of debugging.

  • Using more than one role: if you’re trying to author e2e tests across multiple roles, like an app in which one user impacts another user and you need to verify in a test that this flow is not broken (Admin takes an action, User sees that action and takes an action, etc.), this is almost totally impossible with Playwright MCP.

Just sharing the MCP JSON should demonstrate how my MCP server works:

{
  "mcpServers": {
    "verdex": {
      "command": "npx",
      "args": [
        "@verdex/mcp@latest",
        "--role", "admin", "/path/to/admin-auth.json", “default_url_here”,
        "--role", "user", "/path/to/user-auth.json", “default_url_here”
      ]
    }
  }
}

The role, auth path, and default URL are baked into the browser at the CDP level. The coding agent doesn’t need to understand any of those pieces. I just gave it a simple tool: select_role("user") → and the coding agent doesn’t need to grep and burn tokens and time trying to understand any of this - it just works cleanly.

This means the coding agent can switch between roles in the same continuous chat: select_role("user") → select_role("admin") → select_role("user") - and the auth, role isolation, and default URL are all handled under the hood. Now the LLM can do what you want.

There’s a second set of problems with Playwright MCP that are all related to the same underlying issue: the coding agent doesn’t understand how to write e2e tests. It will write non-idiomatic Playwright code that doesn’t use built-in Playwright features. It will use manual timeouts when it could use the built-in auto-wait features, etc.
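For example (a sketch with made-up locators), the difference between the hard-coded waits the agent reaches for and Playwright’s built-in auto-waiting assertions:

```ts
import { test, expect } from '@playwright/test';

test('saving a record shows a confirmation', async ({ page }) => {
  await page.goto('/records/new');
  await page.getByRole('button', { name: 'Save' }).click();

  // What the agent tends to write: a fixed sleep plus a manual visibility check.
  // await page.waitForTimeout(3000);
  // expect(await page.locator('#toast').isVisible()).toBe(true);

  // Idiomatic Playwright: web-first assertions retry until the condition holds or times out.
  await expect(page.getByRole('alert')).toContainText('Record saved');
});
```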

I solved that with a knowledge layer - in Cursor that means rules that are triggered at certain times, and in Claude Code that means skills triggered by certain workflows. This solved 90% of my problems.

It’s still experimental but if it sounds interesting please check it out - would love any feedback good or bad:

Here’s a complete write up if you have any interest: Why AI Can't Write Good Playwright Tests (And How To Fix It) - DEV Community