I spent some time this weekend trying to resolve E2E test issues. I’m using Playwright, but given my experience I suspect I’d have run into the same issues with Cypress. I started out using gpt-5 (thinking) for this and it had a LOT of trouble, then switched to claude-4-sonnet, which did better but still wasn’t great. So I finally switched to claude-4-sonnet (thinking), which handled ALL forms of automated testing quite well overall.
At least, initially. For unit testing it was bang-on and produced excellent unit tests. (For the most part, so does gpt-5.) However, as my E2E test suite gained complexity, even c4s started having problems.
At first I thought maybe there was too much context (one chat had gone pretty long), so I started new chats with clean context. The issues, flaky tests and tests that simply couldn’t get past certain stages (mostly just logging in, though a couple stalled at later stages of the UI workflows), persisted no matter what I had the agent do.
I started checking out the code myself, and while at a syntactical level it was clean, from a test design standpoint it was an absolute disaster. No two test suites were designed the same way, and each made different assumptions: some set up fake auth data first, some actually tried to log in, and every single test case used its own route mocks. When I had the agent share some of the route mocks, that fixed some problems, but then a whole slew of other problems cropped up and different tests started failing.
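For what it’s worth, the kind of consolidation I was trying to push the agent toward looks roughly like this: one shared fixture that owns the route mocks, so no suite can invent its own variant. This is just a sketch; the /api/session endpoint and the payload are placeholders, not my actual app.

```typescript
// fixtures.ts — a sketch of shared route mocks as a single auto fixture.
// The endpoint and payload below are placeholders, not my real API.
import { test as base, expect } from '@playwright/test';

export const test = base.extend<{ mockSession: void }>({
  // Auto fixture: every test importing this `test` gets the same mock,
  // instead of each suite defining a slightly different one.
  mockSession: [
    async ({ page }, use) => {
      await page.route('**/api/session', route =>
        route.fulfill({
          status: 200,
          contentType: 'application/json',
          body: JSON.stringify({ user: 'admin', roles: ['admin'] }),
        })
      );
      await use();
    },
    { auto: true },
  ],
});

export { expect };
```

Test files then import test and expect from fixtures.ts instead of @playwright/test, which is exactly the kind of global convention the agent never settled on by itself.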
I spent several hours trying to sort this out. I even started from a clean slate: put together E2E testing rules and began creating isolated test suites for each aspect of the web site, but once multiple suites were in play, the agent just couldn’t seem to handle it all.
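The isolation itself isn’t the hard part; in Playwright it’s roughly a projects layout like the one below (the directory names, baseURL, and auth file path are assumptions for illustration). The hard part was getting the agent to keep every suite consistent with it.

```typescript
// playwright.config.ts — a sketch of isolating suites into projects that share
// one login step. Paths and baseURL are placeholders, not my real layout.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: { baseURL: 'http://localhost:3000' },
  projects: [
    // Log in once, save the session, and let every other project depend on it.
    { name: 'setup', testMatch: /auth\.setup\.ts/ },
    {
      name: 'admin-users',
      testDir: './e2e/admin-users',
      dependencies: ['setup'],
      use: { storageState: 'playwright/.auth/admin.json' },
    },
    {
      name: 'admin-reports',
      testDir: './e2e/admin-reports',
      dependencies: ['setup'],
      use: { storageState: 'playwright/.auth/admin.json' },
    },
  ],
});
```

Each area of the admin site gets its own project, and they all depend on one login step instead of each suite rolling its own.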
I decided against continuing and tabled the entire E2E test suite for now. However, the experience made me wonder: are LLMs currently at some kind of disadvantage when it comes to E2E testing? The sheer chaos in the DESIGN of the test cases was what really caught me. Neither GPT nor Claude seemed to grasp the concept well enough to avoid problems running the full suite (running individual test cases, and sometimes individual suites, would work).
I wonder if this is a… context or scope problem? Can an agent actually grasp the “WHOLE” here, well enough to produce a comprehensive set of E2E test suites that can run together without issues? Can an LLM even grasp one portion of a complex web app (internal administrative sites, in this case) well enough to understand what it takes to cover an entire workflow, from logging in, through navigating to the desired functionality, using that functionality, and then handling both success and failure cases, all in one go? Or is this just a level of complexity that is beyond current LLMs?