Hey there! Part 2 of the “Better than Vibe Coding” series is live (finally lol, sorry for the delay!).
No theory this time - just the GitHub repo and a video where I start a real app project following the Agile AI Driven Development flow, showing how to get the most out of Cursor agents.
- Gemini 2.5 Pro (web) for free deep research → project brief
- PM AI generates the PRD
- Architect AI lays out the architecture & stories
- Demo app kickoff: automated 2‑host podcast pipeline
- Full prompts & agent configs: https://github.com/bmadcode/BMAD-METHOD (Breakthrough Method for Agile AI Driven Development)
Please check out the video to see how to use the prompts: https://youtu.be/1wQUio9TiIQ
Curious how you’d tweak the persona prompts or integrate additional checks (one rough way to script the hand-offs is sketched below)—let’s compare notes!
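For anyone who would rather script the hand-offs than paste outputs between chats, here is a minimal sketch of the brief → PRD → architecture chain. The persona file paths, output locations, `run_persona` helper, and model name are placeholders I made up, not the repo’s layout; in the video the hand-off happens inside Cursor agents rather than a script.

```python
"""Hypothetical script for the brief -> PRD -> architecture hand-off.

The file paths and model are placeholders, not the BMAD repo's layout; the
point is only that each stage's output becomes the next stage's context.
"""
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # needs OPENAI_API_KEY in the environment


def run_persona(persona_file: str, context: str, model: str = "gpt-4.1") -> str:
    """Run one persona prompt (PM, Architect, ...) over the previous artifact."""
    persona = Path(persona_file).read_text()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": context},
        ],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    brief = Path("docs/project-brief.md").read_text()  # from the deep-research step
    prd = run_persona("personas/pm.md", brief)         # PM persona -> PRD
    Path("docs/prd.md").write_text(prd)
    arch = run_persona("personas/architect.md", prd)   # Architect persona -> architecture & stories
    Path("docs/architecture.md").write_text(arch)
```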
Hey Brian –
Have you done any work on evaluating prompts using any kind of automated process?
I’ve started using a simple and pretty flawed system for figuring out how changes in prompts, rules, personas and modes affect the output quality of a product (rough scoring harness sketched after the list):
– use the same prompt across evals for a sample product, e.g. html->markdown converter (I like this particular one because it’s not possible to get to 100%, but there’s a massive range of how far you can get with it)
– run an eval changing one parameter each time, e.g. PRD vs no PRD, PRD with no persona vs PRD with persona, one PRD prompt vs another, one-shot vs task manager, gemini 2.5 vs claude 3.7 etc
– repeat x times (shamefully, x=3 until the tooling improves, which definitely results in anecdata but better than nothing)
– get a different model (I like o3 for this but o4-mini-high is working well too) to score the final outputs multiple times each
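For what it’s worth, here is a minimal sketch of how the scoring half of that loop could be automated. The `evals/<variant>/run_<n>/OUTPUT.md` layout, the rubric, and the judge model are my own assumptions, not anyone’s actual tooling; the build outputs are assumed to have been produced separately by whichever variant you’re testing.

```python
"""Hypothetical LLM-as-judge scorer for the A/B eval loop described above.

Assumed layout (made up for this sketch): build outputs already saved as
evals/<variant>/run_<n>/OUTPUT.md, scored via the OpenAI chat completions API.
"""
import re
import statistics
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # needs OPENAI_API_KEY in the environment

JUDGE_MODEL = "o3"      # judge model mentioned above; swap as needed
SCORES_PER_OUTPUT = 3   # score each output multiple times to smooth variance

RUBRIC = (
    "You are grading the output of an html->markdown converter project. "
    "Score it 0-100 for correctness, completeness, and code quality. "
    "Reply with the number only."
)


def judge(output_text: str) -> float:
    """Ask the judge model for a single numeric score."""
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": output_text},
        ],
    )
    match = re.search(r"\d+(\.\d+)?", resp.choices[0].message.content)
    return float(match.group()) if match else 0.0


def score_variant(variant_dir: Path) -> tuple[float, float]:
    """Mean and spread over all runs of one variant (e.g. 'prd_with_persona')."""
    scores = []
    for run_dir in sorted(variant_dir.glob("run_*")):
        text = (run_dir / "OUTPUT.md").read_text()
        scores += [judge(text) for _ in range(SCORES_PER_OUTPUT)]
    if not scores:  # variant folder with no runs yet
        return 0.0, 0.0
    return statistics.mean(scores), statistics.pstdev(scores)


if __name__ == "__main__":
    for variant in sorted(Path("evals").iterdir()):
        if variant.is_dir():
            mean, spread = score_variant(variant)
            print(f"{variant.name:30s} {mean:6.1f} ± {spread:.1f}")
```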
It’s tedious work for sure but the results have been eye-opening so far – I’m still compiling them but I can draw some early conclusions already, and I’ve noticed a big level-up in my own work by adopting the “winning” strategies.
If you’ve worked on anything like this I’d 100% love to get some hints on how to improve the eval process… would be great to see some kind of leaderboard eventually for different prompts, rules and modes.
This is a great systematic approach to A/B testing to see incrementally what works and what doesn’t. There are so many variables at play (type of project, scope, technology choices, prompt wording, and ever-evolving models) that it can be tough to quickly figure out the minimal set of lean rules, prompts and docs needed for consistent success.
Automating this testing somehow would really be a great idea!
Bmad, thanks so much for putting this out there! It works great! I am going to build a DevOps agent that will guide the user through getting a basic DevOps pipeline implemented. It will suggest platform-agnostic solutions so the same setup can go to AWS, Azure, GCP, DigitalOcean, or a VPS seamlessly. It will suggest that every PR have 1 reviewer. The basics, best practices. What do you think, Bloco?
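A rough sketch of what such a persona could look like, in the same style as the scripts above. The prompt text, the `devops_advice` helper, and the model name are all hypothetical, just to show one way to bake “platform-agnostic, 1 reviewer per PR” into the agent’s defaults:

```python
"""Hypothetical DevOps advisor persona, in the spirit of the other agent prompts.

Nothing here is from the BMAD repo; it only illustrates encoding the defaults
described above (platform-agnostic pipeline, 1 reviewer per PR) in a prompt.
"""
from openai import OpenAI  # pip install openai

client = OpenAI()  # needs OPENAI_API_KEY in the environment

DEVOPS_PERSONA = """\
You are a pragmatic DevOps advisor. Guide the user to a basic CI/CD pipeline.
Rules:
- Prefer platform-agnostic building blocks (containers, generic CI YAML) so the
  same setup can target AWS, Azure, GCP, DigitalOcean, or a plain VPS.
- Require branch protection with at least 1 reviewer on every PR.
- Start with the basics (build, test, lint, deploy) and add more only when asked.
"""


def devops_advice(project_summary: str, model: str = "gpt-4.1") -> str:
    """Ask the persona for a pipeline proposal tailored to the project."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": DEVOPS_PERSONA},
            {"role": "user", "content": project_summary},
        ],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    print(devops_advice("An automated 2-host podcast pipeline written in Python."))
```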
Thanks, awesome idea! Please feel free to PR it against the latest version (V2) of the repo!