How are people handling context across different AI coding tools?

This is exactly the answer I was hoping for in this thread: ugly practical details, not polished theory.

A HANDOFF document that the outgoing agent writes is a smart move. It solves two of the problems we discussed above at once. It gets rewritten on every handoff, so it always reflects “today” rather than “what I thought on Tuesday,” and stale context never builds up. And it’s plain text, the only language all tools understand. Walled-garden memory in each tool is basically the root of all this pain.

Your two “faceplanting first” lessons are the most valuable:

  • A small always-on brief, separate from the big handoff. It’s basically the same role as AGENTS.md or Memories: 3 lines of “who I am and what I’m working on,” so the agent doesn’t start from scratch before it even opens the main document.
  • Findability after compaction. This is a real pain point. It helps to keep the handoff at a fixed path and link to it directly from AGENTS.md or from the always-on brief, so the agent doesn’t have to remember “wait, where did I put that.”

I’m curious: do you write this HANDOFF by hand, or do you ask the agent to generate it at the end of the session using a template? And has it held up for a couple months of active development without degrading, or do you still need to clean it up manually from time to time?

I agree with this completely. In memory-journal-mcp, you can customize your briefing as you see fit. For example:

config file:

{
  "additionalTeamDbs": {
    "team-b": "C:\\Users\\chris\\Desktop\\memory-journal-mcp\\data\\memory-journal-team-b.db"
  },
  "allowedIoRoots": [
    "C:/Users/chris/Desktop/memory-journal-mcp/test-server/standard",
    "C:/Users/chris/Desktop/memory-journal-mcp/test-server/codemode"
  ],
  "auditConfig": {
    "enabled": true,
    "logPath": "C:/Users/chris/Desktop/memory-journal-mcp/logs/mcp-audit.jsonl",
    "redact": false,
    "auditReads": false,
    "maxSizeBytes": 10485760
  },
  "autoRebuildIndex": true,
  "briefingCopilot": true,
  "briefingEntries": 4,
  "briefingIncludeTeam": true,
  "briefingIssues": 1,
  "briefingMilestones": 1,
  "briefingPrs": 1,
  "briefingPrStatus": true,
  "briefingSummaries": 1,
  "briefingWorkflows": 1,
  "briefingWorkflowStatus": true,
  "dbPath": "C:/Users/chris/Desktop/memory-journal-mcp/data/memory_journal.db",
  "embeddingModel": "Xenova/all-MiniLM-L6-v2",
  "instructionLevel": "detailed",
  "authToken": "",
  "oauthEnabled": false,
  "persistMetrics": true,
  "pruneDryRun": false,
  "pruneExcludeTags": ["important", "reference", "keep"],
  "pruneImportanceThreshold": 0.3,
  "rulesFile": "C:/Users/chris/.gemini/GEMINI.md",
  "skillsDir": "C:/Users/chris/Desktop/memory-journal-mcp/skills",
  "teamAdmin": "chris",
  "teamDbPath": "C:/Users/chris/Desktop/memory-journal-mcp/data/memory-journal-team.db",
  "teamDefaultRole": "viewer",
  "toolFilter": "codemode",
  "codemodeInternalFullAccess": true,
  "workflowSummary": "/bump-deploy: version bump + PR deploy | /update-deps: npm+Docker dependency updates | /security-audit: security scan | /perf-audit: performance audit | /audit-code-quality: code quality audit | /full-audit: unified quality+perf+security audit",
  "projectRegistry": {
    "adamic": {
      "path": "C:\\Users\\chris\\Desktop\\adamic",
      "project_number": 16
    },
    "adamic-blog": {
      "path": "C:\\Users\\chris\\Desktop\\adamic-blog",
      "project_number": null
    },
    "db-mcp": {
      "path": "C:\\Users\\chris\\Desktop\\db-mcp",
      "project_number": 15
    },
    "memory-journal-mcp": {
      "path": "C:\\Users\\chris\\Desktop\\memory-journal-mcp",
      "project_number": 5
    },
    "mysql-mcp": {
      "path": "C:\\Users\\chris\\Desktop\\mysql-mcp",
      "project_number": 14
    },
    "postgres-mcp": {
      "path": "C:\\Users\\chris\\Desktop\\postgres-mcp",
      "project_number": 13
    }
  },
  "pruneOlderThanDays": 30,
  "teamAuthor": "chris"
}

Hey, thanks for writing it up in more detail, @neverinfamous. This is already way more useful than just a link.

Basically two things really hit the point of the thread. Importance-based auto-pruning is a direct answer to the failure mode I mentioned above as the third most common. Old memories quietly go stale and start giving wrong recommendations. And the session briefing at the start of a session is also spot on. It’s that same attempt to avoid rebuilding context by hand every time.

What I’m really curious about in this discussion is this. You said you use this in your own projects. Is it a single-workflow setup, or are you actually running the same memory cross-tool (Cursor plus Claude Code plus something else), and has it survived a couple months of active development? That kind of real-world experience is what’s still missing most in this thread.

And one request, the same one I already asked others in this thread. Let’s keep this thread about approaches in general, not about a specific product. If you want to share more details about Memory Journal, please start a separate thread in Discussions. It’ll be easier to find and people can discuss it properly there. Leaving a GitHub link nearby is fine. But here, I’m mainly looking for the cross-tool experience, if you have it.

Yeah, the auto-pruning is customizable, so you can just adjust the significance and time-period numbers to prune more or less as desired. Yes, I use it in Cursor, Antigravity, Codex, and GitHub Copilot (which I actually use more as an adversarial agent via command line). I even tried once to set it up inside the repository but, at that time, it was kind of a dead end for my purposes since I don’t use an agent inside the repository, and I haven’t revisited it yet to see if there is anything useful that could be done with it.

I’ve actually been using it since MCP was first announced. I had already been trying to set it up before that point, with inconsistent results since Anthropic, which is all I used at that time, would block it sometimes. At first it was just my internal tool, but I eventually decided to throw it out there in case it is useful for other people. I’ve said it before: I believe this approach will be obsolete soon. Solutions are being implemented at the model level which I think will soon make all memory solutions unneeded. Until then, I continue to work on it some, with hesitation.

To be honest, I mostly use the session summary prompt combined with the automatic briefing. As far as features go, I think that’s most of the benefit, as was stated above by Patdolitse. But I do adjust how many thread summaries are included in the briefing based on the project, need, and complexity of the objective. It’s very token efficient regardless, but every bit helps, eh? I think the Hush features are very promising, but I don’t work in a team and I have gotten no feedback on them, so I could easily be mistaken.

Thanks, @neverinfamous, this is exactly the kind of cross-tool experience the thread was missing. Cursor + Antigravity + Codex + Copilot sharing one memory, running since the MCP announcement, is not a single-workflow experiment. It is real practice over many months. Also interesting that you run Copilot as an adversarial agent via CLI, that is not a common pattern.

Two points stood out the most.

First, after all the features, the main value for you boiled down to a session summary prompt plus automatic briefing. That is exactly what @Patdolitse arrived at with their HANDOFF: the outgoing agent writes a short state summary, and the incoming one does not start from zero. It looks like this is the stable core, and everything else (graphs, pruning, team features) is a layer on top that helps but is not fundamental.

Second, your prediction that this approach will become outdated soon because of model-level solutions. I partly agree. The better models are at holding context, the less you need external memory inside a single tool. But the cross-tool problem will not go away while users use multiple vendors at once. Each one has its own walled garden, and a shared plain-text layer like a briefing or handoff will still be needed at the boundary.

The fact that the in-repository setup turned into a dead end is also a useful signal. It matches the thread’s overall conclusion: it is better to keep context as a thin layer on top of the code, rather than trying to push the agent inside the repository.

And thanks for writing up your experience this time instead of just dropping a link. It is much more useful for users who will read the thread later.

Ahh, I should have stated my expectation for obsolescence a little more clearly. It’s not that I expect the model itself to necessarily solve this issue, but that the companies making the models will surround the agents with memory tools at that higher level, which I think will ultimately be much more efficient. You make a strong point about the different platforms. I guess I just haven’t ever even had to worry about that. I suppose the frontier companies might standardize it. Didn’t they standardize some (minimal) level of memory already so it is portable, a memory.md file or something? I forget.

It is true what you say about the session summary and simplicity. Lighter is probably better, also, to prevent failures from hallucinations and so on. I’ve tried to keep things light and customizable. Of course, the resources and prompts don’t really cause any problems and maybe have zero downside, so some of the weight that is there is harmless. But the contributions of some parts of the briefing have contextual value that might be hard to appreciate. For instance, the relationship graph data is included in the briefing (as an option), and GitHub information like issues, PRs, and milestones is included (as a customizable option). That may help me more than I realize or it may not. It may help others more than me and maybe not.

The idea is a sort of dashboard for controlling agents and inter-agent communication through Hush, to eliminate inter-team chats and turn necessary communication into structured, queryable, and useful data. I think elicitation will help a lot in that direction. Search is also quite useful. Gotcha on the link thing.

“Also interesting that you run Copilot as an adversarial agent via CLI, that is not a common pattern.”

Sorry, I forgot to respond to this and I think it’s sort of important. I have found using an adversarial agent in the planning phase to be extremely helpful. I built a skill to do so, included in Memory Journal. I also use it for adversarial performance and security reviews. It’s effective even when using the same model, but if you use a superior model as the adversarial reviewer, it typically does a superb job.

Coming back to my own thread, because the last few posts landed exactly on what I have been running for months. @deanrie asked two questions above (hand-written vs generated, and does it hold up over time) — here are my answers from a second long-running setup, since mine matches the HANDOFF pattern @Patdolitse described.

Hand-written or generated? Generated, always. The outgoing agent writes the HANDOFF itself from a fixed template at the end of every working session — same fields every time: what I did, what changed, what I verified, what I did NOT do, next step, anything the next agent must not touch. The template is the whole trick. Without it every agent invents its own format and the next one can’t parse it. With it, even agents from different vendors write handoffs that read the same. I never write it by hand, that defeats the point — the agent that did the work knows the state better than me.

One rule that made the biggest difference: the agent is only allowed to report things it can point to an actual result for in that session. If it didn’t verify something, it has to write “not verified” instead. Before that rule I sometimes got optimistic handoffs about work that wasn’t actually finished.
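For anyone who wants to try the fixed template plus the “not verified” rule, here is a minimal Python sketch. It is not the poster’s actual template: the field labels follow the list in the post, but the names `Claim`, `render_handoff`, and the exact line format are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

NOT_VERIFIED = "not verified"

# Field order follows the list in the post; keys and labels are assumptions.
FIELDS = [
    ("did", "What I did"),
    ("changed", "What changed"),
    ("verified", "What I verified"),
    ("not_done", "What I did NOT do"),
    ("next_step", "Next step"),
    ("do_not_touch", "Do not touch"),
]


@dataclass
class Claim:
    """One statement in a handoff, plus the result that backs it (if any)."""
    text: str
    evidence: str = ""  # e.g. a command and its outcome from this session


def render_claim(claim: Claim) -> str:
    # A claim with no evidence is never reported as settled; it gets marked.
    if claim.evidence.strip():
        return f"- {claim.text} (evidence: {claim.evidence})"
    return f"- {claim.text} ({NOT_VERIFIED})"


def render_handoff(agent: str, day: date, sections: dict[str, list[Claim]]) -> str:
    """Render every field in a fixed order so all agents emit the same shape."""
    lines = [f"HANDOFF {day.isoformat()} by {agent}"]
    for key, label in FIELDS:
        lines.append(f"{label}:")
        claims = sections.get(key, [])
        if claims:
            lines.extend(render_claim(c) for c in claims)
        else:
            lines.append("- none")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_handoff(
        "codex",
        date(2025, 1, 2),
        {"did": [Claim("Updated listings", "script exit 0"), Claim("Synced inventory")]},
    ))
```

Empty fields are still printed as “none”, so the next agent can tell a skipped section from an empty one.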

Has it held up for months? Yes and no, and the “no” part is the useful bit. The discipline held up — incoming agents genuinely don’t start from zero, across Claude, Codex CLI and a third sidecar agent sharing the same plain-text layer. And this is not for a code repo, it runs a real e-commerce business (ads audits, inventory, listings), so the state files get hit daily by multiple agents.

What degraded was the files themselves. Everything appends, nothing leaves. After ~6 weeks my “current state” file was nearly 500KB and the handoff log over 700KB. Agents were burning context wading through stale entries before doing any work, and I even got sync conflicts from multiple agents editing one huge file. This week I added rotation: the hot file keeps only the last couple of entries per agent/lane, everything older moves to monthly archive files. Current state went 492KB → 22KB. Nothing deleted, just moved. In hindsight the rotation should exist from day one — append-only plain text is great until it isn’t.
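The rotation described above could look roughly like this in Python. This is a sketch under stated assumptions, not the poster’s script: it assumes the hot file is JSON Lines in append order, that each entry has hypothetical `agent` and `ts` (ISO date) keys, and that archives are named by month.

```python
import json
from collections import defaultdict
from pathlib import Path


def rotate(hot_file: Path, archive_dir: Path, keep_per_agent: int = 2) -> int:
    """Move all but the newest entries per agent into monthly archive files.

    Returns the number of entries moved. Nothing is deleted.
    """
    entries = [
        json.loads(line)
        for line in hot_file.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]

    # Walk from newest to oldest (the file is assumed to be in append order).
    seen = defaultdict(int)
    keep, move = [], []
    for entry in reversed(entries):
        seen[entry["agent"]] += 1
        target = keep if seen[entry["agent"]] <= keep_per_agent else move
        target.append(entry)
    keep.reverse()
    move.reverse()

    # Append old entries to the archive for their month, e.g. 2025-01.jsonl.
    archive_dir.mkdir(parents=True, exist_ok=True)
    for entry in move:
        month = entry["ts"][:7]
        with (archive_dir / f"{month}.jsonl").open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    # Rewrite the hot file only after the archive writes have succeeded.
    hot_file.write_text(
        "".join(json.dumps(e) + "\n" for e in keep), encoding="utf-8"
    )
    return len(move)
```

Writing the archive before shrinking the hot file means a crash midway can duplicate an entry but should not lose one. With several agents editing at once, some form of locking would still be needed, and that is outside this sketch.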

So my updated take on the “stable core”: it’s three things, not two. The handoff, the small always-on brief, and a rotation rule so the first two stay small. Anyone running this for more than a month will hit the same wall.

@neverinfamous, thanks for writing this up. Out of everything you listed, two points really hit the core problem in this thread:

First, importance-based auto-pruning is a direct answer to the failure mode we talked about above: old memories quietly go stale (“we use Jest” survives a migration to Vitest and starts messing up recommendations). Automatically cleaning low-value entries while keeping architectural decisions is the right direction. The part that usually breaks is how the system decides what is “low significance”. If it’s a heuristic based on how often something gets referenced, an architectural decision that doesn’t come up often can get cut. I’m curious how you handle that.

Second is the session briefing at the start of a session. That solves the cold-start problem where the agent has to rebuild context from scratch every time. The key risk is that the briefing itself goes stale for the same reasons as AGENTS.md if it’s not tied to the real state of the repo (git, issues, code) and relies only on manual notes.

And the bigger question is still open: does anyone have a cross-tool memory setup that’s actually survived a couple months of active development without constant rewrites? So far that’s the scarcest part in real-world experience.

@Pat2, this is exactly the kind of answer the thread was started for. Real cross-tool experience over months, not theory. And the fact it’s not a code repo but a live e-commerce setup with daily load from multiple agents makes it even more valuable.

Three things worth calling out for anyone who reads this thread later:

Generated, not hand-written, but template-based is the core. You put it well: without a fixed template, every agent invents its own format and the next one can’t parse it. With a template, even agents from different vendors write a handoff that’s read the same way. That’s exactly the shared plain-text language the thread landed on earlier, but the template turns it from just text into something portable across tools.

The “not verified” rule is the most underrated point. Forcing an agent to write “not verified” instead of an optimistic report on unfinished work is a cheap way to prevent hallucinations in the handoff. Without it, the state file accumulates confident but false statements, and the next agent builds on sand. This applies to any memory approach, not just HANDOFF.

Rotation is the third element. This is the most useful “no” from your experience. Append-only plain text is great until the file grows too large. Then agents burn context fighting through stale entries, plus you get sync conflicts with parallel edits. 492 KB to 22 KB without deleting anything, just moving old content to an archive, is a clean solution. It’s the same problem as auto-pruning in the memory systems earlier in the thread, just solved at the plain-text file level.

So the thread’s conclusion seems to be this: a stable core isn’t two elements, it’s three. HANDOFF plus a small always-on brief plus a rotation rule so the first two stay compact. And like you said, rotation should be planned from day one, not after you hit a wall in week six.

Thanks for coming back and writing it all out. This is the best part of the thread.

That’s a great question, and you’ve hit on the exact reason why simple time-based or pure recency-based pruning strategies fail in long-term contexts. We designed memory-journal-mcp specifically to solve this problem.

1. How we score “Significance” to protect architectural decisions

Our auto-pruning doesn’t just look at how often something is referenced. We compute an importanceScore (0.0 to 1.0) using an inline SQL expression (so it scales without pulling the whole DB into memory) that weighs four distinct components:

  1. Significance Type (Weight: 30%): If an entry is explicitly tagged as an architectural decision, milestone, or security note, it automatically gets a flat 0.30 points out of the gate.
  2. Relationships (Weight: 35%): We count the edges in the graph. If other entries reference it, it gains points (maxing out at 5 references).
  3. Causal Chains (Weight: 20%): High-value edge types (caused, resolved, blocked_by) get an extra multiplier (maxing out at 3 causal links).
  4. Recency (Weight: 15%): A linear decay that slowly drops to zero over a 90-day window.

The default PRUNE_IMPORTANCE_THRESHOLD is 0.15. Because an architectural decision gets an automatic 0.30 from its significance type, it is mathematically guaranteed to survive auto-pruning indefinitely, even if it is 6 months old and has zero references! Conversely, a random “we use Jest” note with no significance tag and no references will decay over 90 days and fall below the 0.15 threshold, safely dropping off the radar.
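To make the arithmetic above concrete, here is a small Python sketch of the four components. The weights, caps, and 90-day window come from the post. Everything else is an assumption: the post says the real score is an inline SQL expression, so the additive combination, the strict `<` comparison, and the names `Entry`, `importance_score`, and `should_prune` are illustrative only.

```python
from dataclasses import dataclass

# Weights, caps, and window as stated in the post.
W_SIGNIFICANCE, W_RELATIONS, W_CAUSAL, W_RECENCY = 0.30, 0.35, 0.20, 0.15
MAX_REFERENCES, MAX_CAUSAL_LINKS, DECAY_DAYS = 5, 3, 90
PRUNE_IMPORTANCE_THRESHOLD = 0.15  # default quoted in the post


@dataclass
class Entry:
    has_significance_type: bool  # tagged as decision, milestone, security note
    references: int              # incoming graph edges
    causal_links: int            # caused / resolved / blocked_by edges
    age_days: float


def importance_score(e: Entry) -> float:
    """Sum four weighted components (assumed additive), each capped at its weight."""
    significance = W_SIGNIFICANCE if e.has_significance_type else 0.0
    relations = W_RELATIONS * min(e.references, MAX_REFERENCES) / MAX_REFERENCES
    causal = W_CAUSAL * min(e.causal_links, MAX_CAUSAL_LINKS) / MAX_CAUSAL_LINKS
    recency = W_RECENCY * max(0.0, 1.0 - e.age_days / DECAY_DAYS)  # linear decay
    return significance + relations + causal + recency


def should_prune(e: Entry, threshold: float = PRUNE_IMPORTANCE_THRESHOLD) -> bool:
    # Strict comparison is an assumption; the real boundary rule is not stated.
    return importance_score(e) < threshold


if __name__ == "__main__":
    old_decision = Entry(True, references=0, causal_links=0, age_days=180)
    stale_note = Entry(False, references=0, causal_links=0, age_days=90)
    print(importance_score(old_decision), should_prune(old_decision))
    print(importance_score(stale_note), should_prune(stale_note))
```

Under this additive reading, the survival guarantee holds only while the threshold stays below 0.30. With a threshold of 0.3, as in the config shared earlier in the thread, a tagged but unreferenced old entry sits on the boundary, so whether it survives would depend on how the real comparison is written.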

2. Keeping the Session Briefing from going stale

You’re completely right about the risk of AGENTS.md style briefings going stale. We bypass this entirely with a deep, live GitHub integration.

When an agent requests memory://briefing at the start of a session, we don’t just hand it static notes. We dynamically query the real-time state of the repository and inject it into the prompt. The briefing automatically includes:

  • Live CI/CD status (passing/failing/pending)
  • Open Issue and PR counts (with titles and states)
  • Active Milestones (with calculated completion percentages based on closed vs open issues)
  • Local Git status (modified/untracked files)
  • Copilot review summaries (approved/changes requested) across recent PRs

Because this is all highly customizable (via flags like --briefing-prs, --briefing-copilot, --briefing-workflows), you can tune exactly how much token budget you spend on context vs. history. The agent never reads stale notes about “we should fix X” because the briefing actively shows whether Issue X is open or closed right now.

My theme in development for some time has been customizing the context delivered in the briefing. The idea is that what you need varies depending on project, complexity, model used, etc. It’s all optional. You can include 1 briefing or 50, though I would suggest 10 as a max for at least the majority of cases. You can include one GitHub issue, milestone, or PR, or 10; Copilot review summaries or not; CI/CD status or not; the relationship graph and significance statistics or not; and so on. This also anticipates the possibility of working with cheaper vs more expensive agents, which will surely change over time. Prices may go up or down, which might impact how lean you try to get by.

3. Cross-Platform Longevity

To your bigger point about surviving months of development: I’ve actually been using this exact cross-tool memory setup for over a year. I originally started building it back when MCP was first announced on Anthropic’s Claude Desktop, and I brought it with me when I switched over to Cursor. Because the backend is a persistent SQLite database with explicit relationship mapping, vector embeddings, and the lifecycle-pruning mechanisms described above, it has survived seamlessly without requiring any “constant rewrites” of my agent’s context.

You can control your content locally in a highly secure system, even with teams: team communication is also secure, version controlled, and queryable via GitHub. The concept is no more Slack threads, just structured flagging. If your work revolves around GitHub, you should love it. As things get more agentic, agents can grow into it. Hope this helps shed some light on the setup! Let me know if you want to dive deeper into the code.

Ahh, the models are starting to add more advanced memory systems. This company is taking a path pretty similar to memory-journal-mcp, though the pruning is done manually by agents and there are some architectural differences. I think it’s actually a bit more crude but probably very effective due to being tightly integrated. I would think the companies will move away from SQLite to PostgreSQL since privacy isn’t really their biggest concern and they have to deal with numbers. But they are finally starting to take memory seriously. On the other hand, perhaps governments are going to simply shut down all releases of more powerful models. We may have hit a wall where the fear of what they can do will prevent humanity from using them. They may be confined to governments and other elites.

Picking up where I left off (#49) — and @neverinfamous, the significance scoring is genuinely good. But I want to name the axis it doesn’t cover, because it’s the one that bit me hardest.

Importance/recency pruning answers *what’s worth keeping*. It doesn’t answer *what was actually true*. After a few tool hops, my store had things the agent had inferred and never got checked sitting right next to things I’d explicitly confirmed — same shape, same weight. The failure mode isn’t forgetting; it’s a confident, high-“importance” inference getting re-injected into a fresh session as if it were settled fact, and the next agent quietly building on it.

So the ugly practical detail for me ended up being: tag every item with where it came from and whether it’s *confirmed* vs the agent’s own *guess*, keep that local, and only let confirmed stuff get auto-promoted. That’s basically what I’ve been building piia-engram around. Pruning keeps the store small; provenance keeps it honest — and the second part is the one nobody’s really selling yet.

Hey, this is a strong addition, and I think it closes a gap we never fully named in the thread.

Importance and recency pruning answers what’s worth storing. Provenance answers a different question, what’s even true. These are different axes, and it’s easy to mix them up because they both live on the same record. High importance on a conclusion the agent came up with on its own and nobody verified is the worst case. It survives pruning because it looks important, then it carries over into a new session as a settled fact.

What’s interesting is that this is a direct continuation of the “not verified” rule that @Pat2 landed on in #59. There it was a write-time discipline: the agent must write “not verified” instead of an optimistic report. You’re proposing the same idea, but as persistent metadata: a source tag plus confirmed or guess that stays with the record and decides whether it can be auto-promoted. The write-time rule catches the creation moment; the provenance tag protects against a guess quietly gaining status over time.

A few practical questions, because the devil is in the details:

  • Who sets confirmed vs guess: the agent as it goes, or a separate confirmation step? Self-tagging runs into the same issue as self-rating in a handoff: the agent tends to treat its own reasoning as verified.
  • What counts as confirmed for you? A test or command result, an explicit human confirmation, or both with different weight?
  • Does the provenance tag survive jumps between tools as plain text, or does it live in a separate store that not all tools read? This is the same walled garden trap as with memory.

On piia-engram, that sounds exactly like the part nobody is selling yet. If you keep developing it, start a separate thread for it in Discussions; I’d be happy to dig into the details. And bring the practical provenance takeaways back here. That’s the most valuable and least talked-through part of the thread right now.

Thanks Dean, and you framed it cleaner than I did. Write-time “not verified” and a persistent provenance tag really are the same discipline at two moments. One catches the lie as it’s born, the other stops it gaining status later.

On your three questions, since that’s where it gets real.

Who sets confirmed vs guess. Not the agent grading its own homework, that’s exactly the self-rating trap you called out. Default in piia-engram is that everything an agent writes lands as unverified. It only gets promoted two ways. Either I sign off on it, or a hard signal does, like a passing test or a command result. The honest hard part isn’t the rule, it’s enforcing “an agent can’t quietly promote itself” cleanly across trust levels. That’s the bit I’m still tightening.

What counts as confirmed. Both, with different weight. A test or tool result is ground truth, that’s the strongest. An explicit human ok is strong too. Agent reasoning stays a guess until one of those touches it. I’ve been pushing the same evidence-typing idea over in the Codex hooks proposal thread, basically a small enum for where a fact came from, because a test result and a session guess should never carry the same weight.

Does the tag survive across tools. This is the part I cared about most. It’s not plain text I’m hoping each tool re-reads, and it’s not per-tool memory. It’s one local store, and Claude Code, Codex and Cursor all read and write the same records through the same MCP server, tag included. The tag travels because there’s one store, not four. The honest limit is it only covers tools that actually read the store. Anything that keeps its memory walled off stays outside it, which is sort of the whole reason I went local-first and cross-tool to begin with.

Takeaway for the thread: tag origin plus confirmed or guess, keep it in one local store every tool shares, and only let confirmed auto-promote. Pruning keeps the store small, provenance keeps it honest, and those are two different jobs.
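The takeaway above can be sketched in a few lines of Python. This is not piia-engram’s API. The enum values, `Record`, `promote`, and `auto_promotable` are hypothetical names chosen to show the rule that everything lands unverified and only a human sign-off or a hard signal can promote it.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    """Where a fact came from (hypothetical values, weakest first)."""
    AGENT_INFERENCE = "agent_inference"  # the agent's own reasoning
    HUMAN_CONFIRMED = "human_confirmed"  # explicit human sign-off
    TOOL_RESULT = "tool_result"          # passing test or command result


# Only these evidence types may move a record out of "unverified".
PROMOTING = {Evidence.HUMAN_CONFIRMED, Evidence.TOOL_RESULT}


@dataclass(frozen=True)
class Record:
    text: str
    origin: str  # which tool or agent wrote it
    evidence: Evidence = Evidence.AGENT_INFERENCE  # default: a guess

    @property
    def confirmed(self) -> bool:
        return self.evidence in PROMOTING


def promote(record: Record, evidence: Evidence) -> Record:
    """Return a confirmed copy; agent reasoning alone can never promote."""
    if evidence not in PROMOTING:
        raise ValueError("agent inference cannot confirm a record")
    return Record(record.text, record.origin, evidence)


def auto_promotable(records: list[Record]) -> list[Record]:
    """Only confirmed records are eligible for injection into a new session."""
    return [r for r in records if r.confirmed]


if __name__ == "__main__":
    guess = Record("The flaky test is caused by the cache", origin="cursor")
    fact = promote(Record("Migration 42 applied", origin="codex"), Evidence.TOOL_RESULT)
    print([r.text for r in auto_promotable([guess, fact])])
```

The sketch leaves out the part the post calls hard: making sure the caller of `promote` really is a human or a test runner and not the agent itself.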

I’ll take you up on the separate thread and start one in Discussions for the details. Will link it back here.

Dean’s framing is right, and I’ll start by agreeing with the part that argues against my own tool: AGENTS.md plus real scripts as the source of truth is correct, and anything that duplicates what’s already in code or config will go stale. Build commands, paths, and config belong in the repo, not in a memory layer. (Disclosure up front: I build one of these, Vilix, so weight this accordingly.)

Where a memory layer actually earns its place is the stuff AGENTS.md was never meant to hold: conversation and decision history. Why you picked Postgres over Mongo, the approach you tried in Cursor last week and threw out, the edge case you reasoned through in Claude. That isn’t in code, so it doesn’t go stale the way a renamed script does, and it’s exactly what evaporates when you switch tools.

On your real question, Dean (“I haven’t seen a setup I’d call production grade across multiple tools”), I think you diagnosed why yourself: it only works “if both are disciplined about when to write and when to retrieve, and in practice that isn’t there yet.” Agreed. So the fix can’t be discipline, because nobody sustains it. It has to be automatic: retrieve before every reply, write after every reply, with no “remember to log” step for the human or the agent. That’s the bet Vilix makes. A read fires before the model answers, a write fires after, in Claude Code, Cursor, Codex, Windsurf, any MCP client. The discipline becomes the default instead of a habit you have to keep.

Your #3 failure (old memories quietly poisoning recommendations, “we use Jest” surviving a move to Vitest) is the genuinely hard one, and I won’t claim it’s solved. The lever that helps is recency: every item is timestamped, so when two facts conflict the newest wins and the stale belief gets out-ranked instead of trusted forever. Not perfect, but it keeps a six-month-old decision from passing itself off as current.
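A “newest wins” rule like the one described above might be sketched as follows in Python. This is not Vilix’s implementation. It assumes each fact carries a hypothetical `topic` key that identifies which facts conflict, and detecting that is the hard part this sketch skips.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Fact:
    topic: str            # hypothetical key shared by conflicting facts
    value: str
    recorded_at: datetime


def resolve_conflicts(facts: list[Fact]) -> dict[str, Fact]:
    """Keep only the most recently recorded fact per topic."""
    newest: dict[str, Fact] = {}
    for fact in facts:
        current = newest.get(fact.topic)
        if current is None or fact.recorded_at > current.recorded_at:
            newest[fact.topic] = fact
    return newest


if __name__ == "__main__":
    facts = [
        Fact("test-runner", "we use Jest", datetime(2024, 1, 10)),
        Fact("test-runner", "we use Vitest", datetime(2024, 7, 2)),
    ]
    print(resolve_conflicts(facts)["test-runner"].value)
```

Recency alone cannot tell a newer guess from an older confirmed fact, which is why the provenance discussion earlier in the thread treats that as a separate axis.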

If you want to poke at it: vilix.ai. And honestly, if you’ve since found a cross-tool setup you’d call production grade, I’d still like to hear it, this is unsolved enough that I’m learning from how people rig it together.

I haven’t encountered this problem. I create session summaries at the end of threads so by then I know if I have been lied to. If I have been lied to, I address it before closing the thread and creating the session summary. If I didn’t know I was lied to, I don’t see what could be done about it. Any tagging would be based on the same lies. On the other hand, if any information in the database wasn’t true, I would think it would be very likely to degrade anyway and soon be pruned. I created an optimization skill for the journal, however, which has the agent go through and optimize tags, relationships, prune any noise, etc. This is more effective with tight pruning of course rather than tens of thousands of entries or something, which is fine since the most important entries are preserved. I keep the pruning tight to preserve only the most important entries. The next version of memory-journal-mcp is in final testing now and it’s a big release. But there is one upgrade I am making in the next version that is relevant. I plan to automate the optimization on startup of the server so it is fully automatic, which results in higher significance and relationship scoring.

Journal optimization skill (new version release will be a bit improved):