Spending tokens to repair UTF-16 that should have been UTF-8

Three weeks ago Cursor began producing corrupted UTF-16 encoded files when they should have been UTF-8. The thrashing from UTF-16 to UTF-8 and back is costing time, money and corrupting code, scripts and documentation. I stated a notebook to capture the issues and fixes.

Cursor blames Windows, PowerShell, and the Cursor writer app while asserting that it cannot ensure all documents are produced as UTF-8, even when there are mandatory rules to do so.

Here is the notebook for everyone reference and sharing with your Agent:


description: UTF-8 only for all text sources — never UTF-16; agent write and validation workflow. Incident log — Scripts/Diagnostic Bundle/Documentation/FILE_ENCODING_VIOLATIONS_NOTEBOOK.md

globs: **/*.{mq5,mqh,ps1,py,md,mdc,txt,csv,ini,set}

alwaysApply: true

-–

# UTF-8 only (never UTF-16)

**Incident log (timestamped sessions, violations, fixes):** `Scripts/Diagnostic Bundle/Documentation/FILE_ENCODING_VIOLATIONS_NOTEBOOK.md` (cross-reference: `.cursorrules` → **CRITICAL: File Encoding Requirements**; fleet layout + binary parity: `.cursor/rules/fleet-tester-deploy-consistency.mdc`).

- **MUST NOT** save project text as **UTF-16** (LE/BE, with or without BOM). UTF-16 breaks Python (`SyntaxError: source code cannot contain null bytes`), PowerShell, linters, AWS SSM JSON parameters, and git diffs.

- **MUST** save as **UTF-8 without BOM** for scripts (`.ps1`, `.py`, `.sh`, `.bat`), Cursor rules (`.mdc`), Markdown, and JSON/YAML where tooling is sensitive. UTF-8 with BOM (`EF BB BF`) is tolerated but **normalize** strips BOM to one standard when scripts are touched.

- **Strategy Tester launch `.ini` and Expert parameter `.set` files** are **text** and **MUST** be UTF-8 (same UTF-16 prohibition). If they are **git-tracked**, `validate_repo_sources_utf8.py` enforces this (`.ini` and `.set` are in the full-repo text suffix list). On Fleet hosts, write them with UTF-8-safe paths only (see **FILE_ENCODING_VIOLATIONS_NOTEBOOK.md** for incident patterns).

- **EA `.ex5` and MetaTrader `*.exe`** are **binaries** — do not run text encoding conversion on them. **Consistency across Fleets** means **same deployed build / bytes** (see `.cursor/rules/fleet-tester-deploy-consistency.mdc`).

- Default to **ASCII** in source unless a clear requirement exists for non-ASCII characters.

## Cursor / VS Code

- **This repo commits** `.vscode/settings.json` with `files.encoding` = `utf8`, `files.autoGuessEncoding` = `false`, and the same for `[powershell]`, `[python]`, `[markdown]`. Keep it when editing so saves stay UTF-8 and the editor does not “guess” UTF-16 from BOM-less legacy files.

- **User settings** (`%APPDATA%\Cursor\User\settings.json`): if the global default is not UTF-8, align it with the workspace or UTF-16 saves can still occur outside this folder.

## Cursor hooks and git (mandatory)

- **Hooks (repo):** `.cursor/hooks.json` invokes `.cursor/hooks/Enforce-RepoTextUtf8.ps1` on `afterFileEdit`, `afterTabFileEdit`, and **every** `postToolUse` (no matcher — Cursor tool JSON shape varies; the script **recursively** finds path strings for `.ps1`/`.mdc`/`.py`/etc. under the repo and normalizes UTF-16 to UTF-8 via `Scripts\Robust Automated Processes\PowerShell\_EncodingValidation.ps1`). After editing hook files, **reload the Cursor window** so hooks reload.

- **Git (exhaustive — not only `.py`/`.ps1` EA paths):** `core.hooksPath=.githooks`. **`pre-commit`** runs `validate_repo_sources_utf8.py --git-staged` against **every staged text-like file** (suffixes/dotfiles matching the validator’s full-repo rules: `.md`, `.mdc`, `.json`, `.yml`, …). **`pre-push`** and **CI** run **`–full-repo`** on **all tracked** text-like files. Invalid UTF-8 / UTF-16 **fails the hook/build**. The only deliberate bypass is `git commit --no-verify` (avoid; use only in emergencies and fix encoding before the next commit).

- **Windows PowerShell defaults (stop UTF-16 at the source):** run `Scripts\Robust Automated Processes\PowerShell\Install-RepoGitHooks.ps1` once per clone — it also runs **`Install-RepoPowerShellUtf8Defaults.ps1`**, which adds **`CurrentUserAllHosts`** dot-source of **`Initialize-Utf8Session.ps1`** (sets `$PSDefaultParameterValues` so `Out-File` / `Export-Csv` / etc. use UTF-8 on Windows PowerShell 5.1). Use **`-SkipPowerShellUtf8ProfileDefaults`** only if you must not touch profiles. Open a **new terminal** after install.

## Preserving UTF-8 configuration (Cursor updates, sync, agents)

- **Product updates:** Cursor/VS Code **do not** replace or delete **workspace** `.vscode/settings.json` during upgrades — that file is part of the repository, not under `%APPDATA%\Cursor\`. Keep it **tracked in git** so it is restored from history if someone removes it locally.

- **Settings Sync / cloud backup:** If you use Settings Sync, conflicts or a restored **user** `settings.json` can reintroduce non-UTF-8 defaults. After a sync or machine restore, confirm **User** settings still use UTF-8 for files, or rely on **this workspace** (workspace settings override user for keys that are set here when the folder is open).

- **Agents (Cursor):** When changing `.vscode/settings.json`, **do not remove or replace** the UTF-8 keys (`files.encoding`, `files.autoGuessEncoding`, language-specific `files.encoding`, or the PowerShell `-NoLogo` terminal / automation profiles) unless the user explicitly requests a different encoding or terminal policy. Prefer **adding** new keys alongside these.

- **Comments:** `.vscode/settings.json` may use `//` comments (JSONC). Preserve the comment block at the top when editing the file.

## Agents (Cursor Write / StrReplace / patches)

- **CRITICAL — Windows, Cursor `Write` (known product issue):** On Windows, the agent **`Write`** tool can save **UTF-16 LE** (nulls between ASCII) for **new** files or **full** file bodies — Python then fails with *source code cannot contain null bytes*. This is **not** fixed by repo hooks alone. **MUST NOT** use **`Write`** to **create** a new `.py` / `.ps1` / `.mdc` or to dump a **complete** new file in one call when the repo root is on Windows. **MUST** use **`StrReplace`** (or another patch-style edit) on files that already exist as UTF-8, or write via **terminal** / **`py -3 -c`… `open(path,\“w\”,encoding=\“utf-8\”)`** for new files. Community report: `https://forum.cursor.com/t/windows-cursor-write-tool-creates-utf-16-le-files-instead-of-utf-8/148992\` — upvote/comment there; this rule exists until Cursor ships a fix.

- **Emit UTF-8.** After any edit that might have used `Write` by mistake, run **`validate_repo_sources_utf8.py` (with `–fix` if needed)** on the path before ending the turn. If you see UTF-16 symptoms (odd spacing, `FF FE` BOM, or Python null-byte errors), **convert before finishing**: run the normalizer below on the affected directory or file.

- **Do not** paste long **`py -c "…decode(‘utf-16’…` one-liners** — they burn tokens and duplicate logic. For one or a few files use the repo validator only:

  • `py -3 Scripts\Robust Automated Processes\Python\validate_repo_sources_utf8.py --root . --fix --paths “relative/path/file.py”`

  • Re-check with the same command **without** `–fix` (exit **0**).

- **Do not** use PowerShell `Out-File` without `-Encoding utf8` (default is often UTF-16). Prefer `[System.IO.File]::WriteAllText($path, $text, [System.Text.UTF8Encoding]::new($false))` or the repo normalizer.

## Mandatory normalization (after bulk edits or unknown encoding)

From repo root:

- **PowerShell:** `Scripts\PowerShell\Run-NormalizeRepoTextUtf8.ps1`

- **Python:** `python Scripts\Tools\normalize_repo_text_utf8.py --root .`

Use `-DryRun` / `–dry-run` to list changes only. Narrow with `–extensions .md,.mdc,.py,.ps1`.

## Validation (MQL and CI)

- **`.mq5` / `.mqh`:** `Scripts\Robust Automated Processes\PowerShell\ROBUST_FILE_ENCODING_VALIDATOR.ps1` (uses `_EncodingValidation.ps1` — `Test-FileEncoding`, `Convert-FileToUTF8` when UTF-16 is detected).

- **Policy / Python / PowerShell / hooks:** `Scripts\Robust Automated Processes\Python\validate_repo_sources_utf8.py` — default and CI = **full tracked tree** of text-like files; **`–git-staged`** for commits; **`–policy-sources-only`** is legacy narrow scan only. See **`FILE_ENCODING_VIOLATIONS_NOTEBOOK.md`** for failure postmortems.

Agents must run the **normalizer** when documentation or scripts show mojibake, NUL-spacing, or encoding-related parse errors. Log notable incidents in **`FILE_ENCODING_VIOLATIONS_NOTEBOOK.md`** when the user asks for a durable record.

Hi there!

We detected that this may be a bug report, so we’ve moved your post to the Bug Reports category.

To help us investigate and fix this faster, could you edit your original post to include the details from the template below?

Bug Report Template - Click to expand

Where does the bug appear (feature/product)?

  • Cursor IDE
  • Cursor CLI
  • Background Agent (GitHub, Slack, Web, Linear)
  • BugBot
  • Somewhere else…

Describe the Bug
A clear and concise description of what the bug is.


Steps to Reproduce
How can you reproduce this bug? We have a much better chance at fixing issues if we can reproduce them!


Expected Behavior
What is meant to happen here that isn’t working correctly?


Screenshots / Screen Recordings
If applicable, attach images or videos (.jpg, .png, .gif, .mp4, .mov)


Operating System

  • Windows 10/11
  • MacOS
  • Linux

Version Information

  • For Cursor IDE: Menu → About Cursor → Copy
  • For Cursor CLI: Run agent about in your terminal
IDE:
Version: 2.xx.x
VSCode Version: 1.105.1
Commit: ......

CLI:
CLI Version 2026.01.17-d239e66

For AI issues: which model did you use?
Model name (e.g., Sonnet 4, Tab…)


For AI issues: add Request ID with privacy disabled
Request ID: f9a7046a-279b-47e5-ab48-6e8dc12daba1
For Background Agent issues, also post the ID: bc-…


Additional Information
Add any other context about the problem here.


Does this stop you from using Cursor?

  • Yes - Cursor is unusable
  • Sometimes - I can sometimes use Cursor
  • No - Cursor works, but with this issue

The more details you provide, the easier it is for us to reproduce and fix the issue. Thanks!

This is a known issue affecting file writes on Windows. The agent’s Write and StrReplace tools use their own encoding detection pipeline that can select the wrong encoding (UTF-16 LE, Windows-1252, or GBK depending on the environment), bypassing the files.encoding setting in .vscode/settings.json. Our team is actively investigating this.

Your workaround suite (Cursor hooks, git validators, PowerShell profile defaults) is impressively thorough and well-documented. For others hitting this, these are solid mitigations until the fix ships.

You may also find this related thread helpful, where other Windows users have reported similar encoding corruption and shared additional context:

Could you share your Cursor version (Menu > About Cursor > Copy)? That would help us confirm whether recent encoding-related fixes are included in your build.

Version: 3.1.17 (user setup)
VSCode Version: 1.105.1
Commit: fce1e9ab7844f9ea35793da01e634aa7e50bce90
Date: 2026-04-19T19:33:58.189Z
Layout: editor
Build Type: Stable
Release Track: Default
Electron: 39.8.1
Chromium: 142.0.7444.265
Node.js: 22.22.1
V8: 14.2.231.22-electron.0
OS: Windows_NT x64 10.0.26200

I will try to send a .md file of the agent assessments and actions taken to resolve the issue, but you can see that even constructing the .md file produced UTF-16 that had to be re-encoded to UTF-8 (below)

Expanding the export with the fuller sections that failed on the first write (likely size limits).

## Hooks detail (this session)

- `postToolUse` in `.cursor/hooks.json` runs **without** a narrow `matcher` so the UTF-8 hook can fire after all tools.

- `Enforce-RepoTextUtf8.ps1` **recursively** scans hook JSON for path strings (policy extensions) in addition to `file_path` / `tool_input`, to match varying Cursor versions.

## `Sync-VirtualEAFleetVirtualEaFromS3.ps1` (technical)

- `Get-RunningInstanceSubset`: `aws ec2 describe-instance-status` = one `–instance-ids` then each id as a separate argument; **not** a single merged string. Uses **ArrayList** + `ToArray()` for **PowerShell 5.1** compatibility.

- `Send-FleetSsmEncoded`: `aws ssm send-command` = same `instance-ids` pattern; `–parameters` = `file://` + normalized temp JSON path. stderr from `aws` under `$ErrorActionPreference = Stop`: wait loop used `cmd /c … 2>nul` for JSON fetch (already in script).

- UTF-8 header comment at top: mis-encoded **.ps1** is a common cause of “missing )” and `&` parse errors, not bad logic.

- **`-VerifyOnly`:** in session, SSM on Fleet-1 + Fleet-2; both **Success**; `imports_ok` and `VEA_CONSISTENCY_OK` output.

## `hull_top20_replay_audit.py` (production ranks)

- `_load_production_rows`: if `–assessment-json` points to a file, **read JSON**; else run **`_run_historical_assessment`**. `_resolve_assessment_json_path` = **abspath** (relative cwd). If user passed a path that does not exist, **stderr** warns before running miner. File must be **UTF-8** (use validator `–fix` if `Write` produced UTF-16).

*Meta: if this `.md` was first created with the agent **Write** tool on Windows, it may have been UTF-16 on disk; run `validate_repo_sources_utf8.py --fix` on the path to normalize to UTF-8 (same class of issue as the rest of this doc).*

Here’s the export file and what was done.

Path (for copy / PDF / other repo):
G:\MexAtlantic-MT5\Scripts\Diagnostic Bundle\Documentation\SESSION_EXPORT_2026-04-22_Cursor_UTF8_PowerShell_Fleet_Sync.md

Contents (synthesized, not a word‑for‑word log): executive summary, first‑write vs second‑write, PowerShell/Fleet, hooks, .gitattributes, Sync-VirtualEAFleetVirtualEaFromS3 details, hull_top20_replay_audit, Q&A, cross‑refs, and command snippets.

Encoding: The file was first saved as UTF-16 by the same tooling path; it was normalized to UTF-8 with validate_repo_sources_utf8.py --fix, and the doc now has a short meta line under Purpose about that. Re‑validate on that path passes now.

SESSION_EXPORT_2026-04-22_Cursor_UTF8_PowerShell_Fleet_Sync.txt (5.8 KB)

Thanks for the version info and the session export, Eric.

3.1.17 is current stable, and the encoding pipeline fix hasn’t shipped yet, so the behavior you’re seeing is expected on this version. Your workaround stack (post-write hooks, git validators, PowerShell profile defaults) remains the best mitigation for now.

The detail about the .md export itself being written as UTF-16 is a good example of the scope — it’s not limited to script files, it affects any file the Write tool creates on Windows. That data point has been noted on the tracking issue.

We’ll follow up in this thread when a fix ships.

Here is what happens hundreds of time per day on .py, .ini, .set, etc files
Fixing encoding: the editor saved prove_st_ini_set_deploy.py as UTF-16 again. Rewriting the file as UTF-8 with the CLI flag included.

Because they are mis-encoded I have processes not being able to read their configuration files and processes failing. It has cost me 100’s of hours, thousands of $'s in wasted AWS processing and hundreds of thousands of $'s in lost income.

Here’s an example of the consequences and losses I am incurring by failed encoding and why the fix is urgently needed by everyone.

Here’s what was going wrong and what we changed.

1. Parser errors on param(

The script on disk was saved as UTF-16 LE (typical when an editor writes “Unicode”). Windows PowerShell 5.1 often misparses that as UTF-8, so you get bogus errors like “Missing closing ‘)’” on param(.

The file was converted to UTF-8 (no BOM). After that, the script parses cleanly.

2. Split-Path -LiteralPath … -Parent

In PowerShell 5.1, -LiteralPath and -Parent don’t work together on Split-Path (ambiguous parameter set). That was replaced with:

Split-Path -Path $repo -Parent

3. Comment block

The long <# … #> help block was removed and replaced with short # lines so nothing in the header could confuse the parser.

4. [CmdletBinding()]

It was removed; it isn’t required for a deploy script and was contributing to odd binding messages in some runs.


What you should do

  1. Pull / save the updated Install-Ut100T1StValidationFromRepo.ps1.

  2. If the editor turns it back into UTF-16, either:

    • In VS Code / Cursor: Save with Encoding → UTF-8, or

    • From PowerShell (repo root):

$p = ‘G:\MexAtlantic-MT5\Scripts\PowerShell\Build\Install-Ut100T1StValidationFromRepo.ps1’

$t = [IO.File]::ReadAllText($p, [Text.Encoding]::Unicode)

[IO.File]::WriteAllText($p, $t, (New-Object System.Text.UTF8Encoding $false))

  1. Run:

cd G:\MexAtlantic-MT5\Scripts\PowerShell\Build

.\Install-Ut100T1StValidationFromRepo.ps1

On a machine without C:\MT5\T1 you should get:
Portable terminal not found: C:\MT5\T1 — that means the script is running correctly; use -Mt5Root / -Terminal if your portable tree is elsewhere.

I can add a utf8-only note for *.ps1 under .editorconfig if you want that enforced for the repo.

it looks like Composer 2.0 was a key part of the UTF-16 problem. Using Opus 4.7 and it found/fixed lots of files and thoroughly reviewed rules and notebooks.

Cleanup complete. Here’s the full ledger:

Conversions

43 files converted UTF‑16 LE → UTF‑8 (no BOM):

Batch Count Scope
A 9 _enc_test.py (no‑BOM, manual fix) + small JSON in repo root / Services\AWS\Scripts\PowerShell\ / ParityRun\
B 12 MQL5\Profiles\Tester\*.ini + .set
C 22 Profiles\Tester\*.set

All verified (re‑read post‑rewrite, 0 NULs, no BOM, sane ASCII headers like ;Expert / ; saved / {"…).

Plus deleted:

  • G:\MexAtlantic-MT5\100 (10 bytes of \ufeffFalse\r\n junk)

  • G:\MexAtlantic-MT5\.tmp_scan_utf16.ps1 (the file that the agent Write tool itself produced as UTF‑16 LE during this session — a live reproduction of the Cursor‑on‑Windows defect documented in your utf8-only-source.mdc)

Intentionally deferred per “HIGH‑RISK ONLY” scope (122 files)

Group Count Bytes Notes
_utf8_backup_* archived CSVs 79 ~24 GB Historical backup dumps
Staged Diagnostics CSV exports (Scripts\Diagnostic Bundle\Staged Diagnostics\…) 11 ~1.0 GB Big diagnostic snapshots
Scripts\Diagnostic Bundle\Python\DataCollection\…\*.csv + VirtualEA\PriceData\DE40_M1_Rates.csv 3 ~370 MB Dataset exports
.analysis\RUSS2000_PositionOutcomes.csv 1 ~100 MB One-off analysis export
Vendor MQL5 example sources (MQL5\Indicators\Examples\…, Scripts\Examples\…, Experts\Examples\…) 13 small MetaQuotes ships these as UTF‑16
Vendor MQL5\Experts\Examples\Math 3D\Sets\*.set 11 small Vendor examples
Vendor Profiles\SymbolSets\forex.*.set 3 tiny Vendor symbol sets
MT5 runtime: Tester\Agent-127.0.0.1-3000\common.ini, config\hotkeys.ini 2 tiny Re-written by MT5 each run

One borderline file you may want to circle back to: Scripts\Diagnostic Bundle\Python\VirtualEA\_smoke_ut100_compare\ut100_trade_log.json — 510 bytes, UTF‑16 LE BOM, small JSON in a directory the agent context often touches. It fell outside the strict “repo root / Services\AWS / ParityRun” filter. Worth a fix later if it ever shows up in attached context, but it’s small enough that one stray re-run of the producer (a PS Out-File somewhere) will recreate it.

Why this fixes the Anthropic block

  1. The previously‑attached files that the prompt classifier was choking on (NUL‑interleaved bytes inside JSON string values) are gone — those files are now plain UTF‑8.

  2. Every category that’s likely to be auto‑attached (open files, recently viewed, rule files, hook output, terminal capture, --paths echoes from your validator) is now clean for the high‑risk surface.

To stop this happening again

  • Start a new chat before continuing. Anthropic is sent the entire prior transcript on every turn, so even though disk is clean, this conversation still contains the UTF‑16 byte hex I dumped during diagnosis. A fresh chat drops that history.

  • The agent (me) just violated the Write rule on Windows. Going forward in your sessions, the agent must use StrReplace against existing UTF‑8 files, or create new files via py -3 -c "open(p,'w',encoding='utf-8').write(...)" over the terminal. Your postToolUse hook’s auto‑conversion is a safety net, not a fix — the broken file is still in the current turn’s tool‑result that gets sent to Anthropic. The hook only saves the next turn.

  • PowerShell Out-File default is UTF‑16 on Windows PS 5.1. Your repo has Scripts\Robust Automated Processes\PowerShell\Install-RepoPowerShellUtf8Defaults.ps1 — confirm that’s been run on this machine (it sets $PSDefaultParameterValues for Out-File / Export-Csv). Without it, every redirect > file.json you’ve ever done in this repo silently produced a UTF‑16 file.

  • MT5 writes .ini and .set as UTF‑16 natively when it generates tester profiles. They will reappear after every backtest in MQL5\Profiles\Tester\ and Profiles\Tester\. Your pre‑commit hook (pre-commit.ps1validate_repo_sources_utf8.py --git-staged) catches them before commit; just be aware that the on‑disk state will keep regressing until you re‑normalize after a fresh tester run.

Nothing in your git status (the staged set, the EAs, the rules) was disturbed beyond converting the encoding of the files listed above. The 165‑item list, the verified counts (43 fixed / 122 intentionally deferred), and the live reproduction of the Cursor Write defect should be enough corroboration to file an upstream Cursor issue if you want — your existing rule already links the community thread.

fix Scripts\Diagnostic Bundle\Python\VirtualEA\_smoke_ut100_compare\ut100_trade_log.json — 510 bytes, UTF‑16 LE BOM, small JSON in a directory the agent context often touches

Fixed and verified.

Before After
Size 510 bytes 254 bytes
First 2 bytes FF FE (UTF-16 LE BOM) 5B 0A ([\n — pure ASCII)
NUL bytes 254 0
Encoding UTF-16 LE UTF-8 (no BOM)
JSON valid n/a yes — 2-element list, first record {"action": "OPEN", "ticket": 1, "time": 1704067260, "direction": "BUY", "pri…

Scripts\Diagnostic Bundle\Python\VirtualEA\_smoke_ut100_compare\ut100_trade_log.json is now plain UTF-8 and the JSON content is intact (it’s a small smoke-test trade log with two events: one OPEN, presumably one CLOSE). Conversion was done through your own validate_repo_sources_utf8.py --fix per the utf8-only-source.mdc rule.

I’m seeing issues with UTF-16, seemed to start when Cursor switched to Composer 2 but now affects other models subsequently. Breaks running python scripts among other things.

This is on Mac, latest Cursor version updated today.

For anyone hitting this on Mac or Windows: the underlying bug is in how files get written to disk. Until the fix ships, starting a fresh chat session (to clear any corrupted byte history from the transcript) and avoiding the Write tool for new files are the best mitigations. On Windows, @user1590 's hook and validator setup (detailed in post #1) is comprehensive.

Where does the bug appear (feature/product)?

Cursor IDE

Describe the Bug

Cursor’s agent file-edit tools (Write and StrReplace) silently encode non-ASCII content as GBK bytes when writing to disk, even though every relevant encoding setting (OS locale, JVM defaults, Cursor user files.encoding, workspace .vscode/settings.json) is explicitly set to UTF-8. As a result, agent-generated source files (Java, XML, YAML, properties, Markdown, etc.) cannot be compiled or parsed by standard toolchains that default to UTF-8.

Importantly, applying StrReplace with a pure-ASCII new_string to an existing UTF-8 file does not corrupt the file’s existing non-ASCII bytes. So the corruption is isolated to the path that serializes tool-call arguments containing non-ASCII characters into bytes on disk, not the file-read path.

Steps to Reproduce

  1. Open a remote SSH workspace on a Linux host with LANG=en_US.UTF-8 and JVM defaults file.encoding=UTF-8.

  2. In Cursor user settings, set files.encoding=utf8 and files.autoGuessEncoding=false.

  3. Add .vscode/settings.json to the workspace with the same files.encoding=utf8 and per-language overrides for java/xml/yaml/properties/sql/markdown.

  4. Restart Cursor so the settings are applied.

  5. In an agent chat, ask the agent to use its Write tool to create a small file (any extension: .java, .md, .txt, .xml) whose content contains the three CJK Unified Ideographs at codepoints U+9500, U+552E, U+5355.

  6. After the agent writes the file, in the same shell run:

    python3 -c "import pathlib,sys; raw=pathlib.Path(sys.argv[1]).read_bytes(); \
    print('hex:', raw.hex(' ')); \
    print('utf8 hit:', raw.find(bytes.fromhex('e9 94 80 e5 94 ae e5 8d 95'))); \
    print('gbk  hit:', raw.find(bytes.fromhex('cf fa ca db b5 a5')))" /path/to/file
    
  7. Observe that the UTF-8 byte sequence is NOT found, while the GBK byte sequence IS found.

  8. Repeat the same procedure with StrReplace, providing a new_string that contains those three codepoints. Same result: GBK bytes on disk.

  9. Control: in the very same shell, run a python3 heredoc that does Path(...).write_text(text, encoding="utf-8") with the same content. The on-disk bytes are correct UTF-8 (e9 94 80 e5 94 ae e5 8d 95). This rules out the OS, shell, locale, JVM, and filesystem.

  10. Control: apply StrReplace with a pure-ASCII new_string to a file whose existing content includes UTF-8 non-ASCII bytes. Verify the existing non-ASCII bytes are preserved. This isolates the bug to tool-call argument serialization, not the file-read path.

Expected Behavior

When the agent’s Write or StrReplace tool is given content containing characters outside the ASCII range, the bytes written to disk should be the UTF-8 encoding of those characters (e.g. e9 94 80 e5 94 ae e5 8d 95 for U+9500 U+552E U+5355), regardless of the OS locale on the local Cursor host. The IDE setting files.encoding=utf8 should govern the agent write path, or the tool should hard-code UTF-8 since UTF-8 is the universally expected default for source code and text files.

Screenshots / Screen Recordings

Operating System

MacOS

Version Information

Version: 3.2.16 (Universal)
VSCode Version: 1.105.1
Commit: 3e548838cf824b70851dd3ef27d0c6aae371b3f0
Date: 2026-04-28T21:07:47.682Z
Layout: editor
Build Type: Stable
Release Track: Default
Electron: 39.8.1
Chromium: 142.0.7444.265
Node.js: 22.22.1
V8: 14.2.231.22-electron.0
OS: Darwin arm64 25.3.0

For AI issues: which model did you use?

Claude Opus 4.7 (Agent mode)
GPT 5.5 (Agent mode)

For AI issues: add Request ID with privacy disabled

Request ID: a22ba5d9-d11c-4625-b3d8-0345b7c60793

Additional Information

Environment specifics:

  • Remote workspace: Linux 5.10.134-19.1.al8.x86_64 (Aliyun)
  • Shell locale on the remote: LANG=en_US.UTF-8, all LC_* = en_US.UTF-8
  • JVM on the remote: file.encoding=UTF-8, native.encoding=UTF-8, sun.jnu.encoding=UTF-8

Workaround currently in use: routing every non-ASCII edit through the Shell tool with a python3 heredoc that does explicit write_text(..., encoding="utf-8"). This is reliable but completely defeats the purpose of having dedicated file-edit tools.

Suspected root cause: somewhere on the agent’s tool-call serialization path, a string→bytes conversion is using the platform default charset (which is GBK on Chinese-locale Windows hosts) instead of UTF-8. Likely culprits: String#getBytes() with no charset arg, Buffer.from(str) without ‘utf8’, Encoding.Default in .NET, or FileWriter / FileOutputStream(... , Charset.defaultCharset()). Forcing all agent-side file IO to StandardCharsets.UTF_8 should fix this.

Does this stop you from using Cursor

Yes - Cursor is unusable

This is yet another example of teh persistent and expensive and debilitating impacts of this bug.

The canvas is actually already UTF-8 (starts with impo = “import”). The subagent misread the encoding. Let me read it and update its content with the audit findings.

The original content is recoverable. Let me restore it properly.

File restored (18860 bytes, very close to original 18855, with only 2 replacement characters). Now let me read the restored canvas to understand its structure and update it.

The file IS correct UTF-8 on disk (starts with import {). The Read tool was showing garbled text likely due to caching. Let me now update the canvas content with the audit findings.

Good, I can see the structure. The canvas needs its status data updated to reflect the 2026-05-05 audit. Let me update the key fields.

The PowerShell escaping is problematic for regex patterns with quotes. Let me use a Python script file instead.

Now I can see the exact lines. Let me update the canvas with a Python script approach since PowerShell escaping is problematic.

PowerShell is interfering with the Python strings. Let me write a helper script to a temp file and run it.

UTF-16 from the Write tool on Windows. Let me fix the encoding and run again.

Canvas updated. Let me clean up the temp file and verify everything is complete.

All documentation updates are complete. Here’s the full summary:

i don’t recall who to give credit to for this solution, but here is CoPilot’s version of the fix

Yes — you’re thinking of the “Beta: Use Unicode UTF‑8 for worldwide language support” setting.

:white_check_mark: What the setting is

It’s a system locale option in Windows 10/11 that forces Windows to use UTF‑8 (code page 65001) as the default encoding for non‑Unicode (legacy) programs instead of a regional code page (like Windows‑1252). [[stackoverflow.com]]


:round_pushpin: Where you find it

You were right — it’s in a somewhat “hidden”/legacy place:

Control Panel path:

Control Panel → Region → Administrative tab → Change system locale…

Then check:

☑ Beta: Use Unicode UTF‑8 for worldwide language support

Or on newer Windows 11 builds it’s exposed in Settings:

Settings → Time & Language → Language & Region → (expand language) → toggle UTF‑8
``` [3](https://www.elevenforum.com/t/enable-or-disable-unicode-utf-8-for-worldwide-language-support-in-windows-11.38649/)

---

### ⚙️ What it actually does
Enabling it:

- Sets the **system ANSI and OEM code pages to UTF‑8 (65001)** [1](https://stackoverflow.com/questions/56419639/what-does-beta-use-unicode-utf-8-for-worldwide-language-support-actually-do)  
- Affects all **non‑Unicode applications** (legacy apps using `char*`, Win32 “A” APIs, etc.)
- Makes those apps interpret text as UTF‑8 instead of a regional encoding
- Effectively makes UTF‑8 the **system-wide default encoding baseline**

---

### ⚠️ Why it’s labeled “Beta”
Because:

- Some **older apps assume legacy code pages** and can break or show garbled text
- Certain fonts, rendering paths, and APIs don’t behave consistently under UTF‑8 system locale
- Compatibility issues still exist (especially with very old Win32 software)

So Microsoft never fully removed the “Beta” label.

---

### 🧠 Key distinction (important)
This setting **does NOT**:
- Change encoding for modern apps (they already use Unicode internally)
- Force every file or app to save in UTF‑8

It only changes how **legacy/non‑Unicode code interprets text** at the OS level.

---

### 🧩 Registry equivalent (for automation)
Since you prefer PowerShell-style control, this is what it maps to:

HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage

ACP = 65001 OEMCP = 65001 MACCP = 65001

(65501 / UTF‑8 code page) [4](https://bing.com/search?q=Use+Unicode+UTF-8+for+worldwide+language+support+windows+setting+description)

---

### ✔️ Bottom line
You remembered correctly:

> The Windows setting that enforces UTF‑8 system-wide (for legacy apps) is  
> **“Beta: Use Unicode UTF‑8 for worldwide language support.”**

It’s essentially a **global override of the system locale encoding to UTF‑8**.

---

If you want, I can show you a PowerShell snippet to detect or enforce it across systems.