Three weeks ago Cursor began producing corrupted UTF-16 encoded files when they should have been UTF-8. The thrashing from UTF-16 to UTF-8 and back is costing time, money and corrupting code, scripts and documentation. I stated a notebook to capture the issues and fixes.
Cursor blames Windows, PowerShell, and the Cursor writer app while asserting that it cannot ensure all documents are produced as UTF-8, even when there are mandatory rules to do so.
Here is the notebook for everyone reference and sharing with your Agent:
description: UTF-8 only for all text sources — never UTF-16; agent write and validation workflow. Incident log — Scripts/Diagnostic Bundle/Documentation/FILE_ENCODING_VIOLATIONS_NOTEBOOK.md
globs: **/*.{mq5,mqh,ps1,py,md,mdc,txt,csv,ini,set}
alwaysApply: true
-–
# UTF-8 only (never UTF-16)
**Incident log (timestamped sessions, violations, fixes):** `Scripts/Diagnostic Bundle/Documentation/FILE_ENCODING_VIOLATIONS_NOTEBOOK.md` (cross-reference: `.cursorrules` → **CRITICAL: File Encoding Requirements**; fleet layout + binary parity: `.cursor/rules/fleet-tester-deploy-consistency.mdc`).
- **MUST NOT** save project text as **UTF-16** (LE/BE, with or without BOM). UTF-16 breaks Python (`SyntaxError: source code cannot contain null bytes`), PowerShell, linters, AWS SSM JSON parameters, and git diffs.
- **MUST** save as **UTF-8 without BOM** for scripts (`.ps1`, `.py`, `.sh`, `.bat`), Cursor rules (`.mdc`), Markdown, and JSON/YAML where tooling is sensitive. UTF-8 with BOM (`EF BB BF`) is tolerated but **normalize** strips BOM to one standard when scripts are touched.
- **Strategy Tester launch `.ini` and Expert parameter `.set` files** are **text** and **MUST** be UTF-8 (same UTF-16 prohibition). If they are **git-tracked**, `validate_repo_sources_utf8.py` enforces this (`.ini` and `.set` are in the full-repo text suffix list). On Fleet hosts, write them with UTF-8-safe paths only (see **FILE_ENCODING_VIOLATIONS_NOTEBOOK.md** for incident patterns).
- **EA `.ex5` and MetaTrader `*.exe`** are **binaries** — do not run text encoding conversion on them. **Consistency across Fleets** means **same deployed build / bytes** (see `.cursor/rules/fleet-tester-deploy-consistency.mdc`).
- Default to **ASCII** in source unless a clear requirement exists for non-ASCII characters.
## Cursor / VS Code
- **This repo commits** `.vscode/settings.json` with `files.encoding` = `utf8`, `files.autoGuessEncoding` = `false`, and the same for `[powershell]`, `[python]`, `[markdown]`. Keep it when editing so saves stay UTF-8 and the editor does not “guess” UTF-16 from BOM-less legacy files.
- **User settings** (`%APPDATA%\Cursor\User\settings.json`): if the global default is not UTF-8, align it with the workspace or UTF-16 saves can still occur outside this folder.
## Cursor hooks and git (mandatory)
- **Hooks (repo):** `.cursor/hooks.json` invokes `.cursor/hooks/Enforce-RepoTextUtf8.ps1` on `afterFileEdit`, `afterTabFileEdit`, and **every** `postToolUse` (no matcher — Cursor tool JSON shape varies; the script **recursively** finds path strings for `.ps1`/`.mdc`/`.py`/etc. under the repo and normalizes UTF-16 to UTF-8 via `Scripts\Robust Automated Processes\PowerShell\_EncodingValidation.ps1`). After editing hook files, **reload the Cursor window** so hooks reload.
- **Git (exhaustive — not only `.py`/`.ps1` EA paths):** `core.hooksPath=.githooks`. **`pre-commit`** runs `validate_repo_sources_utf8.py --git-staged` against **every staged text-like file** (suffixes/dotfiles matching the validator’s full-repo rules: `.md`, `.mdc`, `.json`, `.yml`, …). **`pre-push`** and **CI** run **`–full-repo`** on **all tracked** text-like files. Invalid UTF-8 / UTF-16 **fails the hook/build**. The only deliberate bypass is `git commit --no-verify` (avoid; use only in emergencies and fix encoding before the next commit).
- **Windows PowerShell defaults (stop UTF-16 at the source):** run `Scripts\Robust Automated Processes\PowerShell\Install-RepoGitHooks.ps1` once per clone — it also runs **`Install-RepoPowerShellUtf8Defaults.ps1`**, which adds **`CurrentUserAllHosts`** dot-source of **`Initialize-Utf8Session.ps1`** (sets `$PSDefaultParameterValues` so `Out-File` / `Export-Csv` / etc. use UTF-8 on Windows PowerShell 5.1). Use **`-SkipPowerShellUtf8ProfileDefaults`** only if you must not touch profiles. Open a **new terminal** after install.
## Preserving UTF-8 configuration (Cursor updates, sync, agents)
- **Product updates:** Cursor/VS Code **do not** replace or delete **workspace** `.vscode/settings.json` during upgrades — that file is part of the repository, not under `%APPDATA%\Cursor\`. Keep it **tracked in git** so it is restored from history if someone removes it locally.
- **Settings Sync / cloud backup:** If you use Settings Sync, conflicts or a restored **user** `settings.json` can reintroduce non-UTF-8 defaults. After a sync or machine restore, confirm **User** settings still use UTF-8 for files, or rely on **this workspace** (workspace settings override user for keys that are set here when the folder is open).
- **Agents (Cursor):** When changing `.vscode/settings.json`, **do not remove or replace** the UTF-8 keys (`files.encoding`, `files.autoGuessEncoding`, language-specific `files.encoding`, or the PowerShell `-NoLogo` terminal / automation profiles) unless the user explicitly requests a different encoding or terminal policy. Prefer **adding** new keys alongside these.
- **Comments:** `.vscode/settings.json` may use `//` comments (JSONC). Preserve the comment block at the top when editing the file.
## Agents (Cursor Write / StrReplace / patches)
- **CRITICAL — Windows, Cursor `Write` (known product issue):** On Windows, the agent **`Write`** tool can save **UTF-16 LE** (nulls between ASCII) for **new** files or **full** file bodies — Python then fails with *source code cannot contain null bytes*. This is **not** fixed by repo hooks alone. **MUST NOT** use **`Write`** to **create** a new `.py` / `.ps1` / `.mdc` or to dump a **complete** new file in one call when the repo root is on Windows. **MUST** use **`StrReplace`** (or another patch-style edit) on files that already exist as UTF-8, or write via **terminal** / **`py -3 -c`… `open(path,\“w\”,encoding=\“utf-8\”)`** for new files. Community report: `https://forum.cursor.com/t/windows-cursor-write-tool-creates-utf-16-le-files-instead-of-utf-8/148992\` — upvote/comment there; this rule exists until Cursor ships a fix.
- **Emit UTF-8.** After any edit that might have used `Write` by mistake, run **`validate_repo_sources_utf8.py` (with `–fix` if needed)** on the path before ending the turn. If you see UTF-16 symptoms (odd spacing, `FF FE` BOM, or Python null-byte errors), **convert before finishing**: run the normalizer below on the affected directory or file.
- **Do not** paste long **`py -c "…decode(‘utf-16’…` one-liners** — they burn tokens and duplicate logic. For one or a few files use the repo validator only:
-
`py -3 Scripts\Robust Automated Processes\Python\validate_repo_sources_utf8.py --root . --fix --paths “relative/path/file.py”`
-
Re-check with the same command **without** `–fix` (exit **0**).
- **Do not** use PowerShell `Out-File` without `-Encoding utf8` (default is often UTF-16). Prefer `[System.IO.File]::WriteAllText($path, $text, [System.Text.UTF8Encoding]::new($false))` or the repo normalizer.
## Mandatory normalization (after bulk edits or unknown encoding)
From repo root:
- **PowerShell:** `Scripts\PowerShell\Run-NormalizeRepoTextUtf8.ps1`
- **Python:** `python Scripts\Tools\normalize_repo_text_utf8.py --root .`
Use `-DryRun` / `–dry-run` to list changes only. Narrow with `–extensions .md,.mdc,.py,.ps1`.
## Validation (MQL and CI)
- **`.mq5` / `.mqh`:** `Scripts\Robust Automated Processes\PowerShell\ROBUST_FILE_ENCODING_VALIDATOR.ps1` (uses `_EncodingValidation.ps1` — `Test-FileEncoding`, `Convert-FileToUTF8` when UTF-16 is detected).
- **Policy / Python / PowerShell / hooks:** `Scripts\Robust Automated Processes\Python\validate_repo_sources_utf8.py` — default and CI = **full tracked tree** of text-like files; **`–git-staged`** for commits; **`–policy-sources-only`** is legacy narrow scan only. See **`FILE_ENCODING_VIOLATIONS_NOTEBOOK.md`** for failure postmortems.
Agents must run the **normalizer** when documentation or scripts show mojibake, NUL-spacing, or encoding-related parse errors. Log notable incidents in **`FILE_ENCODING_VIOLATIONS_NOTEBOOK.md`** when the user asks for a durable record.
