Agent Write/StrReplace tools encode non-ASCII content as GBK on disk despite files.encoding=utf8

Where does the bug appear (feature/product)?

Cursor IDE

Describe the Bug

Cursor’s agent file-edit tools (Write and StrReplace) silently encode non-ASCII content as GBK bytes when writing to disk, even though every relevant encoding setting (OS locale, JVM defaults, Cursor user files.encoding, workspace .vscode/settings.json) is explicitly set to UTF-8. As a result, agent-generated source files (Java, XML, YAML, properties, Markdown, etc.) cannot be compiled or parsed by standard toolchains that default to UTF-8.

Importantly, applying StrReplace with a pure-ASCII new_string to an existing UTF-8 file does not corrupt the file’s existing non-ASCII bytes. So the corruption is isolated to the path that serializes tool-call arguments containing non-ASCII characters into bytes on disk, not the file-read path.
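To illustrate why GBK bytes on disk break UTF-8 toolchains, here is a minimal sketch using only the Python standard library. The helper name decodes_as_utf8 is made up for this example, and the three test codepoints are the ones used in the reproduction steps of this report. It shows that the GBK encoding of those characters is not well-formed UTF-8, so a strict UTF-8 decoder rejects it.

```python
text = "\u9500\u552e\u5355"  # the three test codepoints from this report

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print("utf-8:", utf8_bytes.hex(" "))  # e9 94 80 e5 94 ae e5 8d 95
print("gbk:  ", gbk_bytes.hex(" "))   # cf fa ca db b5 a5
print("utf-8 bytes decode as UTF-8:", decodes_as_utf8(utf8_bytes))
print("gbk bytes decode as UTF-8:  ", decodes_as_utf8(gbk_bytes))
```

A tool that decodes strictly fails on such a file, while a lenient one may substitute replacement characters instead.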

Steps to Reproduce

  1. Open a remote SSH workspace on a Linux host with LANG=en_US.UTF-8 and JVM defaults file.encoding=UTF-8.

  2. In Cursor user settings, set files.encoding=utf8 and files.autoGuessEncoding=false.

  3. Add .vscode/settings.json to the workspace with the same files.encoding=utf8 and per-language overrides for java/xml/yaml/properties/sql/markdown.

  4. Restart Cursor so the settings are applied.

  5. In an agent chat, ask the agent to use its Write tool to create a small file (any extension: .java, .md, .txt, .xml) whose content contains the three CJK Unified Ideographs at codepoints U+9500, U+552E, U+5355.

  6. After the agent writes the file, in the same shell run:

    python3 -c "import pathlib,sys; raw=pathlib.Path(sys.argv[1]).read_bytes(); \
    print('hex:', raw.hex(' ')); \
    print('utf8 hit:', raw.find(bytes.fromhex('e9 94 80 e5 94 ae e5 8d 95'))); \
    print('gbk  hit:', raw.find(bytes.fromhex('cf fa ca db b5 a5')))" /path/to/file
    
  7. Observe that the UTF-8 byte sequence is NOT found, while the GBK byte sequence IS found.

  8. Repeat the same procedure with StrReplace, providing a new_string that contains those three codepoints. Same result: GBK bytes on disk.

  9. Control: in the very same shell, run a python3 heredoc that does Path(...).write_text(text, encoding="utf-8") with the same content. The on-disk bytes are correct UTF-8 (e9 94 80 e5 94 ae e5 8d 95). This rules out the OS, shell, locale, JVM, and filesystem.

  10. Control: apply StrReplace with a pure-ASCII new_string to a file whose existing content includes UTF-8 non-ASCII bytes. Verify the existing non-ASCII bytes are preserved. This isolates the bug to tool-call argument serialization, not the file-read path.
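The byte checks in steps 6 to 10 can be sketched as a small reusable script. This is an illustration only: classify and high_bytes are hypothetical helper names, the file is created in a temporary directory rather than by the agent, and the ASCII-only edit is simulated with a byte-level replace as a stand-in for StrReplace.

```python
import pathlib
import tempfile

TEXT = "\u9500\u552e\u5355"  # test codepoints from step 5


def classify(raw: bytes, text: str = TEXT) -> str:
    """Report which encoding of `text` appears in `raw` (hypothetical helper)."""
    if raw.find(text.encode("utf-8")) != -1:
        return "utf-8"
    if raw.find(text.encode("gbk")) != -1:
        return "gbk"
    return "neither"


def high_bytes(raw: bytes) -> bytes:
    """Return only the non-ASCII bytes of `raw`, in order."""
    return bytes(b for b in raw if b >= 0x80)


with tempfile.TemporaryDirectory() as tmp:
    # Control from step 9: an explicit UTF-8 write from the same environment.
    control = pathlib.Path(tmp) / "control.txt"
    control.write_text(f"order: {TEXT}\n", encoding="utf-8")
    before = control.read_bytes()
    control_result = classify(before)

    # Stand-in for step 10: an ASCII-only edit must leave non-ASCII bytes intact.
    after = before.replace(b"order", b"invoice")
    preserved = high_bytes(before) == high_bytes(after)

print("control file encoding:", control_result)
print("non-ASCII bytes preserved by ASCII-only edit:", preserved)
```

To check a file the agent actually wrote, pass its bytes to classify, for example classify(pathlib.Path("/path/to/file").read_bytes()).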

Expected Behavior

When the agent’s Write or StrReplace tool is given content containing characters outside the ASCII range, the bytes written to disk should be the UTF-8 encoding of those characters (e.g. e9 94 80 e5 94 ae e5 8d 95 for U+9500 U+552E U+5355), regardless of the OS locale on the local Cursor host. Either the IDE setting files.encoding=utf8 should govern the agent write path, or the tool should hard-code UTF-8, since UTF-8 is the de facto default for source code and text files.

Screenshots / Screen Recordings

Operating System

macOS

Version Information

Version: 3.2.16 (Universal)
VSCode Version: 1.105.1
Commit: 3e548838cf824b70851dd3ef27d0c6aae371b3f0
Date: 2026-04-28T21:07:47.682Z
Layout: editor
Build Type: Stable
Release Track: Default
Electron: 39.8.1
Chromium: 142.0.7444.265
Node.js: 22.22.1
V8: 14.2.231.22-electron.0
OS: Darwin arm64 25.3.0

For AI issues: which model did you use?

Claude Opus 4.7 (Agent mode)
GPT 5.5 (Agent mode)

For AI issues: add Request ID with privacy disabled

Request ID: a22ba5d9-d11c-4625-b3d8-0345b7c60793

Additional Information

Environment specifics:

  • Remote workspace: Linux 5.10.134-19.1.al8.x86_64 (Aliyun)
  • Shell locale on the remote: LANG=en_US.UTF-8, all LC_* = en_US.UTF-8
  • JVM on the remote: file.encoding=UTF-8, native.encoding=UTF-8, sun.jnu.encoding=UTF-8

Workaround currently in use: routing every non-ASCII edit through the Shell tool with a python3 heredoc that does explicit write_text(..., encoding="utf-8"). This is reliable but completely defeats the purpose of having dedicated file-edit tools.
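A minimal sketch of what such a workaround script might look like is below. The helper name write_utf8 and the demo file name are made up for this example, and it writes bytes directly rather than calling write_text, so it is a variation on the workaround described above rather than the exact script in use.

```python
import pathlib
import tempfile


def write_utf8(path, text: str) -> int:
    """Write text as UTF-8 bytes and verify the result by reading it back.

    Returns the number of bytes written (hypothetical helper).
    """
    data = text.encode("utf-8")  # explicit charset, never the platform default
    target = pathlib.Path(path)
    target.write_bytes(data)     # bytes in, bytes out: no newline translation
    if target.read_bytes() != data:
        raise OSError(f"read-back mismatch for {target}")
    return len(data)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = pathlib.Path(tmp) / "demo.md"
    written = write_utf8(demo_path, "\u9500\u552e\u5355\n")
    demo_bytes = demo_path.read_bytes()

print(written, "bytes:", demo_bytes.hex(" "))
```

Encoding first and writing bytes keeps the on-disk result independent of any locale or default-encoding setting in the process that runs the script.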

Suspected root cause: somewhere on the agent’s tool-call serialization path, a string→bytes conversion is using a platform default charset (GBK on Chinese-locale Windows hosts, for example) instead of UTF-8. Possible culprits, depending on which runtime is involved: String#getBytes() with no charset argument, or a FileWriter / OutputStreamWriter constructed without an explicit charset, in Java; Encoding.Default in .NET Framework. Node.js Buffer.from(str) already defaults to UTF-8 when no encoding is given, so it is an unlikely source on its own. Forcing all agent-side file I/O to UTF-8 (StandardCharsets.UTF_8 in Java terms) should fix this.
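The suspected failure mode can be illustrated in Python. This is only a sketch of the hypothesis, not Cursor's actual code: to_bytes_default is a made-up stand-in for any API that silently picks up a platform default charset, and "gbk" is passed in explicitly to simulate such a default.

```python
def to_bytes_default(text: str, platform_charset: str) -> bytes:
    """Stand-in for an API that silently uses the platform default charset."""
    return text.encode(platform_charset)


def to_bytes_utf8(text: str) -> bytes:
    """The conversion the report expects: always UTF-8."""
    return text.encode("utf-8")


ascii_arg = "public class Order {}"
cjk_arg = "\u9500\u552e\u5355"

# GBK and UTF-8 agree on ASCII, so ASCII-only arguments hide the problem.
ascii_same = to_bytes_default(ascii_arg, "gbk") == to_bytes_utf8(ascii_arg)
# They disagree on CJK characters, which is where the corruption shows up.
cjk_same = to_bytes_default(cjk_arg, "gbk") == to_bytes_utf8(cjk_arg)

print("ASCII argument identical under both charsets:", ascii_same)
print("CJK argument identical under both charsets:  ", cjk_same)
```

If the hypothesis holds, a conversion of this kind would produce correct output for ASCII-only arguments and wrong bytes only for non-ASCII ones, which matches the observation that the problem appears only when tool-call arguments contain non-ASCII characters.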

Does this stop you from using Cursor

Yes - Cursor is unusable

A post was merged into an existing topic: Spending tokens to repair UTF-16 that should have been UTF-8