Where does the bug appear (feature/product)?
Cursor IDE
Describe the Bug
Cursor’s agent file-edit tools (Write and StrReplace) silently encode non-ASCII content as GBK bytes when writing to disk, even though every relevant encoding setting (OS locale, JVM defaults, Cursor user files.encoding, workspace .vscode/settings.json) is explicitly set to UTF-8. As a result, agent-generated source files (Java, XML, YAML, properties, Markdown, etc.) cannot be compiled or parsed by standard toolchains that default to UTF-8.
Importantly, applying StrReplace with a pure-ASCII new_string to an existing UTF-8 file does not corrupt the file’s existing non-ASCII bytes. So the corruption is isolated to the path that serializes tool-call arguments containing non-ASCII characters into bytes on disk, not the file-read path.
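To make the byte-level difference concrete, here is a minimal sketch that encodes the three codepoints used in the reproduction with both codecs from Python's standard library. It only illustrates the two encodings and does not involve Cursor's tools.

```python
# The three CJK codepoints used in the reproduction steps.
text = "\u9500\u552e\u5355"

utf8_bytes = text.encode("utf-8")  # what should end up on disk
gbk_bytes = text.encode("gbk")     # what the agent tools write, per this report

print("utf-8:", utf8_bytes.hex(" "))
print("gbk:  ", gbk_bytes.hex(" "))

# A strict UTF-8 decoder rejects the GBK bytes, which is why
# UTF-8 toolchains cannot read the resulting files correctly.
try:
    gbk_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print("GBK bytes decode as UTF-8:", decodes_as_utf8)
```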
Steps to Reproduce
- Open a remote SSH workspace on a Linux host with LANG=en_US.UTF-8 and JVM defaults file.encoding=UTF-8.
- In Cursor user settings, set files.encoding=utf8 and files.autoGuessEncoding=false.
- Add .vscode/settings.json to the workspace with the same files.encoding=utf8 and per-language overrides for java/xml/yaml/properties/sql/markdown.
- Restart Cursor so the settings are applied.
- In an agent chat, ask the agent to use its Write tool to create a small file (any extension: .java, .md, .txt, .xml) whose content contains the three CJK Unified Ideographs at codepoints U+9500, U+552E, U+5355.
- After the agent writes the file, in the same shell run:
  python3 -c "import pathlib,sys; raw=pathlib.Path(sys.argv[1]).read_bytes(); print('hex:', raw.hex(' ')); print('utf8 hit:', raw.find(bytes.fromhex('e9 94 80 e5 94 ae e5 8d 95'))); print('gbk hit:', raw.find(bytes.fromhex('cf fa ca db b5 a5')))" /path/to/file
- Observe that the UTF-8 byte sequence is NOT found (find returns -1), while the GBK byte sequence IS found.
- Repeat the same procedure with StrReplace, providing a new_string that contains those three codepoints. Same result: GBK bytes on disk.
- Control: in the very same shell, run a python3 heredoc that does Path(...).write_text(text, encoding="utf-8") with the same content. The on-disk bytes are correct UTF-8 (e9 94 80 e5 94 ae e5 8d 95). This rules out the OS, shell, locale, JVM, and filesystem.
- Control: apply StrReplace with a pure-ASCII new_string to a file whose existing content includes UTF-8 non-ASCII bytes. Verify the existing non-ASCII bytes are preserved. This isolates the bug to tool-call argument serialization, not the file-read path.
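For repeated runs, the byte check and the UTF-8 control from the steps above can be written as a small script. This is only a sketch: the names classify and classify_file are made up for illustration, and the control file is written to a temporary directory rather than the real workspace.

```python
import pathlib
import tempfile

UTF8_EXPECTED = bytes.fromhex("e9 94 80 e5 94 ae e5 8d 95")  # U+9500 U+552E U+5355 in UTF-8
GBK_EXPECTED = bytes.fromhex("cf fa ca db b5 a5")            # the same characters in GBK


def classify(raw: bytes) -> str:
    """Report which encoding of the three test characters appears in raw."""
    if UTF8_EXPECTED in raw:
        return "utf-8"
    if GBK_EXPECTED in raw:
        return "gbk"
    return "neither"


def classify_file(path: str) -> str:
    """Read the file as raw bytes (no decoding) and classify it."""
    return classify(pathlib.Path(path).read_bytes())


# Control: an explicit UTF-8 write from Python should classify as "utf-8".
with tempfile.TemporaryDirectory() as tmp:
    control = pathlib.Path(tmp) / "control.txt"
    control.write_text("\u9500\u552e\u5355\n", encoding="utf-8")
    control_result = classify_file(str(control))

print("control:", control_result)
```

Pointing classify_file at a file produced by the agent's Write or StrReplace tool would then report "gbk" if the bug reproduces.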
Expected Behavior
When the agent’s Write or StrReplace tool is given content containing characters outside the ASCII range, the bytes written to disk should be the UTF-8 encoding of those characters (e.g. e9 94 80 e5 94 ae e5 8d 95 for U+9500 U+552E U+5355), regardless of the OS locale on the local Cursor host. The IDE setting files.encoding=utf8 should govern the agent write path, or the tool should hard-code UTF-8 since UTF-8 is the universally expected default for source code and text files.
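As an illustration of the expected result, the sketch below models a replace operation that decodes and encodes as UTF-8 explicitly. The function str_replace_utf8 is a hypothetical stand-in, not Cursor's implementation, and the file name and contents are invented for the example.

```python
import pathlib
import tempfile


def str_replace_utf8(path: pathlib.Path, old_string: str, new_string: str) -> None:
    """Hypothetical stand-in for the expected behavior: UTF-8 in, UTF-8 out."""
    text = path.read_bytes().decode("utf-8")
    if old_string not in text:
        raise ValueError("old_string not found")
    # Encode explicitly; never rely on the platform default charset.
    updated = text.replace(old_string, new_string, 1)
    path.write_bytes(updated.encode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "Example.java"
    target.write_bytes('String label = "TODO";\n'.encode("utf-8"))
    str_replace_utf8(target, "TODO", "\u9500\u552e\u5355")
    result = target.read_bytes()

print(result.hex(" "))
```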
Screenshots / Screen Recordings
Operating System
MacOS
Version Information
Version: 3.2.16 (Universal)
VSCode Version: 1.105.1
Commit: 3e548838cf824b70851dd3ef27d0c6aae371b3f0
Date: 2026-04-28T21:07:47.682Z
Layout: editor
Build Type: Stable
Release Track: Default
Electron: 39.8.1
Chromium: 142.0.7444.265
Node.js: 22.22.1
V8: 14.2.231.22-electron.0
OS: Darwin arm64 25.3.0
For AI issues: which model did you use?
Claude Opus 4.7 (Agent mode)
GPT 5.5 (Agent mode)
For AI issues: add Request ID with privacy disabled
Request ID: a22ba5d9-d11c-4625-b3d8-0345b7c60793
Additional Information
Environment specifics:
- Remote workspace: Linux 5.10.134-19.1.al8.x86_64 (Aliyun)
- Shell locale on the remote: LANG=en_US.UTF-8, all LC_* = en_US.UTF-8
- JVM on the remote: file.encoding=UTF-8, native.encoding=UTF-8, sun.jnu.encoding=UTF-8
Workaround currently in use: routing every non-ASCII edit through the Shell tool with a python3 heredoc that does explicit write_text(..., encoding="utf-8"). This is reliable but completely defeats the purpose of having dedicated file-edit tools.
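The body of such a heredoc might look roughly like the following sketch. The helper name write_utf8 and the read-back verification are additions for illustration, not part of the original workaround, and the demo writes to a temporary directory.

```python
import pathlib
import tempfile


def write_utf8(path, text: str) -> int:
    """Write text as UTF-8 and verify the bytes that reached the disk."""
    path = pathlib.Path(path)
    # newline="\n" disables newline translation so the bytes match exactly.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)
    on_disk = path.read_bytes()
    if on_disk != text.encode("utf-8"):
        raise RuntimeError(f"{path} does not contain the expected UTF-8 bytes")
    return len(on_disk)


with tempfile.TemporaryDirectory() as tmp:
    written = write_utf8(pathlib.Path(tmp) / "notes.md", "# \u9500\u552e\u5355\n")

print("bytes written:", written)
```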
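Suspected root cause: somewhere on the agent’s tool-call serialization path, a string→bytes conversion appears to use a platform default charset (GBK on Chinese-locale Windows hosts, for example) instead of UTF-8. Candidate patterns include String#getBytes() with no charset argument, a FileWriter or OutputStreamWriter constructed without an explicit charset (on Java versions before 18, where the default charset follows the platform), or Encoding.Default on .NET Framework. Node's Buffer.from(str) already defaults to 'utf8', so it is a less likely culprit unless an explicit transcoding step is involved. Forcing all agent-side file I/O to UTF-8 (for example StandardCharsets.UTF_8 in Java) should fix this.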
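The suspected pattern can be sketched in Python as an analogy. Both function names are hypothetical, and the platform_charset parameter only simulates whatever default a host would report; nothing here is taken from Cursor's code.

```python
text = "\u9500\u552e\u5355"


def to_bytes_platform_default(value: str, platform_charset: str) -> bytes:
    """Suspected buggy pattern: output depends on the charset the host reports."""
    return value.encode(platform_charset)


def to_bytes_fixed(value: str) -> bytes:
    """Fixed pattern: always UTF-8, independent of the host."""
    return value.encode("utf-8")


# Simulate two hosts; "gbk" stands in for a Chinese-locale platform default.
for charset in ("utf-8", "gbk"):
    print(charset, "->", to_bytes_platform_default(text, charset).hex(" "))
print("fixed ->", to_bytes_fixed(text).hex(" "))
```

The same string yields different bytes depending on the simulated host, while the explicit-UTF-8 variant is stable, which matches the symptom described in this report.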
Does this stop you from using Cursor
Yes - Cursor is unusable
