Japanese Characters Occasionally Corrupted When Using "Apply" Button

I’ve encountered an issue when using the “Apply” button to display code change predictions. Occasionally, instead of showing the expected changes, the prediction incorrectly modifies unrelated parts of the code. Specifically, it sometimes converts portions of Japanese string literals into “��” characters.

This bug appears to be intermittent and doesn’t always occur. However, when it does happen, it affects areas of the code that should not be modified at all, rather than the sections where changes are expected.

Steps to reproduce:

  1. Open a file containing Japanese string literals in Cursor.
  2. Use AI chat or composer.
  3. Click the “Apply” button to generate change predictions.
  4. Observe that sometimes, instead of showing relevant changes, parts of Japanese strings are corrupted into “��” characters in unrelated areas of the code.

Expected behavior: The “Apply” button should only show predictions for relevant code changes and should not modify or corrupt existing Japanese string literals.

Actual behavior: Occasionally, clicking “Apply” results in corrupted Japanese characters (��) in unrelated parts of the code.

Additional information:

  • This issue occurs intermittently and is not consistently reproducible.
  • It only affects Japanese string literals, not other parts of the code.

I have a similar issue with apostrophes: when I hit “Apply”, it replaces one type of apostrophe with a plain single quote, even though I didn’t ask it to edit that part of the code.

I have the same “���” issue with Version: 0.42.5.

A similar issue was reported in “Weird ascii characters in prompt output from new claude-3.5 sonnet model”.

@admins please take a look.

Same here with Chinese characters.

e.g. 日志管理 can be split into 日���管理

This issue might be related to the tokenizer splitting UTF-8 characters into multiple tokens, which are then streamed as separate messages and not properly reassembled by Cursor.
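
To illustrate that hypothesis, here is a minimal Python sketch. It is not Cursor's actual code: the chunk boundaries and the function names `decode_naively` and `decode_incrementally` are assumptions made up for the example. It shows how decoding each streamed chunk on its own turns a split multi-byte character into replacement characters, while a stateful incremental decoder reassembles it.

```python
import codecs


def decode_naively(chunks):
    """Decode every chunk independently, as if each one were complete text.

    Bytes of a character that straddles a chunk boundary cannot be decoded
    on their own, so each fragment becomes U+FFFD (the replacement character).
    """
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


def decode_incrementally(chunks):
    """Decode with a stateful decoder that buffers incomplete byte sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder; any truly truncated tail is replaced here.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    data = "日志管理".encode("utf-8")  # 4 characters, 3 bytes each
    # Hypothetical chunk boundaries that cut through the second character.
    chunks = [data[:4], data[4:5], data[5:]]
    print(decode_naively(chunks))
    print(decode_incrementally(chunks))
```

With these particular boundaries the naive version reproduces the 日���管理 corruption from the example above, and the incremental version returns the original string. If the streaming layer is the cause, buffering incomplete byte sequences before decoding would be the kind of fix to look for.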

It has been a bit of a hassle, as I’ve had to manually fix the corrupted characters every time :joy:. It would be amazing if the team could prioritize this fix—it would save a lot of time and frustration!

Version: 0.44.8
VSCode Version: 1.93.1
Commit: f3b5a63019e4e2283033b4db987a35f8413c7570
Date: 2024-12-22T05:48:08.427Z
Electron: 30.5.1
ElectronBuildId: undefined
Chromium: 124.0.6367.243
Node.js: 20.16.0
V8: 12.4.254.20-electron.0

Same here with Korean letters.
Corrupted Korean values are suggested even where the original text has no problem.

Version: 0.44.8
VSCode Version: 1.93.1
Commit: f3b5a63019e4e2283033b4db987a35f8413c7570
Date: 2024-12-22T05:48:08.427Z
Electron: 30.5.1
Chromium: 124.0.6367.243
Node.js: 20.16.0
V8: 12.4.254.20-electron.0
OS: Darwin arm64 24.2.0

Same exact issue here as well. It is so weird to see Chinese, Korean, and Japanese users all reporting the same thing at the same time.

Many Chinese users reported this issue 5 months ago, with no reply and no solution. The problem seems to be caused by the step where Cursor matches the generated result against the file and swaps in the diff. Some experienced coders have said that Cursor trained a small language model to do that, perhaps with its own text encoder? If so, that would be the culprit behind all of these issues. Devs, please make this issue a priority!

Looks like it’s been fixed; it hasn’t popped up again for a while.