Has Cursor gotten "dumb"?

unicomp21 · February 14, 2024, 1:27pm

In the last week or so, things seem to be going downhill? Did something change in the gpt4 model we’re using?

tironejohnson · February 14, 2024, 8:59pm

I’ve been experiencing a degradation in the last 4 days too.

truell20 · February 15, 2024, 1:49am

Hello! Any particular features or situations in which Cursor has felt dumb? Screenshots and specific examples are super helpful to see if it’s something we can fix.

We did complete the switch to GPT-4 Turbo a couple of weeks ago – after testing and modifying our prompts (especially for Command-K) – to get the speed and updated knowledge cutoff benefits.

unicomp21 · February 15, 2024, 3:08am

I wish this was something I could repro by writing a set of tests. What I’m speaking to is the experience w/ copilot++. It seems to be happening more frequently now that the edit/diff simply does not get applied.

truell20 · February 15, 2024, 3:11am

Ack, gotcha. Could you say a bit more about what’s going wrong? This will help us debug.

I.e. do the suggestions seem more wrong? Or are you trying to accept the suggestion but it’s not getting applied? Are you seeing no suggestions? Or something else?

unicomp21 · February 15, 2024, 3:16am

Both, but the more obvious (less subjective) one is the deltas not getting applied via copilot++

unicomp21 · February 15, 2024, 3:17am

Correct, I accept, it does the scan thing, and no changes get applied.

truell20 · February 15, 2024, 3:19am

Thank you! If you can, a screen recording or steps to reproduce would be incredibly helpful.

unicomp21 · February 15, 2024, 3:21am

I’ll try to grab a screenshot next time.

unicomp21 · February 16, 2024, 11:44am

Just realized I might be confused about my naming. What should I call the auto-editing that occurs when applying a diff? Is copilot++ different from this? @truell20

nightscape · February 20, 2024, 1:08pm

@truell20 you’re probably aware of this already, but just in case:
The developer of Aider wrote a blog post about a special diff format he uses so that GPT-4 produces patches that apply successfully more often.
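For readers who have not seen the idea before: one general way to make model-written patches easier to apply is to locate a hunk by its content instead of trusting line numbers. The sketch below is only an illustration of that idea. It is not taken from the blog post, from Aider, or from Cursor, and the function name `apply_hunk_by_content` is hypothetical.

```python
def apply_hunk_by_content(source: str, hunk_body: str) -> str:
    """Apply one unified-diff hunk body to `source`, ignoring line numbers.

    `hunk_body` holds only the hunk lines (prefixed with ' ', '-' or '+'),
    without the '@@ ... @@' header. This is an illustrative assumption about
    the input shape, not a documented format of any tool.
    """
    src = source.splitlines()
    hunk = hunk_body.splitlines()

    # Lines the hunk expects to find in the file (context + removals).
    before = [line[1:] for line in hunk if line[:1] in (" ", "-")]
    # Lines that should be there afterwards (context + additions).
    after = [line[1:] for line in hunk if line[:1] in (" ", "+")]

    # Find every position where the expected block occurs in the file.
    matches = [
        i
        for i in range(len(src) - len(before) + 1)
        if src[i : i + len(before)] == before
    ]
    if len(matches) != 1:
        # Zero matches: the model misquoted the file. Several matches: the
        # hunk has too little context to be placed safely. Either way the
        # edit is rejected and the file stays unchanged.
        raise ValueError(f"expected exactly one match, found {len(matches)}")

    start = matches[0]
    patched = src[:start] + after + src[start + len(before) :]
    return "\n".join(patched) + "\n"


if __name__ == "__main__":
    print(apply_hunk_by_content("a\nb\nc\nd\n", " b\n-c\n+C\n d\n"), end="")
```

In a scheme like this, a hunk whose context lines do not match the file exactly is rejected, so from the outside it looks as if the edit silently did nothing.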

unicomp21 · February 20, 2024, 1:29pm

@truell20 cursor is using a proprietary model for this sort of stuff, right? current gpt-4 doesn’t handle it reliably yet?

github.com/openai/evals

Add eval for unified patch diffs

openai:main ← AndreBaltazar8:unified-patch-eval

opened 09:17PM - 31 Mar 23 UTC

AndreBaltazar8

+11 -0

# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PR's with ~5-10 example that we can then run the evals on and share the results with you so you know how your eval does with GPT-4 before writing all 100 examples.

## Eval details 📑

### Eval name

Unified Patch Diffs

### Eval description

Make a correct Unified Patch from the input file and a malformed patch

### What makes this a useful eval?

The model seems to be bad at differentiating lines and counts. This example has both at the same time, it needs to know how many lines were in the file and how many got added or removed.

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] Include at least 100 high quality examples (it is okay to only contribute 5-10 meaningful examples and have us test them with GPT-4 before adding all 100)

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

This seems invaluable for it to know how to generate proper patches of code, it needs to know how to count lines so it can produce correct patches that could be applied to code. Just knowing how to count lines in general is a very good skill for other aspects as well.

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

### Submit eval

- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval

```jsonl
{"input": [{"role": "system", "content": "Task: Output the correct patch from the given input and malformed patch with incorrect line numbers in Unified Diff format."}, {"role": "user", "content": "Given this input file ```\n701455\n753947\n952545\n694658\n724002\n158388\n379555\n997602\n833233\n689750\n417479\n468027\n69993\n```\n and this malformed patch ```\n--- file_58566.txt\n+++ file_58566.txt\n@@ X,X X,X @@\n 701455\n 753947\n 952545\n-694658\n 724002\n 158388\n-379555\n 997602\n 833233\n 689750\n``` output just the corrected patch, nothing else."}], "ideal": ["--- file_58566.txt\n+++ file_58566.txt\n@@ -1,10 +1,8 @@\n 701455\n 753947\n 952545\n-694658\n 724002\n 158388\n-379555\n 997602\n 833233\n 689750\n"]}
{"input": [{"role": "system", "content": "Task: Output the correct patch from the given input and malformed patch with incorrect line numbers in Unified Diff format."}, {"role": "user", "content": "Given this input file ```\n137132\n958115\n942829\n971026\n123400\n706739\n231446\n112618\n75014\n250593\n862412\n657169\n```\n and this malformed patch ```\n--- file_697676.txt\n+++ file_697676.txt\n@@ X,X X,X @@\n 958115\n 942829\n 971026\n-123400\n 706739\n 231446\n 112618\n 75014\n-250593\n 862412\n 657169\n``` output just the corrected patch, nothing else."}], "ideal": ["--- file_697676.txt\n+++ file_697676.txt\n@@ -2,11 +2,9 @@\n 958115\n 942829\n 971026\n-123400\n 706739\n 231446\n 112618\n 75014\n-250593\n 862412\n 657169\n"]}
{"input": [{"role": "system", "content": "Task: Output the correct patch from the given input and malformed patch with incorrect line numbers in Unified Diff format."}, {"role": "user", "content": "Given this input file ```\n550481\n5805\n420810\n891010\n385616\n444523\n886228\n451405\n979223\n105265\n770968\n830129\n```\n and this malformed patch ```\n--- file_891507.txt\n+++ file_891507.txt\n@@ X,X X,X @@\n 550481\n-5805\n 420810\n-891010\n 385616\n 444523\n 886228\n``` output just the corrected patch, nothing else."}], "ideal": ["--- file_891507.txt\n+++ file_891507.txt\n@@ -1,7 +1,5 @@\n 550481\n-5805\n 420810\n-891010\n 385616\n 444523\n 886228\n"]}
{"input": [{"role": "system", "content": "Task: Output the correct patch from the given input and malformed patch with incorrect line numbers in Unified Diff format."}, {"role": "user", "content": "Given this input file ```\n625496\n869955\n623320\n136931\n697635\n327957\n310820\n317489\n174338\n763637\n28132\n991157\n956340\n901946\n93178\n929093\n503735\n756683\n600214\n880008\n```\n and this malformed patch ```\n--- file_604910.txt\n+++ file_604910.txt\n@@ X,X X,X @@\n 869955\n 623320\n 136931\n-697635\n 327957\n 310820\n 317489\n 174338\n 763637\n 28132\n-991157\n 956340\n 901946\n 93178\n``` output just the corrected patch, nothing else."}], "ideal": ["--- file_604910.txt\n+++ file_604910.txt\n@@ -2,14 +2,12 @@\n 869955\n 623320\n 136931\n-697635\n 327957\n 310820\n 317489\n 174338\n 763637\n 28132\n-991157\n 956340\n 901946\n 93178\n"]}
{"input": [{"role": "system", "content": "Task: Output the correct patch from the given input and malformed patch with incorrect line numbers in Unified Diff format."}, {"role": "user", "content": "Given this input file ```\n13777\n470864\n706722\n268961\n981555\n636859\n109180\n923125\n659453\n110649\n859417\n330554\n401691\n13351\n```\n and this malformed patch ```\n--- file_610459.txt\n+++ file_610459.txt\n@@ X,X X,X @@\n 923125\n 659453\n 110649\n-859417\n 330554\n-401691\n 13351\n``` output just the corrected patch, nothing else."}], "ideal": ["--- file_610459.txt\n+++ file_610459.txt\n@@ -8,7 +8,5 @@\n 923125\n 659453\n 110649\n-859417\n 330554\n-401691\n 13351\n"]}
```

</details>
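To make the eval task concrete, here is a rough sketch of how the correct hunk header can be computed deterministically from the input file and the malformed patch. It assumes a single-hunk patch whose old-side lines occur in the file, and the function name `fix_hunk_header` is made up for illustration. It is not part of the PR or of the evals repository.

```python
def fix_hunk_header(file_text: str, malformed_patch: str) -> str:
    """Replace a placeholder '@@ X,X X,X @@' header with real line numbers.

    Assumes exactly one hunk and that the hunk's old-side lines appear
    contiguously in the file (the first occurrence is used).
    """
    file_lines = file_text.splitlines()
    patch_lines = malformed_patch.splitlines()

    # Index of the hunk header inside the patch.
    header_idx = next(
        i for i, line in enumerate(patch_lines) if line.startswith("@@")
    )
    body = patch_lines[header_idx + 1 :]

    # Old side = context + removed lines; new side = context + added lines.
    old_side = [line[1:] for line in body if line[:1] in (" ", "-")]
    new_count = sum(1 for line in body if line[:1] in (" ", "+"))

    # Search the file for the old-side block to learn where the hunk starts.
    for start in range(len(file_lines) - len(old_side) + 1):
        if file_lines[start : start + len(old_side)] == old_side:
            break
    else:
        raise ValueError("hunk does not match the input file")

    # With a single hunk, old and new start lines coincide (1-based).
    header = f"@@ -{start + 1},{len(old_side)} +{start + 1},{new_count} @@"
    return "\n".join(patch_lines[:header_idx] + [header] + body) + "\n"


if __name__ == "__main__":
    # Third sample from the PR's eval data.
    file_text = (
        "550481\n5805\n420810\n891010\n385616\n444523\n"
        "886228\n451405\n979223\n105265\n770968\n830129\n"
    )
    patch = (
        "--- file_891507.txt\n+++ file_891507.txt\n@@ X,X X,X @@\n"
        " 550481\n-5805\n 420810\n-891010\n 385616\n 444523\n 886228\n"
    )
    print(fix_hunk_header(file_text, patch), end="")  # header: @@ -1,7 +1,5 @@
```

The counting that the eval asks the model to do in its head is just the two tallies above (old-side and new-side line counts) plus the position of the first old-side line in the file.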

for lack of a better name, "gpt protocol buffers" by unicomp21 · Pull Request #771 · openai/evals · GitHub (me)

ssmits · March 20, 2024, 11:08pm

More lazy than in January.

raw.works · March 20, 2024, 11:56pm

i’m going to try to find a better way to document this, but my general sense is also that cursor is “dumber” the last ~6 weeks. dumber how? mostly around understanding the context from the @ references. maybe a retriever overload issue, this is a classic problem with RAG — there is a sweet spot for the corpus size.

in the absence of more specific or quantitative comparisons, i can say that qualitatively, i need to be much more careful about the snippets and docs that i add to the context, sometimes starting a new chat if it already got “overloaded” by having a big @ doc from earlier in the chat.

one benefit of this frustration is that i found a better workflow for integrating open-source packages into my apps.

i used to @ the documentation site root and trust that cursor would figure it out. now, i have been having better luck by cloning the repo, adding that repo to the workspace, and then pointing cursor to the source code instead of the documentation.

there is a way that this makes sense intuitively - you remove a step from this game of telephone. instead of “code (source) => natural language (docs) => cursor => code (my app)”, it’s just “code (source) => cursor => code (my app)”. and since LLMs are just parrots, the more working code you feed in, the less you need to wrangle them to write out working code.

myriam · March 21, 2024, 12:26am

Very interesting, thanks for the detailed info.

madbit1 · March 15, 2025, 5:46pm

I feel the same. It even tells me that Cursor AI is now making stupid mistakes, and more often than before.

jokerfool · April 12, 2025, 11:37am

disappointed

cocode · April 12, 2025, 12:43pm

As your project grows, the IDE behaves dumber. It has nothing to do with Cursor. You need to guide Cursor onto the right path.
It is all about steering it. Just like driving a car.
