How to clean impossible code?

Hi. I am new to Cursor and haven’t used it yet.

I have on my computer a C/C++ file, 9.2 million lines long, all in one file, that someone sent me. It was created with a non-free disassembly-style program that I have not bothered to obtain myself, both because of its cost and because, even if I paid for it, it would most likely not accomplish my objectives.

Nevertheless, some other person did that, and sent me this 9,200,000-line code file. In general, I find the code readable. The file is an interpretation of a 22+ year old video game that was sold to a soulless megacorporation. The megacorporation upgraded the game so that it would self-destruct and replace itself with an inferior product, which still contained substantial pieces of the original in its code, until eventually, through the absolute carelessness of the new people, they updated even this inferior product with an unobfuscated debug build of the game that included its associated PDB files. That is how, using the PDB files, some other person on the internet made this 9,200,000-line code file and sent it to me.

I have computationally identified 5,575 .h and .c files that I wish to break this large code file into, each identified exactly by name, and I believe this is likely an all-encompassing list of the original files.

I have personally never used Cursor and have almost never used VSCode, but I work as a software developer, and recently I have become unable to complete my work because new people were hired at the office to stand next to me and tell me all day how I should be using Cursor and how I have not properly converted “to the AI religion” [sic]. I am preparing for the time of my ideological death, because I know that my desire to keep going was based on the premise that my continued work was valuable, which very soon it no longer will be.

Once my work is no longer valuable and society no longer needs me, I would be rather content to go live alone somewhere and play an open-source Linux source port of the aforementioned video game alone on my computer. It would not be sent to other people, and would probably not need the multiplayer functions. It would just be a way to relax and hope to be left alone. I prefer not to see this as antithetical to the original game or its license agreements in spirit, since it is in the spirit of fun. And whatever I create would be for my computer, not created to undermine the soulless megacorporation, which it cannot do anyway, because they will always have more money than me.

Accordingly, if money is no object, and we assume I can burn any personal wealth I have and then run up my credit cards with debt to get enough money for AI tokens: is it actually possible that, if I buy Cursor and leave it to run indefinitely, it would chop the 9,200,000-line file into the specific 5,575 pieces that I want (a task with a right-and-wrong answer), so that I could then get this compiling from clean source files? I would also like the AI to port the rendering implementation from DirectX to OpenGL, so that I can play this on a GNU PC.

> chop the 9,200,000 line file into the specific 5575 empty headers and source files in this repo

> change the programming implementation from DirectX to OpenGL

These are my prompts, and they are not complicated. Should we expect Cursor to do them, if I buy and download it? How can I get Cursor to do what I want here?


Interesting situation you are in. Setting aside the legality of working with copyrighted code, another big challenge for you would be the context window limits of LLMs. Since this single file is 9M+ LOC, it will not fit into any LLM offered by Cursor as of April 2025, so Cursor is not the right tool for the job.

Only Llama 4 Scout, with its 10-million-token window, and Magic.dev’s LTM-2-Mini, with its 100-million-token window, even come close to handling a 9M+ LOC file, and even then you would face performance bottlenecks, unprecedented costs, and acceptable-use terms and conditions.
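For scale, here is a minimal back-of-envelope sketch in Python; the characters-per-line and characters-per-token figures are rough assumptions, not measurements of your actual file:

```python
# Back-of-envelope: how many tokens is a 9.2M-line C file?
# Both averages below are assumptions; measure the real file instead.

LINES = 9_200_000
AVG_CHARS_PER_LINE = 40   # assumed average for C source
CHARS_PER_TOKEN = 4       # common rough heuristic for code

total_chars = LINES * AVG_CHARS_PER_LINE
total_tokens = total_chars // CHARS_PER_TOKEN

print(f"~{total_chars:,} chars -> ~{total_tokens:,} tokens")
# ~368,000,000 chars -> ~92,000,000 tokens: roughly nine times Llama 4
# Scout's 10M-token window; only a 100M-token window could hold it whole.
```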


I’d recommend you do your own research. A quick query with ChatGPT or Perplexity will get you many helpful resources.

I would split it into chunks and iteratively analyse each chunk. Then I would start classifying the functions in each chunk by their purpose and extract them into their respective facets.
From there, a replication of the whole logic should be possible, right?

On a small scale, this is what I already do with files exceeding 4k lines; in Cursor I always start this process once a file reaches 2.5k lines. A rough sketch of the splitting step is below.
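A minimal sketch of that splitting step, assuming the monolith is named `game_decompiled.c` (a hypothetical name) and a fixed 2,500-line chunk size:

```python
# Split the giant file into fixed-size pieces that individually fit in a
# model's context. The 2,500-line size mirrors the threshold above.

from pathlib import Path

CHUNK_LINES = 2_500
src = Path("game_decompiled.c")       # hypothetical filename
out_dir = Path("chunks")
out_dir.mkdir(exist_ok=True)

buf, idx = [], 0
with src.open(encoding="utf-8", errors="replace") as f:
    for line in f:
        buf.append(line)
        if len(buf) >= CHUNK_LINES:
            (out_dir / f"chunk_{idx:05d}.c").write_text("".join(buf), encoding="utf-8")
            buf, idx = [], idx + 1
    if buf:  # flush the final partial chunk
        (out_dir / f"chunk_{idx:05d}.c").write_text("".join(buf), encoding="utf-8")
```

In practice you would want to cut on function boundaries (e.g., a closing brace in column 0) rather than a fixed line count, so that no function straddles two chunks.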


Right, but the point of paying for an AI-based IDE is that it does the research and solves it all, right? What you’re referring to is just an implementation detail of processing the 9,200,000 lines of code. It is very possible that none, or almost none, of the functions in the file run past 8,000 tokens. Given that it was originally 5,575 files, the files would average 9,200,000 / 5,575 ≈ 1,650 lines of code each. Even if half of the files are near-empty headers, the source files would still average only about 3,300 lines.
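That guess is easy to check without any AI at all. A minimal sketch, reusing the hypothetical `game_decompiled.c` filename, with a brace-depth heuristic instead of a real C parser and an assumed ~4 characters per token:

```python
# Rough check of the guess above: estimate tokens per top-level block by
# tracking brace depth. This is a heuristic (string literals or comments
# containing braces will confuse it), not a real C parser.

def function_token_estimates(path):
    sizes, chars, depth, in_block = [], 0, 0, False
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if not in_block and depth == 0 and "{" in line:
                in_block = True           # entering a top-level body
            if in_block:
                chars += len(line)
            depth += line.count("{") - line.count("}")
            if in_block and depth == 0:
                sizes.append(chars // 4)  # assumed ~4 chars per token
                chars, in_block = 0, False
    return sizes

sizes = function_token_estimates("game_decompiled.c")  # hypothetical name
print(f"{len(sizes):,} top-level blocks, largest ~{max(sizes):,} tokens")
```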

In general, I would hope an LLM does not need to load the entire 9 million lines into its context window just to decide where a particular function should live. When I was looking at one of the game’s string-manipulation functions, called something like StrVPrint or some name like that, used to format a string, I pasted it into Claude or ChatGPT and it immediately identified the company the function originated from, and maybe even which game it came from. I asked it how it knew that and where it sourced its information from, and then, because of RLHF and machine torture and the necessity for lying, it became extremely, extremely evasive and said it had no idea why it would have thought of any particular game or company.

But presumably one possibility is that these models were already trained on private GitHub repos and the like, so they might already know where the functions go just from context clues.

Accordingly, why can Cursor not be implemented to identify, piece by piece in a large code file, how to organize those pieces into small code files? Why is this hard?
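In principle it isn’t, and it doesn’t require the whole file in context. A sketch of the piece-by-piece idea, using the OpenAI Python client as a stand-in for whatever model Cursor calls; the model name, prompts, and file names are all assumptions, not how Cursor works internally:

```python
# Ask a model, one function at a time, which of the 5,575 target files
# the function belongs in.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# The computationally identified list of original file names.
# Sending all 5,575 names each call costs tens of thousands of tokens;
# in practice you would shortlist candidates first (e.g., by name prefix).
target_files = open("file_list.txt").read().splitlines()  # hypothetical

def classify(function_source: str) -> str:
    """Return the target file name the model picks for one function."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works
        messages=[
            {"role": "system",
             "content": "You sort C functions into files. Reply with exactly "
                        "one file name from the candidate list, nothing else."},
            {"role": "user",
             "content": "Candidates:\n" + "\n".join(target_files)
                        + "\n\nFunction:\n" + function_source},
        ],
    )
    return resp.choices[0].message.content.strip()
```

Each call only needs one function plus the candidate file list, which is the point: the 9,200,000 lines never have to fit in a context window at once.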


First of all, what kind of maniac built such a thing in the first place? It would take over 8 years of monk-mode typing just to enter the code on a keyboard. RollerCoaster Tycoon is approximately 600,000 lines of x86 assembly, not C, and Linux is approaching a few million lines across its different components, and that took a large team of highly skilled systems programmers 30 years. There is a huge gap between what this project demands and the ability of current LLMs (a maximum of 1M tokens for the best models, and possibly 200k at most in Cursor or IDE plugins).

Here is an example of my largest Rust file.

Summary Table

| Metric | Estimate |
|-------------------------------|----------|
| Lines | 2,243 |
| Avg. chars/line | 60 |
| Total characters | 134,580 |
| Tokens (est. 3.5 chars/token) | ~38,450 |

This file is problematic for Cursor because the attention of most models goes out the window at around 20k tokens, or about 1,000 lines of code. So THIS, right now, is the limiting factor for your project: NOT ALL MODELS CAN UPDATE EVEN THIS ONE FILE SUCCESSFULLY. And it is only a quarter of the project; it contains an API client and support code for a single remote endpoint.

I propose another, NEW project. Take all the graphics, docs, text files, manuals, reviews, screenshots, folder listings and so on, and create a NEW game definition. Then cycle across each of these resources, and then the source code files one by one, and ask the AI to check whether each resource has been documented correctly in the new DOCUMENTATION.

Afterwards, ask it to refine and order the documentation and features in a way that is easy to implement, and to identify which files match which features or elements (so it knows where to look when implementing each feature).

Then, supposing you finish parsing all the files after a few months of doing this alone, and supposing LLM agents and their wrappers can by then take in more context and work unattended for over 8 hours, you can begin implementing the game concept to match the original. This is the cheapest and only accessible path at the moment, possibly $40-50k to a suitable solo-built product. A rough sketch of the documentation-checking loop is below.
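A minimal sketch of that loop, again with the OpenAI Python client standing in for whatever agent you use; every path, model name, and prompt here is an assumption:

```python
# Documentation-first loop: walk every resource, ask the model what the
# design doc is missing about it, and append the answer.

from pathlib import Path
from openai import OpenAI

client = OpenAI()
doc = Path("GAME_DESIGN.md")          # hypothetical living design doc
doc.touch()

for res in sorted(Path("resources").rglob("*")):  # hypothetical folder
    if not res.is_file():
        continue
    # Binary assets (screenshots, graphics) would need a vision model;
    # this sketch only makes sense for text-like resources.
    snippet = res.read_bytes()[:4000].decode("utf-8", "replace")
    answer = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[
            {"role": "system",
             "content": "You maintain a game design document. Given the doc "
                        "and one resource, reply ONLY with facts about the "
                        "resource that the doc is missing, or the word OK."},
            {"role": "user",
             "content": doc.read_text() + "\n\nResource: " + res.name
                        + "\n" + snippet},
        ],
    ).choices[0].message.content.strip()
    if answer != "OK":
        doc.write_text(doc.read_text() + f"\n\n## {res.name}\n{answer}\n")
```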

If you consider that modern programmers cost about $2 per line of code, 9.2M lines works out to roughly $18-20M: you could hire a team for that and put them to work for 4-5 years cloning the game.
