
How I (almost) fucked up half a day's work

2023-02-09

A quick account of a nasty blow I took this morning... and of how I ended up fixing it. This isn't one of my usual making-of articles, but I think this experience could be useful to others.

The context

As I've mentioned before, the game's logic is written in YAML files. Since there are a lot of them, with plenty of room for mistakes, I wrote a small script that reads the game's content and “validates” it against a set of rules.

Basically, I test a lot of things like:

if a level needs a “bottle” object, check that data/objects/bottle.yaml exists;
verify that the skin of this object exists;
for each action on the object (picking it up, looking at it), check that the script calls functions that exist, with well-formed arguments (for example, the function talk: [] takes 2 parameters: the character speaking and the sentence spoken);
for spoken sentences, check that the translations exist for supported languages;
etc.

Some checks are done at runtime, but for the vast majority it makes more sense to run them in a separate script, for the simple reason that the game's content doesn't change from one run to the next, so there's no need to test everything every time the game is loaded: validating the levels once before distributing the game is enough.
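As an illustration, the first rule of the list could look something like the shell function below. This is a hypothetical sketch, not the actual validation script: the name check_objects is made up, and it assumes the object names have already been extracted from the level files, which it does not parse.

```shell
# Hypothetical sketch of the first rule: every object name given as an
# argument must have a matching data/objects/<name>.yaml file.
# Parsing the level YAML to find those names is deliberately left out.
check_objects() {
    missing=0
    for name in "$@"; do
        if [ ! -f "data/objects/${name}.yaml" ]; then
            echo "error: data/objects/${name}.yaml does not exist" >&2
            missing=1
        fi
    done
    # Non-zero exit status if at least one object file is missing
    return "$missing"
}
```

For example, check_objects bottle key would print an error and fail if data/objects/key.yaml is missing, which makes it easy to chain with other checks in a script run once before each release.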

Aaaaaanyway.

It turns out that, recently, quite a few things have changed in the game engine and in the data formats. The validation script was therefore outdated: it no longer tested the right things, and it didn't test some recent features at all.

As the script was starting to get a little messy, I spent a few hours reorganizing it, updating it, cleaning it up, and making it functional again. It was long, it was a pain, but in the end it turned out great, it worked well, and I was happy.

The fuck-up

The very first and arguably most important screw-up: I didn't commit the changes. I was busy running the script and correcting the levels according to the errors found, and I forgot. Big mistake.

A few days go by; I forget about it and work on other things. At some point, I implement a feature to test something in the game, but I'm not sure I want to keep it. What do I do? I figure I'll store it on the remote Git repo for now and erase it locally. I commit it, I push it...

And that's where the second huge screw-up happens: to get my branch back to its state before this unwanted commit... I run git reset --hard [the_previous_commit]. I use this command a lot; it's great for cleaning up a branch and resetting it to a given commit.

Except that ...

Except that the changes I had made to my validation script were not committed. So they weren't “registered” anywhere, neither locally nor on the remote branch... and they were among the things my git reset --hard “cleaned up”.
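For the record, git reset --hard resets both the index and the working tree of tracked files, so any uncommitted change to them is discarded without warning. Two cheap safeguards are sketched below, assuming a POSIX shell with Git available: a wrapper that refuses to reset while tracked files have uncommitted changes (the name safe_reset is hypothetical), or Git's own git reset --keep, which only updates files that differ between the two commits and aborts if one of those has local changes, so an unrelated uncommitted edit survives.

```shell
# Hypothetical helper: refuse to run `git reset --hard` when tracked files
# have staged or unstaged changes that the reset would silently discard.
# Untracked files are ignored here, since reset --hard leaves them alone.
safe_reset() {
    if [ -n "$(git status --porcelain --untracked-files=no)" ]; then
        echo "refusing to reset: uncommitted changes in tracked files:" >&2
        git status --short --untracked-files=no >&2
        return 1
    fi
    git reset --hard "$@"
}

# Alternative with plain Git: keeps local changes to files that are the
# same in both commits, and aborts instead of overwriting the others.
# git reset --keep HEAD~1
```

With --keep, the situation above (drop the last commit, keep an unrelated uncommitted edit) works as intended, while the wrapper simply stops and lists what would have been lost.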

I only noticed a few days later when, having modified a few levels of the game, I wanted to run the validation script again... and realized that it had reverted to the old, obsolete version.

Honestly, I understood the problem pretty quickly, and I was on the verge of tears and of a nervous breakdown. I knew for a fact that I had fucked up real good and that I could only blame myself. I had spent several hours, easily half a day, updating this script, and it was all lost while my schedule was already tight. I was going to have to do it all over again: a long, painful task that would probably take just as much time, making the same mistakes again, etc.

The miraculous rescue

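I searched the internet a bit for ways to recover data after such a screw-up, but to no avail: even the famous git reflog, which can rescue a lot of stuff, can't do anything here, because it only records where branches and HEAD used to point, and the lost changes had never been staged, let alone committed.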
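The "unstaged" part is what makes the difference: as soon as a file is staged with git add, Git stores its content as a blob in the repository, even if it is never committed. After a git reset --hard, such a blob is no longer referenced anywhere, but git fsck --lost-found can still write it out as a plain file. The sketch below reproduces this in a throwaway repository (file names and contents are made up); it would not have helped here, precisely because the script had never been staged.

```shell
# Throwaway repository; every name in here is made up for the demo.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name demo
git config user.email demo@example.com

echo "old version" > validate.txt
git add validate.txt
git commit -qm "old version"

echo "new version" > validate.txt
git add validate.txt        # staged: Git now stores a blob with this content
git reset -q --hard HEAD    # the staged change vanishes from index and disk

# Unreferenced ("dangling") blobs are written as plain files here:
git fsck --lost-found
grep -rl "new version" .git/lost-found/other/
```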

And then, suddenly, I remembered that a few years ago I had managed to recover a lost text with a simple command...

Yes, because you probably know this: when you delete a file, its contents are not “really” erased from the disk; only the file's index entry is removed. The space it occupied is simply marked as “free” or “available”, but as long as nothing new has been written there, the content remains unchanged. Also, with journaling and redundancy mechanisms, the content of a file can end up being copied to several places on the drive.

Well, strictly speaking, I didn't delete the file: Git replaced its content with an old version. But I figured that internally it must work in a similar way, so I gave it a try.

The magical command? Here it is:


$ grep -a -C 2000 "global_objects\.add" /dev/sda2 | tee recovered_data

Yes, it's just grep. The magic of GNU/Linux: no need to install complex, annoying data recovery software; it's just a simple command available everywhere.

Some explanations:

grep, as you may know, searches for a string in a file and prints the matching lines, with some lines of context if needed;
-a is an option which tells grep to treat binary files “as if” they were text files: without it, grep would only report that a binary file matches instead of printing the lines, but with it you can even search for a string in the bytes of a JPG image;
-C 2000, “C” for context: if the string is found, I want grep to print the 2000 lines before and the 2000 lines after the line where it appears. As my script was around 1000 lines long, I used a wide range. You can also play with -B (for before) and -A (for after) to get an asymmetric context;
"global_objects\.add": a string of characters that I remembered was present in the script (the backslash escapes the dot, which would otherwise match any character). If you lost a text, try to remember a specific title or sequence of words. In my case, I remembered that at some point I was filling a set named global_objects;
/dev/sda2: the magic thing is that grep can read a whole partition, since it is exposed as a file! Here, it's the partition that holds my home folder;
tee recovered_data: this is optional, but it both displays whatever grep finds and saves it to the recovered_data file.

Simply put: we tell grep to scan the entire hard drive (or partition) as if it were one big text, and whenever it finds the string global_objects.add, to show us the 2000 lines of text before and after it.
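Before pointing grep at a real partition, the technique can be rehearsed on an ordinary file standing in for one. The sketch below builds a small fake “disk image” (the file and its contents are made up) and uses -B/-A for an asymmetric context. Two practical points, independent of the demo: reading a raw device such as /dev/sda2 usually requires root, and the output is best written to a different drive, because writing recovered_data onto the partition being searched could overwrite the very free blocks that still hold the lost text.

```shell
# A made-up stand-in for a partition: binary junk around a "lost" script.
img=$(mktemp)
{
    printf '\000\001\377junk\000\n'
    printf 'objects = load_objects()\n'
    printf 'global_objects.add("bottle")\n'
    printf 'check(global_objects)\n'
    printf '\376more junk\377\000\n'
} > "$img"

# -a: treat the binary data as text; -B 1 / -A 1: one line of context
# before and after the match (the dot is escaped to match a literal ".").
grep -a -B 1 -A 1 'global_objects\.add' "$img" | tee recovered_data
```

Only the three middle lines come out, the junk lines around them are left behind. Swapping -B 1 -A 1 back for -C 2000 and the image for the real partition gives the command above.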

Of course, it takes a while, as the hard drive has a capacity of about 800 GiB... but after about twenty minutes... bingo, the script appears. Well, it's surrounded by a mess of meaningless bytes (again, I asked for 2000 lines of context on each side), but that's quickly removed, and I got my beautiful lost file back, undamaged. Half a day's work saved with a little grep.
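The cleanup was done by hand, which is probably the sensible approach. As a possible first pass, assuming the script is plain ASCII, the NUL bytes and other non-printable junk can be stripped with tr before trimming the surrounding lines manually. The function name clean_recovered is hypothetical; note that in the C locale it also deletes the bytes of any non-ASCII (UTF-8) characters, such as accented letters.

```shell
# Hypothetical helper: keep printable ASCII, tabs and newlines; drop the rest.
# Usage: clean_recovered recovered_data cleaned_script
clean_recovered() {
    LC_ALL=C tr -cd '[:print:]\t\n' < "$1" > "$2"
}
```

Junk that happens to be printable survives this pass, so the first and last lines still need to be trimmed by eye.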

Joy, happiness, relief. Oh, how I love GNU/Linux and its little tools that can work miracles.

Conclusion

I'm not going to tell you to always make backups or to make sure you commit your changes: I already do that. I'm careful, all the source code is archived on Git, I make regular backups... but there you go: a bit of fatigue, a moment of inattention, and a fuck-up happens fast. And sometimes, when it does, it really sucks. I checked all my backups: none of them dated from between the script update and the overwrite.

Just be aware that if you lose a text file and notice it relatively quickly (after a while, the freed disk blocks will of course end up being overwritten), you may be able to recover it with a simple grep. It's easy, it's fast, and it worked for me. It can save your life.

(And if you lose more complicated things like images, there is also software to recover them, but it is of course less easy.)

That was my little account of the day's emotional rollercoaster. Now that it's off my chest, I'm going back to testing my levels (and yes, of course, as soon as I recovered it, I committed that damn validation script).