
Testing a Robust Netcode with Godot

2024-10-23

The biggest challenge I faced in developing Little Brats! was the online multiplayer part: synchronizing computers with sometimes significant latency while keeping the “fast-paced action game” feel was far from simple. I'll tell you all about it!

Lag compensation, prediction/reconciliation, etc.

I'm not going to do a detailed tutorial on these points, as there are tons of them already, but to give you an idea of the principle: when a client computer performs an action (in my case, for example, pressing the button to slap another kid), the server will receive this action, calculate what's going on, and send the result back to the client...

The problem is that even with a slight latency between the two computers, say 10ms, you end up with 20ms between pressing a button and receiving the result. It may not sound like much, but if you put a 20ms delay between each press of your keyboard keys and the execution of the resulting action, you're going to lose your mind in no time.

In principle, this is compensated for by several techniques. In my case, the client “validates” its own action by default and applies it immediately in its local scene: this is the prediction. When the server receives the action, it rewinds the game by the duration of the latency (so, for example, 10ms backwards), applies the action, and replays the game universe for the equivalent of 10ms, all in the background, without any of this being visible in the server's own game. It then sends the final state to the client, which either confirms its own state if its prediction was correct, or corrects it if the server returned a different result (this is called “reconciliation”).
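To make the client side of this more concrete, here is a toy sketch of prediction and reconciliation in shell. It is not the code of Little Brats!: the game state is reduced to a single integer position, and the names `predict`, `reconcile`, `pending` and `outcome` are all hypothetical. The server-side rewind is not modelled; the sketch only assumes that each server reply carries the number of the last input it processed and the position that resulted from it.

```shell
# Toy model (assumption): the whole game state is one integer position,
# and an input is a signed move added to that position.
position=0   # what the client currently displays
pending=""   # inputs sent but not yet confirmed, stored as "seq:move" words
next_seq=1   # sequence number given to the next input

# Prediction: apply the input locally right away and remember it.
predict() {  # $1 = move
  pending="$pending $next_seq:$1"
  next_seq=$(( next_seq + 1 ))
  position=$(( position + $1 ))
}

# Reconciliation: restart from the server's state, then replay every input
# the server has not processed yet.
reconcile() {  # $1 = last input seq processed by the server, $2 = its position
  local before=$position kept="" item seq move
  position=$2
  for item in $pending; do
    seq=${item%%:*}
    move=${item#*:}
    if [ "$seq" -gt "$1" ]; then
      position=$(( position + move ))
      kept="$kept $item"
    fi
  done
  pending=$kept
  # If replaying gives back what was displayed, the prediction was right.
  if [ "$position" -eq "$before" ]; then
    outcome=validated
  else
    outcome=corrected
  fi
}
```

For example, after `predict 2` and `predict 3` the client shows position 5. `reconcile 1 2` (the server agrees that the first input leads to 2) sets `outcome` to `validated` and leaves the position at 5. If the client then runs `predict 1` (position 6) and the server answers `reconcile 2 4`, meaning the second move was partly blocked on its side, `outcome` becomes `corrected` and the position is pulled back to 5.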

I'm not going to lie to you: it's very, VERY complicated to implement all this reliably and “invisibly”. In other words, the people playing the game should be as unaware of it as possible: no camera “jumps”, no inconsistencies, no weird stuff... In practice, some of this will always happen, of course: if your client had calculated that you'd managed to hit a kid, but on the server that kid had already moved out of the way, well, you'll see your action “cancelled”, and the kid will carry on as if nothing had happened.

Delays on this network part are what forced me to postpone the release by 2 weeks: my code was far too unstable, and in some cases the game was left broken or in strange, buggy states. Anyway.

The thorny question of testing

Obviously, the best way to test a multiplayer game is with several people. But in the meantime, when you're in the middle of development and iterating little by little, you still need to test “alone” at your PC.

Of course, you can have several machines to test on, but it's still relatively tedious, especially if you have to update the game on all machines every time you modify a line of script.

In short, the easiest thing to do is to run two or more instances on your own PC and create a game on localhost. Godot makes it very easy to launch several instances while keeping the debugger open on each of them, which is extremely handy. All this already makes it possible to test good communication between several instances, and that's no mean feat. Except, of course, that latency between two instances on the same machine is very, very low, and communication between the two instances is 100% reliable (we'll never “lose” a packet within the same computer): we're not at all in real network conditions.

This is where a handy command (on GNU/Linux) comes in: tc, for traffic control settings. Let's take a look at this command:


# tc qdisc add dev lo root netem delay 50ms loss 1%

This command, run as root, “artificially” adds 50ms of latency to local traffic and drops 1% of the packets in transit locally. In this way, we can simulate more or less degraded network conditions while remaining on a single machine, and thus test the robustness of our network code. Pretty handy, don't you think?

To remove this artificial degradation, simply do:


# tc qdisc del dev lo root
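Before launching the game, it can be worth checking that the rule is really in place. Here is a minimal sketch, assuming the `tc` and `ping` tools are installed; `expected_rtt_ms` and `check_netem` are hypothetical helper names. One detail to keep in mind: netem delays every packet leaving the `lo` interface, so a request and its reply are both delayed, and a local ping should report roughly twice the configured delay.

```shell
# Hypothetical helpers, not part of the original post.

# A ping crosses the loopback interface twice (request, then reply),
# so the configured delay is paid twice.
expected_rtt_ms() {  # $1 = delay passed to netem, in ms
  echo $(( 2 * $1 ))
}

# Show the active queueing discipline on lo and the measured round trip.
check_netem() {      # $1 = delay passed to netem, in ms
  tc qdisc show dev lo
  ping -c 5 127.0.0.1 | tail -n 1
  echo "expected round trip: about $(expected_rtt_ms "$1")ms"
}
```

With the 50ms rule from above active, `check_netem 50` should list a `netem` entry and a ping summary close to 100ms.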

Godot, reliable/unreliable

Godot provides a high-level network API that abstracts away low-level network protocols such as UDP or TCP. Little Brats! uses the ENetMultiplayerPeer class, which relies on the ENet library, itself based on UDP.

To explain the difference between UDP and TCP, take a look at one of the many memes on the subject:

[Image: TCP/UDP meme]

Basically, TCP transmits packets reliably but slowly, and UDP unreliably (packets can be lost) but quickly. With its high-level API, however, Godot lets you choose the reliability mode:

reliable: guarantees the arrival of all packets in the order they were sent. This is a layer on top of UDP, with an internal mechanism that checks that packets have been received and resends them if they have not. This can obviously be slower, especially on a degraded network with many losses. This gives us a kind of TCP equivalent, but on a UDP basis.
unreliable: packets can be lost, a kind of “raw” UDP mode.
unreliable_ordered: still unreliable, but at least the order of arrival of packets is guaranteed (a mode I've never used myself).

In practice, how does this work when the game is running? Well, I thought I'd set up a few tests to measure this.

I've set up a simple program that calls a remote function (via an rpc call in Godot) every 4 frames, 50 times in a row. On the client side, we record, for each frame, how many times this call was received. The following diagrams show the number of packets received at a given time (with the packets sent by the server shown below for comparison).

(If you're interested, you can download the Godot project to run the tests yourself.)

Obviously, if we have a perfect network (no latency and no packet loss), we end up with the server's sends almost perfectly synchronized with the client's receives, whether in reliable or unreliable mode:

[Graph: packets sent and received, 0ms latency]

If we add a little latency (50ms), we can see the time lag between the two “combs”. On the other hand, reception remains more or less regular, and once again, there's no difference between reliable and unreliable:

[Graph: packets sent and received, 50ms latency]

Of course, the difference in behavior lies in the addition of packet losses. Here, for example, is the effect of a 1% loss rate when using the unreliable mode:

[Graph: 1% loss, unreliable mode]

And if we push to 5%:

[Graph: 5% loss, unreliable mode]

We can see that some calls are lost. Far more than 1% or 5% of the calls are lost, because a call is made up of several packets (and it only takes one lost packet to cause the entire function call to be lost). On the other hand, the calls that are actually received arrive with good regularity.
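The arithmetic behind this can be sketched as follows. Assuming packets are dropped independently with probability p, and that a call needs n packets to get through (n is a hypothetical parameter here, since the number of packets behind one rpc call is not measured in this post), the chance of losing the call is 1 - (1 - p)^n. The helper name `call_loss_rate` is made up for the example.

```shell
# call_loss_rate P N: probability that at least one of N packets is lost,
# assuming each packet is dropped independently with probability P.
call_loss_rate() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - p) ^ n }'
}
```

With a purely illustrative n of 3, `call_loss_rate 0.01 3` prints `0.0297` and `call_loss_rate 0.05 3` prints `0.1426`: roughly 3% and 14% of calls lost, for 1% and 5% of packets lost.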

What's interesting is what happens in reliable mode with 1% loss:

[Graph: 1% loss, reliable mode]

And with 5%:

[Graph: 5% loss, reliable mode]

Do you understand what's going on? When a call gets lost, Godot will try to resend it until it gets through... I don't know the implementation details, but I imagine that there's some sort of packet indexing and that, on the client side, Godot waits until it has received all the packets in order before calling the functions.

As a result, when a packet is lost and the loss is “fixed”, all the late calls are received at once! This method is therefore very effective in ensuring that no function calls are lost... but it does come at a cost: some packets may arrive very late, delaying subsequent packets!
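This “everything arrives at once” effect can be reproduced with a very small model. The sketch below is not ENet's actual implementation, only an illustration of in-order delivery: it takes the frame at which each packet physically arrives, listed in sending order, and prints the frame at which each call can be handed to the game. The function name `deliver_in_order` is hypothetical.

```shell
# deliver_in_order ARRIVAL...: arguments are arrival frames, in sending order.
# A packet can only be delivered once all earlier packets have been delivered,
# so its delivery frame is the latest arrival seen so far.
deliver_in_order() {
  local latest=0 out="" t
  for t in "$@"; do
    if [ "$t" -gt "$latest" ]; then
      latest=$t
    fi
    out="$out${out:+ }$latest"
  done
  echo "$out"
}
```

For example, `deliver_in_order 0 4 20 12 16` prints `0 4 20 20 20`: the third packet was lost and only gets through at frame 20 after being resent, so the fourth and fifth calls, which arrived on time at frames 12 and 16, are held back and all three fire in the same frame.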

In practice, I use unreliable mode when the server sends the state of the game to clients: if a state is lost, it's not a big deal, and it's more useful for the client to have the next state “well synchronized” than to receive several states at once.

I use the reliable mode for sending client inputs to the server: the server needs to be able to recalculate the state of the game reliably, and it's not acceptable for some client inputs to get “lost”. This may cause a bit of latency, and a bit more work for the server, which will have to “rewind” the game a bit further if an input arrives very late, but that's the price to pay for a stable game.

And of course, it goes without saying that reliable mode is used for everything that requires a guarantee of reception: opening communication between server and client, sending signals such as game start, stop, score, etc.

Going even further?

We already have a good basis for testing with tc, but we can do better (or worse, depending on your point of view): in practice, the quality of a connection between two computers can vary over time (a network that becomes congested, someone playing on a phone in transit, etc.). What happens if, all of a sudden, the latency of one of the computers increases from 15ms to 50ms? Or if we start losing 2% of packets instead of 0%?

To test this, I've set up this little script, to be run as root while the game instances are running.


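#!/bin/bash

# Remove the netem rule when the script exits (for example on Ctrl+C),
# so the loopback interface is not left degraded.
trap 'tc qdisc del dev lo root 2>/dev/null' EXIT

while true; do
  delay=$((RANDOM % 91 + 10))   # latency: 10 to 100 ms
  loss=$((RANDOM % 4))          # packet loss: 0 to 3 %
  # How long these conditions last: 2 to 5 seconds (fractional)
  interval=$(awk -v min=2 -v max=5 'BEGIN{srand(); print min+rand()*(max-min)}')
  echo "Real network simulated with: delay=${delay}ms loss=${loss}%"
  tc qdisc add dev lo root netem delay "${delay}ms" loss "${loss}%"
  sleep "$interval"
  tc qdisc del dev lo root
done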

This script changes the network quality every 2 to 5 seconds (chosen at random), applying a latency of between 10 and 100ms and a loss rate of between 0% and 3%. So yes, this simulates a really variable and rotten network, but after all, if the game runs in rotten conditions, it should run without trouble in decent conditions!
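If you want to change these ranges, the modulo arithmetic is easy to get wrong by one. Here is a small sketch that isolates it; `scale_to_range` is a hypothetical helper, and it takes the raw random value as an argument so that the result can be checked by hand. `RANDOM % 91` yields a value from 0 to 90, which is why adding 10 gives 10 to 100.

```shell
# scale_to_range RAW MIN MAX: map a non-negative integer onto MIN..MAX inclusive.
scale_to_range() {
  local raw=$1 min=$2 max=$3
  echo $(( raw % (max - min + 1) + min ))
}
```

In the script, `delay=$(scale_to_range "$RANDOM" 10 100)` would be equivalent to the `RANDOM % 91 + 10` line, and `scale_to_range "$RANDOM" 0 3` to `RANDOM % 4`.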

I won't show you any graphs for this variant, as you'd have to leave it running for a long time to see anything, and the “comb” becomes too dense, but I think it's an interesting little piece of scripting.

Conclusion

Of course, none of this is a substitute for real multiplayer testing, with people connected from more or less distant cities, if only for the interaction aspect, network considerations aside.

But with a few tc commands, you can already simulate a real, imperfect network, and thus debug and fix a lot of things without leaving the comfort of your single workstation (well, provided you have enough RAM to run several instances of the game).

And for those of you who wanted to understand a little better the concrete behavior of Godot's reliable and unreliable modes, I hope these explanations and graphs have helped you.

See you soon for new adventures :)