Hi everyone!
Today, I'd like to talk about the current situation regarding the major performance loss on our instances of SurviveTheNight.
This post goes into a bit more detail on what exactly is going on (the technical side), but I'll try to explain it as simply as possible so that as many people as possible can follow along.
⇒ The Past
As some of you may remember, we had a similar situation before STN went into its 6-month maintenance.
Instances were lagging as soon as they booted up, we had to cap our dungeon instances to save resources, and some players even experienced wipes/dupes when switching instances due to lag and database desync.
Back then, most of the issues were caused by very poorly optimized code running on our main lobbies.
Database connections weren't secure enough and ran sync (on the main server thread) instead of async (on a separate thread, so that any lag they cause isn't noticeable), and the code itself had a bunch of memory leaks.
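To illustrate the sync/async difference, here is a minimal sketch in plain Java. It is not our actual plugin code: `AsyncQueryDemo`, `loadPlayerData` and the player name are made up for the example, and the "query" just sleeps instead of talking to a real database.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncQueryDemo {
    // Hypothetical stand-in for a slow database query (sleeps instead of using a real database).
    static String loadPlayerData(String playerName) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "data-of-" + playerName;
    }

    // Sync: the calling thread (e.g. the main server thread) is blocked until the query returns.
    static String loadSync(String playerName) {
        return loadPlayerData(playerName);
    }

    // Async: the query runs on a worker thread; the caller gets a future back right away.
    static CompletableFuture<String> loadAsync(String playerName, ExecutorService dbPool) {
        return CompletableFuture.supplyAsync(() -> loadPlayerData(playerName), dbPool);
    }

    public static void main(String[] args) {
        ExecutorService dbPool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<String> future = loadAsync("Steve", dbPool);
            System.out.println("Main thread keeps running while the query is in flight");
            System.out.println(future.join()); // join() is only here so the demo can finish
        } finally {
            dbPool.shutdown();
        }
    }
}
```

With `loadSync`, every slow query would stall whatever thread calls it. With `loadAsync`, the slow part happens elsewhere and the caller only deals with the result once it is ready.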
⇒ Current Situation
This time, the lag is once again caused by inefficient code; however, it most likely isn't any sort of memory leak.
Some of you may ask: What is a memory leak?
Basically, any sort of system stores temporary data in the server's memory. It can then access the data from there, remove it, or add new data.
Once certain objects of the code saved in the memory are no longer being used and cannot be accessed, they will automatically be removed from the server's memory to make space for new stuff.
A memory leak is when the system keeps adding certain objects to the memory for future uses but never removes the old ones or still has references to the previous objects somewhere. This can happen easily in some sort of loop, where it keeps adding an object with each cycle but never removes the object from previous cycles.
This effectively ends up stacking unused objects in memory (they still have a reference somewhere and therefore can't be removed), which consumes large amounts of memory. When the server is running low on memory, performance drops noticeably.
However, this isn't the case as of right now (or at least it's not the major reason why we are experiencing performance loss).
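For the curious, the "loop that keeps adding objects" kind of leak described above can be sketched like this. It is a made-up example (`LeakDemo`, `leakyTick` and `cleanTick` are hypothetical names), not code from our plugins.

```java
import java.util.ArrayList;
import java.util.List;

class LeakDemo {
    // Leaky: this static list keeps a reference to every buffer ever created,
    // so the garbage collector can never remove them.
    static final List<byte[]> LEAKED = new ArrayList<>();

    static void leakyTick() {
        byte[] buffer = new byte[1024]; // temporary data for this cycle
        LEAKED.add(buffer);             // ...but the reference is never removed again
    }

    // Not leaky: the buffer is only referenced by a local variable, so it becomes
    // unreachable (and removable) as soon as the method returns.
    static int cleanTick() {
        byte[] buffer = new byte[1024];
        return buffer.length;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            leakyTick();
            cleanTick();
        }
        System.out.println("Buffers still referenced: " + LEAKED.size());
    }
}
```

After the loop, all 10,000 buffers from `leakyTick` (roughly 10 MB) are still stuck in memory, while the ones from `cleanTick` can be cleaned up.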
⇒ Garbage Collector
Our servers and plugins run on Java.
Java has a built-in feature called "Garbage Collection".
As already mentioned when explaining memory leaks, objects that are no longer referenced or used are removed from memory automatically. This is exactly what the garbage collector does.
The garbage collector manages heap memory. This is the part of the memory that stores objects created dynamically by the code.
Heap memory is split up into 2 different sections:
The old generation: This section of the heap memory stores objects that survived a bunch of deletion cycles by the garbage collector. An example would be a static object in the code that contains the name of the plugin. This object would always exist and always be accessible during the entire runtime - would be bad if it wasn't right? haha.
The young generation: This section of the heap memory stores newly generated objects.
In fact, the young generation is split up even further, into 2 different spaces.
The Eden space: This space of the young generation contains objects that were just created. The garbage collector has not gone through them and checked whether they are still being used or not. Most of the objects in that space usually are objects that aren't being used anymore after an extremely short amount of time - so like very short-lived temporary objects.
The Survivor space: This space contains objects that survived at least one cycle of the garbage collector, so objects that have been created recently and are still being used in the code even after one or multiple cycles.
As already mentioned, most of the objects from Eden space don't make it to the Survivor space.
Once the Eden space is full, a garbage collector cycle is triggered to make room for new objects in the Eden space. During that cycle, the application is paused for a split second; however, you usually don't notice that as it doesn't take long.
If an object survives a bunch of cycles while in the Survivor space, it's promoted to the old generation, where garbage collector cycles happen far less frequently because the objects inside have longer lifespans (or at least they are supposed to have longer lifespans).
Because most objects never make it out of the Eden space, the Survivor space is much smaller than the Eden space.
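If you want to see these sections on a running JVM, the standard `java.lang.management` API can list them. This is only a sketch: `HeapPools` and `describeHeapPools` are names I made up, and the pool names printed depend on which garbage collector the JVM uses (they typically contain "Eden", "Survivor" and "Old Gen", but that is not guaranteed).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

class HeapPools {
    // Returns one line per heap memory pool reported by the JVM.
    static List<String> describeHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as Metaspace
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            lines.add(String.format("%-22s used=%d KB, committed=%d KB",
                    pool.getName(), usage.getUsed() / 1024, usage.getCommitted() / 1024));
        }
        return lines;
    }

    public static void main(String[] args) {
        describeHeapPools().forEach(System.out::println);
    }
}
```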
⇒ The Issue
Now that you know how the garbage collection works in Java, it's time to explain what's going on with SurviveTheNight.
It currently seems as if many of the objects created by our plugins survive at least one garbage collector cycle.
Because the Survivor space is much smaller than the Eden space, it runs out of room quickly. Objects that don't fit in the Survivor space are promoted to the old generation immediately, which - as you may have guessed - is not good.
You may now be thinking: why don't we just increase the size of the Survivor space?
While this may sound like a good idea, this doesn't work.
The size of the Survivor space is relative to the size of the Eden space. While we could increase the Survivor space's size, this would decrease the size of the Eden space, which means the Eden space would fill up quicker and more garbage collector cycles would occur.
On top of that, most objects aren't supposed to end up in the Survivor space in the first place. Instead of increasing its size, we have to check our code for objects that live longer than they should.
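To put rough numbers on that trade-off, here is a small arithmetic sketch. It assumes the classic HotSpot layout, where the young generation is the Eden space plus two equally sized Survivor spaces (the JVM swaps between the two; I've been treating them as one above) and the `-XX:SurvivorRatio` setting is the size of Eden divided by the size of one Survivor space. The 1200 MB young generation is a made-up figure, not our real configuration, and collectors that size things adaptively may behave differently.

```java
class YoungGenMath {
    // Size of ONE survivor space: young = eden + 2 * survivor and eden = ratio * survivor,
    // so survivor = young / (ratio + 2).
    static long survivorSize(long youngGenMb, int survivorRatio) {
        return youngGenMb / (survivorRatio + 2);
    }

    // Whatever is not taken by the two survivor spaces is left for Eden.
    static long edenSize(long youngGenMb, int survivorRatio) {
        return youngGenMb - 2 * survivorSize(youngGenMb, survivorRatio);
    }

    public static void main(String[] args) {
        long youngGenMb = 1200; // hypothetical young generation size
        for (int ratio : new int[] {8, 4, 2}) {
            System.out.printf("SurvivorRatio=%d: Eden=%d MB, each Survivor=%d MB%n",
                    ratio, edenSize(youngGenMb, ratio), survivorSize(youngGenMb, ratio));
        }
    }
}
```

Under these assumptions, going from a ratio of 8 to a ratio of 2 grows each Survivor space from 120 MB to 300 MB, but shrinks Eden from 960 MB to 600 MB, so Eden fills up sooner and cycles happen more often.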
Another solution could be increasing the overall memory on the main lobbies.
This isn't a good idea either.
While this does technically increase the size of the Survivor and Eden spaces, it would also just end up consuming a bunch of memory for no reason on the virtual server we're running STN on.
The Survivor and Eden spaces only take up a small part of the overall allocated memory. We would literally have to allocate gigabytes of extra memory to even see a difference in the size of these spaces, and most of that allocated memory would just sit unused either way.
⇒ Conclusion
While we know what's going on now, we're still unable to fix it.
Unfortunately, we haven't been able to identify exactly which objects are inside the Survivor space (or rather, where they are coming from), so we're unable to improve our systems as of right now. Until we can, it's just a bunch of testing and disabling newly added systems, hoping to see a difference and identify the system that is causing the lag.
On top of that, I don't have that much knowledge about how memory works myself. While I have a rough idea of what's going on, I've never had a situation like this (not even a theoretical one) in my past projects or my apprenticeship.
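One way to "see a difference" while toggling systems is to compare how many garbage collector cycles happen over the same period. The sketch below uses the standard `GarbageCollectorMXBean` API for that. It is an assumption about how such a check could look, not a description of our actual tooling. `GcCounter` and the allocation loop are made up, and the loop merely stands in for whichever system is being tested.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcCounter {
    static volatile byte[] sink; // keeps the demo allocations from being optimized away

    // Total number of collections across all collectors since the JVM started.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector doesn't report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate total time spent collecting, in milliseconds.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector doesn't report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long countBefore = totalCollections();
        long millisBefore = totalCollectionMillis();

        // Hypothetical workload standing in for a system under test.
        for (int i = 0; i < 2_000_000; i++) {
            sink = new byte[256];
        }

        System.out.println("GC cycles during workload: " + (totalCollections() - countBefore));
        System.out.println("GC time during workload (ms): " + (totalCollectionMillis() - millisBefore));
    }
}
```

The exact numbers vary from run to run and between JVM setups, so only comparisons made on the same server under similar load mean anything.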
(I could also be wrong about some things in my explanation of the garbage collector, I'm not 100% sure about that either)
I'll keep you updated and once again, we're extremely sorry for the inconvenience.
~ zProxxy