When running an open world server like kadcon.de you always run into problems no one else encounters or bothers to fix, as they either happen too infrequently or your gameworld is easy to fix (minigames).
Unfortunately with 100+ users on the same Minecraft process and a modifiable world these problems often have a bigger impact. There were several problems in the past, so lets just start with one I recently found.
While upgrading the servers to 1.9.x (Spigot) there were several problems in regards to the chunk saving/loading. After fixing several of those bugs there was still one left:
About every 2-5 days some chunks in the world would be re-generated, as if they were never loaded before. Everything the player built before was gone. Luckily this happened very infrequently (<1 Chunk a day) and with good backups recovery was no problem.
But on a survival server this a critical and game breaking bug. Either you have to waste hours of your day recovering plots from backups or tell the owner of that plot "yeah...sorry, hope you still keep playing on our server <3" and rightfully get flamed on the forums in the end.
So there was not really an alternative to fixing this bug myself as it affects all my current and future projects.
Approaching an unknown bug
As you might imagine this bug is tricky. We have nearly no information, we have no way to reproduce the bug and it happens so infrequently we cannot actively test for it. Additionally the bug seems to only occur on highly populated servers - so testing on a test-server might be out of the question too.
In situations like these the only thing you can start from is your knowledge about the code and where this kind of problem could possibly occur. After searching for some days and fixing possibly related bugs (SPIGOT-2303,SPIGOT-2315) I found something suspicious in the RegionFileCache Minecraft uses to load *.mca files from the disk. If a limit of 256 open files would be reached it would clear the cache and re-open all the files it would need. This seemed like a very good candidate to fail very infrequently and only on a high playercount as 256 open *.mca files equals to an area of 8192x8192 blocks in minecraft - a huge amount.
I started to adding debug codes. I knew that:
- The bug had to generate a new chunk
- The bug would occur again on the live-servers
- The bug would probably happen directly before/after calling the cleanup code in RegionFileCache
Two days later a user reported that two plots were regenerated. Fortunately my debug-codes were at the right places and validated my assumption. The file-cache was cleared a millisecond before generating a new chunk exactly at the location reported by the player.
[Server]: DEBUG: RegionFileCache is full! Cleaning up 1465413562217 [Server]: DEBUG: RegionFileCache is full! End 1465413562218 [Server]: ---------------DEBUG START--------------- [Server]: Debug: Generating [-80, -132] in world world [Server]: java.lang.Thread.getStackTrace(Thread.java:1552) [Server]: net.minecraft.server.ChunkProviderServer.originalGetChunkAt(ChunkProviderServer.java:158) [Server]: org.bukkit.craftbukkit.chunkio.ChunkIOProvider.callStage2(ChunkIOProvider.java:43) [Server]: org.bukkit.craftbukkit.chunkio.ChunkIOProvider.callStage2(ChunkIOProvider.java:1) [Server]: org.bukkit.craftbukkit.util.AsynchronousExecutor$Task.finish(AsynchronousExecutor.java:188) [Server]: org.bukkit.craftbukkit.util.AsynchronousExecutor.finishActive(AsynchronousExecutor.java:347) [Server]: org.bukkit.craftbukkit.chunkio.ChunkIOExecutor.tick(ChunkIOExecutor.java:57) [Server]: net.minecraft.server.MinecraftServer.D(MinecraftServer.java:785) [Server]: net.minecraft.server.DedicatedServer.D(DedicatedServer.java:403) [Server]: net.minecraft.server.MinecraftServer.C(MinecraftServer.java:704) [Server]: net.minecraft.server.MinecraftServer.run(MinecraftServer.java:603) [Server]: java.lang.Thread.run(Thread.java:745) [Server]: ---------------DEBUG END-----------------
With this information it was clear that there was a problem with the code. The search continued for a while but in the end I found the problem. The backgroundservice 'FileIOThread' gradually saves already unloaded data to the disk. This process accesses the RegionFileCache asynchronously. In addition the Minecraft process itself accesses RegionFileCache quite often to validate a part of the map is generated. Two methods in RegionFileCache were giving out references to files and you might imagine doing this in a multithreaded environment is not the best idea.
By forcing minecraft to clear the cache very frequently I was able to reproduce the problem. After that the fix was easy: Change the implementation of RegionFileCache to no longer hand out file-references but the requested data directly.