The Legend of #155

*tap* *tap*

Is this thing still on?

It's been a while. Over two years, to be precise. Which, coincidentally, happens to be about the age of the oldest open ticket on my issue tracker for my freelance client. But don't let that fool you: this particular thorn in my side has existed effectively since the day I started with them, way back in 2016. This is the story of issue #155.

Some Background

The client in question operates a heavily customized Minecraft server network. When I first started with them, my role was to migrate the existing software suite from the Bukkit API (a modified version of the Minecraft: Java Edition server software) to the then-somewhat newfangled SpongeAPI (another modified version of the server software). This software suite included a system that provides the ability to show players a custom interactive GUI through the hacky use of inventories, known as menus. Basically, each "button" was actually an in-game item with a listener for click events hooked up to it, so that the software could execute the action attached to it. These actions were defined using a separate action system capable of parsing different types of actions from a user-specified configuration (e.g. run a command, give a player money, open another menu). All of this comes together to allow for completely data-driven user interfaces.

When all the porting was said and done, reports of an elusive issue regarding our custom menus started coming in. Once in a while, players would experience an glitch where menus would break in bizarre ways. The player could open the top-level menu and click on a button to open a sub-menu, but then everything would go haywire. Buttons would execute the wrong actions, or they'd do nothing, or in some cases, the player could pick them up and move them like normal items! Furthermore, the only way to resolve the issue was for the player to log off and log back on. But with no real leads on reproduction steps, there was very little that could be done without grasping at straws. I attempted several speculative fixes that involved things like delaying the action execution (we had encountered weird inventory issues before caused by various bugs in Sponge regarding timing), but to no avail. Thankfully, it occurred infrequently enough that it was able to eventually slip to the backburner.

Fast-forward about a year to November 2017, and a potential pattern to what was now christened issue #155 has been discovered. Following a player report suggesting that the issue is somehow related to dying in-game, we worked out that it occurs consistently every time a player dies. We've found our first clue, but without any other leads on where to start looking and with further investigation still proving fruitless, it's eventually shelved again to make time to address more important issues and features. It's still occurring infrequently enough that it's relatively low priority.

Renewed Interest

Disclaimer: From here forward, this post is going to become much more technical.

A few more months go by, and it's now March 2018. The number of issues on the tracker has dwindled quite a bit, and I decide to revisit #155 with some new ideas on how to approach it. I develop a small mod which injects logging code into Minecraft network manager and dumps information on all incoming and outgoing packets to disk. I run two trials: first, I interact with the menu interface as one normally would. Then, I kill my player, respawn, and try it again. I'm left with a sequence of packets for each trial and I get to work inspecting them.

Minecraft specifies two packets called SPacketOpenWindow and SPacketCloseWindow. SPacketOpenWindow is sent from a server to a client to instruct the client to display an inventory. SPacketCloseWindow is sent to the client to indicate a particular inventory should be closed. Both packets contain (among other information) a "window ID," a semi-unique number defined by the server for the purpose of identifying specific windows. This ID is generated by the server when opening an inventory, and the client is informed of it via the SPacketOpenWindow packet. This way, the client and server agree on which window they're talking about when communicating updates. When the server needs to instruct the client to open a window, it first sends SPacketCloseWindow to close the current window, then SPacketOpenWindow to open the new one. Keep this in mind, because it's important to understand the significance of the following observation.

While inspecting the packets for each trial, I noticed a discongruity. In the first trial, the server first sent SPacketCloseWindow with the window ID 0 (this is a magic value indicating the player inventory). This makes sense, since no external inventory was open and thus the active inventory was the player's own. Then, the server sent SPacketOpenWindow to open the menu with window ID 1, since it is the first external inventory being shown to the player following login. When I clicked a button to open the next menu, the server sent SPacketCloseWindow with window ID 1 to close the current menu, then SPacketOpenWindow with window ID 2 to open the sub-menu.

In the second trial, the server largely followed the same procedure, with one notable difference: the second SPacketCloseWindow had window ID 0 instead of 1. This means that the server was instructing the client to close the player inventory instead of the initial menu. It was obvious at this point what the immediate cause of the bug was: because the client was never instructed to close window 1, it technically still had it open. This would explain the buttons failing to function on the second menu, since the client became confused about which inventory was actually active and thus the server-bound packets corresponding to buttons being clicked contained incorrect information. (Note: I'm not terribly familiar with the clientside code, so the exact behavior is my inference - I'm not sure what precisely is happening under the hood.)

While the direct cause was clear, what was less obvious was why this was occurring. I did a bit of preliminary investigation into the inventory close routine on the serverside, but at this point I had been working for several hours with fairly little to show for it. I suspected that it had its roots somewhere deep in Sponge's inventory API, which has historically been a source of strange and obscure bugs. In this case, it would likely be several more hours of inspecting Sponge's inventory code, which tends to be obscenely complex and difficult to understand. After consulting with my client, we again shifted #155 to the backburner since it was increasingly becoming both a time- and money-sink, and was still a relatively minor problem. We spent the next several months quietly praying that Sponge would patch it, revisiting the issue only a couple times, briefly, between then and the present day.

The Crusade

A year and a half later, it's now September 2019. Issue #155 is now over two years old, but initial reports go back almost three years. The tracker is mostly clear of major tickets, with those still open pending QA testing. I'm looking through the backlog for something to work on since I have several hours to spare, and settle once again on #155. At this point, I'm determined to squash it once and for all.

Unfortunately for me, I discovered that I had somehow lost the the original packet logger, presumably between OS reinstalls (how I failed to commit it to source control is beyond me), so a rewrite is in order. I run my trials again and replicate the same results as from the previous year, with SPacketCloseWindow containing the wrong ID of 0.

I add some more logging code to print the current window ID at various points in the process of closing and opening menus, and another incongruity becomes apparent: the window IDs for the two menus are no longer sequential. The first has an ID of 1, as expected, but the next has an ID of 4. The next ID is determined based on the value of a field in the player class, EntityPlayerMP, called currentWindowId. When the server is preparing to open a window on a client, it first increments this value via the getNextWindowId method in the same class. As far as I can tell this is the only place where currentWindowId is updated, so I inject some logging code into it to figure out what's causing the extra two increments. To my surprise... it's only called once between menus. I start searching the Minecraft source to see if there's another routine that updates it that I somehow missed. I search through Sponge's code as well as Forge's, but find nothing. I'm at a loss.

Unless... the two values of currentWindowId aren't coming from the same variable at all. That is, they're coming from different instances of EntityPlayerMP. This is a bit of a long shot, since there's no good reason for two instances of EntityPlayerMP to exist when only one player is online, but it's the only explanation I can come up with. I'm aware that the instance of this object for a player is deleted and re-created after the player dies, so it's possible that the old instance is being cached somewhere without being updated properly on death. Some additional logging confirms that the two calls to the inventory opening routine are operating on different instances. The question is, why?

Debugging where the old instance is coming from isn't quite as straightforward as you might expect. Due to the design of our software, execution passes through three different plugins when translating from an inventory click to a new menu being opened. First, the menu plugin receives the click event, looks up the relevant button, and executes the action attached to it. The action execution is handled by a separate core framework. The action in question happens to be a command execution, with the command being defined by a third plugin responsible for loading and building the custom menus (through the menu plugin), and also provides commands to open specific menus as defined by a configuration. This third plugin intercepts the command, determines what menu is attached, and instructs the menu plugin to open it. Confused yet? Because I sure was. To make matters worse, this execution path passes through schedulers at multiple points, rendering a stack trace at any given point woefully incomplete.

To make a long story short, I was able to narrow down the changeover from the correct instance to the invalid one to a call within the action execution routine. (This took longer than it normally would since many of our method names are legacy and thus not entirely true to their actual functionality, which caused me to overlook the culprit call.) Essentially, while the action was being executed, the EntityPlayerMP object was converted to a User object defined by the core framework, with the object being retrieved from a cache populated when the player joins the server. This User object is then passed to another method which translates it back to an EntityPlayerMP (an example of the implementation outgrowing the API, but the API staying fixed for compatibility). The translation back to an EntityPlayerMP is performed by a method in User called getPlayer which returns the EntityPlayerMP stored in its state when it's first created. So, the User dutifully returns the EntityPlayerMP which has most definitely not been updated upon the player's death and is thus now invalid.

For those not familiar with Minecraft modding, storing player objects as persistent state is a huge no-no for pretty much this exact reason - the canonical way is to store the player's UUID instead and look up the player object as it's needed. I'm not sure how I overlooked this while performing the initial port to Sponge, but this was both a huge moment of clarity and a huge moment of disappointment that it was something so simple and obvious. I'm not sure why this wasn't an issue before we moved to Sponge - my best guess is that it was triggered by a change to how Minecraft handles inventories. In any case, it's not worth the inevitably huge amount of time required to figure out exactly why.

After spending about 5 minutes reworking the User class, I tested it out and sure enough, the bug was gone. Three years of waiting and grasping at straws, all for that.

This is an extreme example of a common phenomenon in software development, wherein a seemingly bizarre and obscure bug can be traced back to just a few lines of innocuous-looking code. In this case, it was a cascading failure, where a small piece of invalid state cascaded into a massive desynchronization between server and client. This case is particularly funny to me, because while #155 was largely disregarded for the time that it was open, it was always looming on the back of everyone's mind over the years. Every time I opened the issue tracker I was greeted by it, sitting down at the very bottom of the page. And yet, after all that, it was patched out with a ~20 line commit and no fanfare.

Posted by Max Roncace on