502 Bad Gateway error when uploading files solved

Some users have reported seeing a 502 Bad Gateway error when uploading files to the module library.

Thanks to some troubleshooting help from @Daniel_T_Shaw, we’ve finally determined the cause of the problem and it is now fixed.

If you encounter any problem like this, please let us know and provide all relevant details. We cannot solve problems without you.

If you’re not interested in the technical details, stop reading here.

There are several versions of the HTTP protocol that clients and servers use to talk to each other. While the HTTP/1.1 spec allows a server to reply before it has read all the data the client is sending, most client implementations ignore responses which arrive early. Before Friday, I had not known this crucial fact about how HTTP/1.1 clients actually behave.

The Game Library Service, to which your browser uploads files, sends a 413 Too Large response as soon as it can determine that too much data is being sent. Most of the time it can do this before reading any of the data, by checking the Content-Length header of the request.

The GLS is behind a proxy, which means that your browser is connecting to the proxy instead of the GLS, and the proxy routes traffic onward to the GLS. When the GLS sends back an error response before your browser has finished writing data, and your browser ignores that response, your browser still has an open connection to the proxy and keeps sending data. When your browser finishes sending data, it waits for a reply. Because the browser has kept the connection open, the proxy expects that there will be more traffic. There won’t be, because your browser has finished sending data and the GLS has already replied.

The proxy has a timeout set for idle connections, and when a connection times out, the proxy sends a 502 Bad Gateway response and closes the connection. The timeout error from the proxy is what your browser interprets as the response to the request it sent. This is why, when you try to upload an overlarge file in the library, you see a 502 error after a long wait instead of a 413 almost instantly.
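The early size check described above can be sketched roughly as follows. This is only an illustration in Python, not the GLS code (the GLS is built on axum); the function name `early_status`, the `MAX_UPLOAD_BYTES` value, and the choice to answer 400 for a malformed header are all assumptions made for the example.

```python
from typing import Mapping, Optional

# Hypothetical limit for illustration; the real GLS limit is not stated here.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024


def early_status(headers: Mapping[str, str], limit: int = MAX_UPLOAD_BYTES) -> Optional[int]:
    """Return a status code to send before reading the body, or None to go on reading."""
    # Header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("content-length")
    if raw is None:
        # No declared length (e.g. a chunked upload): the size can only be
        # enforced while the body is being read.
        return None
    value = raw.strip()
    if not (value.isascii() and value.isdigit()):
        # Assumption: treat a malformed Content-Length as a bad request.
        return 400
    if int(value) > limit:
        # The declared size already exceeds the limit, so reply right away.
        return 413
    return None
```

A server that gets 413 back from a check like this replies immediately, without reading any of the upload. That immediate reply is the early response which most HTTP/1.1 clients do not look at until they have finished sending.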

When we were troubleshooting 502 errors in May after switching to the new module library, I did not know that clients would ignore early replies, so the simplest explanation was that the proxy was timing out because the client hadn’t received a reply from the GLS. This is why I increased the proxy timeout several times, which turned out to be counterproductive, because it just made clients wait longer for the proxy to time out.

Recall that there are multiple HTTP protocol versions. A significant difference between HTTP/1.1 and HTTP/2 clients is that HTTP/2 clients correctly handle early responses. Most HTTP clients these days will use HTTP/2 if possible, which ought to solve our problem. As it happens, our proxy didn’t have HTTP/2 enabled. Once I turned HTTP/2 support on, I started getting 413 Too Large responses from the GLS when attempting to upload files over the size limit. :tada:
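If you want to check for yourself whether a server or proxy offers HTTP/2, one way is to look at which protocol gets picked during the TLS handshake (ALPN). Here is a minimal sketch using only the Python standard library; the helper names are made up for the example, and the hostname in the usage note is a placeholder, not our actual proxy.

```python
import socket
import ssl
from typing import Optional


def alpn_context() -> ssl.SSLContext:
    """Build a TLS client context that offers HTTP/2 first, then HTTP/1.1."""
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2", "http/1.1"])
    return ctx


def negotiated_protocol(host: str, port: int = 443, timeout: float = 5.0) -> Optional[str]:
    """Connect to host:port and return the ALPN protocol the server selected, if any."""
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with alpn_context().wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()


def describe(protocol: Optional[str]) -> str:
    """Turn the ALPN result into a human-readable summary."""
    if protocol == "h2":
        return "HTTP/2 negotiated"
    if protocol == "http/1.1":
        return "HTTP/1.1 negotiated"
    return "no ALPN protocol selected; expect HTTP/1.1"
```

For example, `print(describe(negotiated_protocol("proxy.example.org")))` should report HTTP/1.1, or no ALPN protocol at all, for a proxy without HTTP/2 enabled, and HTTP/2 once it has been turned on.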

The two things that cracked this for me were @Daniel_T_Shaw’s bug report and an issue on GitHub for axum, the web framework we use for the GLS. Daniel confirmed for me that he could reproduce the problem and it was happening when he was trying to upload a file that was too large. The other times when people have reported a 502 when trying to upload a file, it might have been happening due to early responses for reasons other than the file being too large. Once I knew the problem was triggered by large files, I was able to reproduce it myself.

This issue on GitHub was opened by someone who was having the same early response problem with axum that we were, and it contains an explanation of what’s happening with early responses. Once I had that in hand, it all made sense and the fix, enabling HTTP/2, was two lines in a config file.

Kudos to @Daniel_T_Shaw and @uckelman for diagnosing the problem. And thanks to @uckelman for the thorough description and remedy. Illuminating.

Reminds me of Linus’ law:

Given enough eyeballs, all bugs are shallow
Raymond, E.S., The Cathedral and the Bazaar, 1999

and why, in a volunteer, open-source project such as Vassal, we help each other out.

Yours,
Christian

Part of why I take the time to write up post-mortems is that I am constantly benefiting from post-mortems written by people on other projects. It’s a rare problem that you can’t find already described by someone else.
