As a lot of y’all probably noticed, AO3 had some extended downtime this past weekend.
They upgraded their database software on Saturday. The first crash happened with Sunday-evening traffic. AD&T (the committee of OTW volunteers who manage the AO3 software) painstakingly nursed the servers back onto their feet. Then they crashed again under Monday-evening traffic. (I’m describing this in US time zones, because it correlated with “the spike when USians get out of school/work and start opening fic en masse.”)
Official AO3 social-media posts haven’t named the software. I assumed that was because it’s a third-party vendor, and they (very reasonably!) don’t want the place getting deluged with angry emails from AO3’s wankiest users. But at this point, there’s other public confirmation that it’s MariaDB.
AO3’s public Jira board had two new tickets created while they were dealing with the downtime. There’s a helpful breakdown of the tech implications by siropsalot on Bluesky. In short:
“Audits cleanup job” – AO3 has been storing certain logs in a single giant table that updates forever and never gets archived or cleared, which is fine if you’re a small or low-traffic project, but bad if you’re one of the top 100 highest-traffic sites on the internet. (This is part of a long pattern of AO3 being, ah, poorly-designed for the scale of traffic it gets.) This ticket is to create a regular clearing-up process.
“Patch Devise to prevent excessive audits” – One specific user has a buggy older browser, which generated over 2 million entries in the giant table just for them. This ticket is to patch against that specific edge case.
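To make the second ticket a little more concrete: I don't know what the actual patch does, but one common way to stop a single misbehaving client from flooding a log table is to skip repeat entries for the same user and event inside some time window. Here's a minimal sketch of that idea in Python (AO3 itself is a Rails app, so this is purely illustrative). The class name, the one-hour window, and the event labels are all made up for the example.

```python
import time


class AuditThrottle:
    """Skip repeat audit entries for the same (user, event) within a time window.

    Hypothetical helper for illustration only; not AO3's or Devise's code.
    """

    def __init__(self, min_interval_seconds=3600.0):
        self.min_interval = min_interval_seconds
        # (user_id, event) -> timestamp of the last entry we agreed to store
        self._last_written = {}

    def should_record(self, user_id, event, now=None):
        """Return True if a new audit row should be written for this event."""
        now = time.time() if now is None else now
        key = (user_id, event)
        last = self._last_written.get(key)
        if last is not None and now - last < self.min_interval:
            # Same user, same event, too soon: drop it instead of logging again.
            return False
        self._last_written[key] = now
        return True
```

With something like this in front of the table, a browser that re-sends the same request thousands of times would produce one row per window instead of one row per request.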
(Denise was pretty alarmed by the first one, because the proposed fix might delete data the OTW is legally required to keep. That’s a tangent; this is mostly a post about the tech problems, and I’m just throwing it in because it seems worth knowing.)
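For a rough picture of what a "regular cleanup process" for a giant log table usually looks like, here is a small sketch. It is not AO3's job: it uses Python's built-in SQLite instead of MariaDB, and the `audits` table, its `created_at` column, and the batch size are assumptions for the example. The idea is to delete rows older than a cutoff in small batches, so that no single delete has to chew through millions of rows at once. The cutoff is exactly the part that needs a policy answer about what has to be kept.

```python
import sqlite3


def purge_old_audits(conn, cutoff, batch_size=1000):
    """Delete audit rows older than `cutoff`, a batch at a time.

    Assumes a hypothetical table `audits(id INTEGER PRIMARY KEY, created_at TEXT)`
    where created_at is an ISO-8601 string, so plain string comparison
    orders dates correctly. Returns the total number of rows deleted.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM audits WHERE id IN ("
            "  SELECT id FROM audits WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # commit per batch so each transaction stays small
        if cur.rowcount == 0:
            break  # nothing older than the cutoff is left
        total += cur.rowcount
    return total
```

A scheduler would call something like `purge_old_audits(conn, "2024-01-01")` on a regular basis, with the cutoff date computed from whatever retention period is actually allowed.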
MariaDB also has a public Jira board, which documents this bug in the version of the software that AO3 just upgraded to: “‘Local temporary space limit reached’ on not so rare occasions.”
AD&T brought the site back up on Tuesday. It’s been safely up ever since. My understanding is, it stayed up after tech support from MariaDB helped them troubleshoot and implement a workaround for that issue.
Disclaimer that I am not a programmer! Someone more technical might come along and correct me on this! AD&T is working on an official postmortem — hopefully after they catch up on some well-deserved sleep — which will be way more illuminating than anything I can figure out in the meantime.
But my impression right now is that “AO3 software has problems with huge poorly-managed piles of data” ran headlong into “MariaDB upgrade has problems with not allocating certain operations enough space,” and it went about as well as a 12′ truck trying to drive under the 11’8″ bridge.
(Except in this case it’s a more normal bridge, where safely-loaded trucks usually pass under it with no problem, while AO3 is…I guess a truck with an extra 5 feet of height, caused by a wobbly pile of stuff held on top of it with a precarious set of bungee cords?) (It’s not a perfect analogy, okay. But you get the point.)


