I’m writing this in the spirit of Cunningham’s law: “the best way to get the right answer on the internet is not to ask a question; it’s to post the wrong answer.” I suspect there are inaccuracies here, and will amend when informed of them.


I was reading PostgreSQL Internals: 3 Things to Know About UPDATE Statements, which begins with a nice, succinct discussion of how Postgres’ MVCC design results in write amplification. Basically, updates to rows are not made “in place”; instead, a complete new version of the row is written, and the old version is vacuumed later. Depending on a table’s columns and what is being updated, modifying a single byte (e.g. for a boolean column) can amplify the write by several orders of magnitude: if the row also contains varchar columns holding hundreds of bytes, all of that data gets copied into the new row version just to record the one-byte change.
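
To make the arithmetic concrete, here’s a back-of-the-envelope sketch. The 23-byte tuple header is roughly right for Postgres, but the column sizes are invented for illustration:

    # Back-of-the-envelope: bytes rewritten by a Postgres-style
    # full-row-copy update that logically changes a single byte.
    row_header = 23          # approximate Postgres heap tuple header size
    columns = {
        "id": 8,             # bigint
        "active": 1,         # boolean -- the only column we change
        "description": 400,  # a few hundred bytes of varchar payload
    }

    row_size = row_header + sum(columns.values())
    changed = columns["active"]
    print(f"rewrote {row_size} bytes to change {changed}: "
          f"~{row_size // changed}x amplification")
    # -> rewrote 432 bytes to change 1: ~432x amplification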

This reminded me of the article from Uber several years ago on why they switched from Postgres to MySQL, one of the main reasons being write amplification. But Postgres is not alone in using MVCC to provide ACID guarantees: MySQL uses MVCC as well, albeit with a fairly different implementation. If that’s the case, how does MySQL avoid write amplification?

MySQL write amplification?

According to MySQL’s docs on InnoDB Multi-Versioning, MySQL, like Postgres, keeps multiple versions of the same row to service concurrent queries. But rather than writing entirely new row copies, MySQL uses undo logs, maintained in a separate undo tablespace and used both for transaction rollbacks and to “build earlier versions of a row for a consistent read.”

This latter statement suggests that rather than storing an entire copy of the old data, something like a changelog is maintained, which can be used to reconstruct the old version of the row at read time. The docs link to Consistent Nonlocking Reads, but that page covers the API and expected use rather than the implementation details. We also read that “The physical size of an undo log record in the rollback segment is typically smaller than the corresponding inserted or updated row”, which further suggests that undo logs don’t contain full copies; the hedge “typically” also hints at fixed per-record metadata, so they aren’t just a bare changelog or ordered list of diffs either.
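
As a rough mental model (a simplification for illustration, not InnoDB’s actual on-disk format), an update-type undo record would carry only the before-images of the fields that changed, plus some fixed metadata:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class UndoRecord:
        """Toy model of an update-type undo record: fixed metadata
        plus before-images of only the fields that changed."""
        old_trx_id: int               # who wrote the version this record restores
        old_fields: dict              # column -> previous value (before-image)
        prev: Optional["UndoRecord"]  # link to the next-older undo record

    # Flipping one boolean on a wide row yields a small undo record,
    # no matter how many bytes the rest of the row holds:
    rec = UndoRecord(old_trx_id=41, old_fields={"active": True}, prev=None)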

For a more in-depth view of InnoDB, we can turn to An In-Depth Analysis of UNDO Logs in InnoDB. Summarizing the relevant parts:

  • InnoDB has 3 different types of undo record, and the “update” type supports a dynamic number of “update fields”, which do indeed seem to function like a changelog. So if we change just one column of a row, rather than copying the whole row, InnoDB creates a new undo log entry containing just the changed data.

  • As explained in section 7 of the article, “Undo for MVCC”, a transaction will, as we suspected, walk back through a row’s undo log (using its own transaction ID and the set of visible transaction IDs assigned to it at start time) to find the relevant versions and reconstruct them, returning the correct result for its isolation level; a toy version of that walk is sketched after this list.
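
Continuing the toy model from the earlier sketch (again, a simplification of what InnoDB actually does):

    def read_version(row, writer_trx_id, undo, visible_trx_ids):
        """Toy consistent read: roll the newest row version back through
        its undo chain until a visible transaction wrote it."""
        row = dict(row)
        while writer_trx_id not in visible_trx_ids:
            if undo is None:
                return None              # no version is visible to this reader
            row.update(undo.old_fields)  # apply the before-images
            writer_trx_id = undo.old_trx_id
            undo = undo.prev
        return row

    # A reader that can see trx 41 but not the in-flight trx 42 gets the
    # pre-update value, reconstructed from the undo record above:
    latest = {"id": 7, "active": False, "note": "a long varchar payload"}
    print(read_version(latest, writer_trx_id=42, undo=rec, visible_trx_ids={41}))
    # -> {'id': 7, 'active': True, 'note': 'a long varchar payload'}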

Performance implications

From this we can surmise, with some educated guessing:

  • Given that the MySQL undo log records only the changed fields rather than copying whole rows, it is plausible that, depending on the nature of the writes and the table structure, this will indeed result in less write amplification than Postgres’ MVCC approach.

  • But note that MySQL undo log entries also carry a fair amount of header (and footer) metadata. For a small Postgres table (e.g. only a couple of int/byte columns), it’s plausible that Postgres’ approach of copying the whole row would actually use less space; a rough comparison is sketched after this list.

  • For situations with several open transactions, each of which has modified different columns of a row, a reader will probably have to do more work to reconstruct the result with MySQL than with Postgres.
    • On the other hand, the actual performance impact will largely depend on how the data is stored. Old Postgres tuples may end up scattered across multiple pages, whereas MySQL appears to keep undo log entries close together in a dedicated tablespace. So Postgres’ approach might look faster from a big-O point of view but perform worse in practice due to poor data locality; in other words, read amplification.
  • Some other aspects of the MySQL approach seem like they could cause problems, such as limits on the number of undo log entries, which presumably constrain the number and lifetime of open transactions, but I’m not going to dig into that now.
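
To put rough numbers on the space trade-off above (every value here is a guess for illustration; in particular the 40-byte undo overhead is assumed, not taken from InnoDB’s real record layout):

    # Bytes written to preserve the old version on a one-byte update.
    PG_TUPLE_HEADER = 23  # approximate Postgres heap tuple header size
    UNDO_OVERHEAD = 40    # assumed per-undo-record header/footer bytes

    def postgres_bytes(row_payload):
        # A complete new tuple is written; the old one lingers for vacuum.
        return PG_TUPLE_HEADER + row_payload

    def innodb_bytes(changed):
        # The row changes in place, plus an undo record holding the diff.
        return changed + UNDO_OVERHEAD + changed

    for payload in (10, 500):
        print(f"{payload}B row: postgres~{postgres_bytes(payload)}B, "
              f"innodb~{innodb_bytes(1)}B")
    # 10B row:  postgres~33B,  innodb~42B  (the full copy wins on tiny rows)
    # 500B row: postgres~523B, innodb~42B  (the diff wins on wide rows)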

But thinking back to the Uber article, the write amplification they focus on is of a very different type, having nothing explicitly to do with MVCC. Instead, it’s all about how Postgres must update every index on a table when a row is modified, even if only non-indexed columns are changing. Postgres has since introduced (or improved) support for something called Heap Only Tuple (HOT) updates, which are covered well in Fighting PostgreSQL write amplification with HOT updates.
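
As an aside, Postgres tracks how often updates actually take the HOT path in pg_stat_user_tables (n_tup_upd and n_tup_hot_upd are real columns there). A quick way to check your own tables, sketched with psycopg2 and a placeholder connection string:

    import psycopg2  # assumes psycopg2 is installed and Postgres is reachable

    conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT relname, n_tup_upd, n_tup_hot_upd,
                   round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1)
            FROM pg_stat_user_tables
            ORDER BY n_tup_upd DESC
            LIMIT 10
        """)
        for relname, upd, hot, pct in cur.fetchall():
            print(f"{relname}: {upd} updates, {hot} HOT ({pct}%)")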

My main takeaway from looking into this is that write amplification can and does happen in various places. “Solving” it is often a matter of shuffling it around or trading it for read amplification, rather than making it go away. This reminds me of fan-out vs fan-in approaches to implementing “news feed” style features: do you make reads fast for everyone by having every new post written to all the followers’ feeds, making writes slow (fan-out, write amplification)? Or do you make writes fast but reads slow by having each read query all the followees’ post lists (fan-in, read amplification)? In practice I remember reading that Twitter did some of both: fan-outs for most users, but fan-ins for high-follower-count users, so that Justin Bieber didn’t cause downtime whenever he tweeted.
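
A schematic of those two shapes (purely illustrative, not how Twitter actually built it):

    from collections import defaultdict

    posts = defaultdict(list)      # author -> posts; fan-in reads scan these
    feeds = defaultdict(list)      # user -> precomputed feed; fan-out writes here
    followers = defaultdict(set)   # author -> set of follower ids

    def publish_fanout(author, text):
        # Write amplification: one post becomes one write per follower.
        posts[author].append(text)
        for user in followers[author]:
            feeds[user].append((author, text))

    def read_feed_fanin(following):
        # Read amplification: one read touches every followee's post list.
        return [(author, text) for author in following for text in posts[author]]

The hybrid approach then amounts to picking between these two functions per author, based on follower count.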