Replication Lag: Messages from the Future
In a previous post, I wrote about maintaining read-your-writes consistency when replication lag occurs. A user writes changes to the master database, but the reads immediately after are routed to a lagging read replica that does not yet reflect those changes. In effect, the user is transported to the past, with an outdated view of the database.
That case deals with a user working against a single data store. What problems might we have when we move to a multi-store system? More specifically, what happens when multiple services send messages to each other with different views of the same data? In this post we'll go through a particularly strange bug we faced at Drop in Q4 2019 that illustrates exactly that.
Problem
Our system uses a publisher-subscriber model to support an event-based architecture in our Rails application. We abstract certain database changes in PostgreSQL into domain events and send these events as messages to subscribers using Redis.
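To make this concrete, here's a minimal sketch of the publish side using the redis-rb gem. The channel name and event shape are hypothetical stand-ins, not our actual internals.

```ruby
require "redis"
require "json"

redis = Redis.new

# Serialize a domain event and publish it to a Redis channel
# where subscribers are listening.
event = { name: "order.completed", payload: { order_id: 42 } }
redis.publish("domain_events", event.to_json)
```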
Our system worked quite well for many months with no issues, but during certain periods of time we noticed higher failure rates for our subscribers. After further investigation, we realized that these periods of high failure rates lined up with spikes in our replication lag.
What happened was that our pub-sub system was sending events faster than PostgreSQL was replicating the database changes those events represented. So a subscriber would receive a message that some event had occurred, but the read replica did not yet reflect the change. In effect, our subscribers were receiving messages from the future.
What do? 🤷
What can we do to let a subscriber know whether the event it received is reflected in its view of the database? The solution was actually quite straightforward.
Background
Our system uses the outbox pattern to guarantee at-least-once delivery of our messages. Whenever we make a database change that we want to abstract into a domain event, we also create an outbox record representing that event in the same transaction block. That way, the outbox event is guaranteed to exist in the database only if the changes it represents are also reflected there. A background worker then grabs any unpublished outbox events and sends them to subscribers accordingly.
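Here's a rough sketch of what that looks like with ActiveRecord. The Order and OutboxEvent models, and the worker's publishing logic, are hypothetical simplifications of our actual setup.

```ruby
# The database change and its outbox record commit atomically, so the
# outbox event exists only if the change it represents does too.
ActiveRecord::Base.transaction do
  order = Order.create!(user_id: user.id, status: "completed")
  OutboxEvent.create!(
    name: "order.completed",
    payload: { order_id: order.id }
  )
end

# A background worker later publishes any unsent outbox events.
OutboxEvent.where(published_at: nil).find_each do |event|
  redis.publish("domain_events",
                { name: event.name, payload: event.payload }.to_json)
  event.update!(published_at: Time.current)
end
```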
Solution
So how do we address this issue? For messages about database creates, the solution is trivial: attempt to read the data from the database, and if it's not there, error out and retry the job. (This assumes your pub/sub system has built-in retry logic, which ours does.)
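A minimal sketch of that create case, assuming a Sidekiq-style worker with automatic retries (the model and message shape are hypothetical):

```ruby
require "sidekiq"

class OrderCreatedSubscriber
  include Sidekiq::Worker

  def perform(message)
    # If the replica hasn't caught up yet, this raises
    # ActiveRecord::RecordNotFound, which fails the job and lets
    # Sidekiq's retry logic re-run it later.
    order = Order.find(message.fetch("order_id"))
    # ... process the newly created order ...
  end
end
```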
But what about messages for database updates? How does the subscriber know that the current state of the data in the replica reflects the change the event corresponds to? There are a number of ways to approach this. For example, we could include the database changes themselves in the message payload, or pass a timestamp and check that it matches the record's updated_at field.
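The timestamp option might look something like the following sketch; StaleReplicaError and the message fields are hypothetical.

```ruby
require "time"

StaleReplicaError = Class.new(StandardError)

def verify_freshness!(message)
  order = Order.find(message.fetch("order_id"))
  message_time = Time.iso8601(message.fetch("updated_at"))

  # If the replica's copy is older than the write this message
  # describes, the replica is lagging; raising lets the job retry.
  raise StaleReplicaError if order.updated_at < message_time
end
```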
We chose a simpler option: pass the event_id as part of the message, and have the subscriber start by reading the outbox event from the database. Since the outbox event is created in the same transaction block as the database change it represents, finding the outbox event in the replica guarantees that the subscriber is working on sufficiently fresh data. If the outbox event is not in the database, we know there is replica lag, and we can simply error out and retry. We could add fancy logic to delay the subscriber's retry based on replication lag, but we just let our built-in retry logic handle those semantics.
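Here's a sketch of what that looks like in a subscriber, again with hypothetical model and field names:

```ruby
require "sidekiq"

class OrderCompletedSubscriber
  include Sidekiq::Worker

  def perform(message)
    # Raises ActiveRecord::RecordNotFound if the replica hasn't
    # replicated the outbox event yet, triggering a retry.
    event = OutboxEvent.find(message.fetch("event_id"))

    # The outbox event committed in the same transaction as the change
    # it represents, so everything we read from here on is at least as
    # fresh as that change.
    order = Order.find(event.payload["order_id"])
    # ... process the completed order ...
  end
end
```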
Note that treating the presence of the outbox event as proof of sufficiently fresh data assumes our application doesn't require consistent prefix reads, the guarantee that clients see changes in the order in which they occurred. Our publisher-subscriber system does not provide this guarantee, but our application doesn't actually need it (at this point in time). If our application ever needs to support consistent prefix reads, we will address the problem then and write another blog post 😬.
TLDR
- If our message queue is processing messages faster than our database is replicating, subscribers will be responding to messages that refer to db events that are not yet reflected in replicas.
- Make sure the consumers of your messages have built-in retry logic if they need to read the data that the message refers to.
- Force your consumer to explicitly read the outbox event it is subscribing to, to guarantee that the database's state is consistent with the changes the outbox event corresponds to.
- If your use case requires consistent prefix reads, then you are screwed and you need a creative workaround to address it. When you solve the problem, share your knowledge and write a blog post 😀