Tuesday, July 24, 2012

TIPC: Replication Log Example



Losing customer requests is terrible, at least for the applications that I work on.  Some applications can survive request loss by pushing the problem to the customer -- having them sort out the consistency problem and re-submit the request if necessary.  Here, I want to describe a solution I am working on that avoids request loss by leveraging the reliability of replication within a cluster.

The service of a user's request involves interacting with (and possibly changing) the state of the world -- the persistent data on the server.  Each request received by the server needs to be persisted reliably and immediately, so that only in the event of a catastrophic failure do we lose that request.  This means that we either replicate the request to multiple nodes within the cluster or attempt to preserve it on a reliable storage device/appliance.

After investigating reliable storage with low latency, I came to the conclusion that I/O devices like these are very expensive and (at least the cheaper ones) tend to be installed in a single computer.  If you lose that computer, asking your server facility to pull a card out and put it into another machine does not lead to a zero down-time solution.  An expensive appliance sitting on your network can accomplish the goal of persistence, but then you have already opened the door to communicating on the network to do persistence.  You might as well persist by replicating the data to multiple nodes in your cluster.


Persistence by Cluster Replication



Consider the following picture, where a service G has received the user request and replicated it to several instances of another service S, which just stores and acknowledges the data:



Each of the pink boxes is a separate machine/node in your cluster with a separate power supply and UPS (hopefully), so that you can be reasonably certain that a minor disaster only affects one node.  In the terms used by Greg Lindahl in his talk on replication (which goes into how replication is simply a better option than investing in RAID once you start to think about clustered architecture), we are achieving R3 replication.

The strategy is to receive the customer request in G, annotate it with some immediately available state into a message (e.g. timestamps, sequence numbers -- back away from that database handle, hot-shot!), and send it to S with an increasing message sequence number.  S receives this stream of incoming messages i, (i+1), (i+2), ... , and "stores" them into a sequence of memory-mapped archive files, each of which is just an N-element message array, by placing message i into mapped file (i/N) at position (i mod N).
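
Here is a minimal sketch in C of that slot arithmetic.  The names (MSGS_PER_ARCHIVE, MSG_PAYLOAD_SIZE, struct slot) and the fixed message size are my own assumptions, not part of a finished implementation:

    #include <stdint.h>

    #define MSGS_PER_ARCHIVE  (1u << 20)   /* N: number of slots per archive file (assumed)   */
    #define MSG_PAYLOAD_SIZE  512          /* fixed message size in bytes (assumed)           */

    /* One fixed-size slot in an archive file. */
    struct slot {
        uint64_t seqno;                    /* sequence number assigned by G                   */
        uint32_t valid;                    /* set non-zero once the message is fully received */
        uint8_t  payload[MSG_PAYLOAD_SIZE];
    };

    /* Message i lives in archive file (i / N) at slot (i mod N). */
    static inline uint32_t archive_index(uint64_t i) { return (uint32_t)(i / MSGS_PER_ARCHIVE); }
    static inline uint32_t slot_index(uint64_t i)    { return (uint32_t)(i % MSGS_PER_ARCHIVE); }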

By using memory-mapped files we get two excellent side-effects.  First, the virtual memory system of the node running S will eventually persist the mapped file permanently to disk for us on its own.  Second, we have the speed of a direct in-memory recv() into that mapped memory (assuming that the mapping of new archive files into memory is somehow done for us by another thread).  In order to get the multicast send/recv done, I am going to use TIPC and its reliable multicast socket, which I discussed in an earlier posting.
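
Mapping a fresh archive file is plain POSIX.  The sketch below continues the struct slot / MSGS_PER_ARCHIVE definitions from above; the file-name scheme is my own invention:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Create (or reopen) archive file 'idx' and map it read/write.  Once it is
     * mapped, storing a message is just a copy into the right slot; the kernel's
     * virtual memory system writes the dirty pages back to disk later. */
    static struct slot *map_archive(uint32_t idx)
    {
        char path[64];
        snprintf(path, sizeof(path), "archive-%08u.log", (unsigned)idx);

        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return NULL; }

        size_t len = (size_t)MSGS_PER_ARCHIVE * sizeof(struct slot);
        if (ftruncate(fd, (off_t)len) < 0) { perror("ftruncate"); close(fd); return NULL; }

        void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                          /* the mapping stays valid after close */
        return base == MAP_FAILED ? NULL : (struct slot *)base;
    }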

Once the message is properly received, an extra word in the S log is set to indicate "valid data", and an acknowledgement (containing the sequence number) is multicast on a separate channel back to G.  Ideally, G has a thread for processing customer requests and sending messages to S, a rotating buffer to keep those "in the process of persistence" messages, and a second thread which deals with acknowledgements (implementing an "ack" policy which indicates how much persistence is enough).  Once the policy is fulfilled in this second thread, persistence has been ensured, and G is free to send some further acknowledgement to the customer ("we have it!").
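
The receive path in S then looks roughly like the following sketch.  It reuses struct slot and slot_index() from above; the ack message layout and the ACK_SVC_TYPE service type are arbitrary choices of mine:

    #include <linux/tipc.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    #define ACK_SVC_TYPE  18890u          /* TIPC service type of the ack channel (my choice) */

    struct ack_msg {
        uint64_t seqno;                   /* sequence number being acknowledged          */
        uint32_t replica_id;              /* which S instance is acknowledging (assumed) */
    };

    /* Store an incoming message into its slot, mark it valid, and multicast an
     * acknowledgement back to G on the separate ack channel. */
    static void store_and_ack(int ack_sock, struct slot *archive, uint64_t seqno,
                              const void *msg, size_t len, uint32_t replica_id)
    {
        struct slot *s = &archive[slot_index(seqno)];
        memcpy(s->payload, msg, len < MSG_PAYLOAD_SIZE ? len : MSG_PAYLOAD_SIZE);
        s->seqno = seqno;
        s->valid = 1;                     /* the extra "valid data" word */

        struct ack_msg ack = { .seqno = seqno, .replica_id = replica_id };
        struct sockaddr_tipc dst = {
            .family       = AF_TIPC,
            .addrtype     = TIPC_ADDR_MCAST,
            .addr.nameseq = { .type = ACK_SVC_TYPE, .lower = 0, .upper = 0 },
        };
        sendto(ack_sock, &ack, sizeof(ack), 0, (struct sockaddr *)&dst, sizeof(dst));
    }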

Losing an S


Consider what happens when we lose a node that runs an S:


In this case we degrade to R2 replication, and we start to get nervous.  Fortunately our cluster resource manager (Pacemaker, in my setup) notices this and starts up another S somewhere else:



The fresh S has empty archives, however, so it needs to immediately start listening to G and asking the other Ss to fill it in on what has been going on.  This is done by implementing a second thread in S dedicated to peer-to-peer replay of existing archive data for the benefit of the newcomer.
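
On the serving side, that replay thread might look like the sketch below.  The request format and the point-to-point transport are assumptions of mine, and for brevity it pretends the requested range fits in one mapped archive (struct slot and slot_index() come from the earlier sketch):

    #include <stdint.h>
    #include <sys/socket.h>

    struct replay_request {
        uint64_t from_seqno;              /* newcomer wants everything at or after this */
    };

    /* Peer side of replay: stream every valid slot in [from_seqno, last] back to
     * the requester so the fresh S can fill in its empty archives. */
    static void serve_replay(int peer_sock, const struct slot *archive,
                             uint64_t first, uint64_t last,
                             const struct replay_request *req)
    {
        uint64_t start = req->from_seqno > first ? req->from_seqno : first;
        for (uint64_t i = start; i <= last; i++) {
            const struct slot *s = &archive[slot_index(i)];
            if (s->valid)
                send(peer_sock, s, sizeof(*s), 0);
        }
    }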

Losing a G


If G restarts, or needs to be run fresh on a new node because the node it was running on has failed, then it will need to come up and ask the network of Ss what the last message sequence number was (so that it can start generating new messages which neither overlap with nor leave a gap after it).  A fourth TIPC reliable multicast channel is dedicated to serving that information.  G waits a specified period (I think 1 second) to receive as many answers to that question as possible and uses the maximum sequence number among the responses to begin its new stream.
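
A sketch of that recovery step in G, using a plain receive timeout for the answer window (sending the query itself is elided, and the reply format -- a bare 64-bit sequence number -- is my own assumption):

    #include <stdint.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    /* Collect "last sequence number" replies for roughly one second and return
     * the first sequence number the restarted G should use. */
    static uint64_t recover_next_seqno(int sock)
    {
        struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };   /* the ~1 second window */
        setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

        uint64_t reply, max_seen = 0;
        for (;;) {
            ssize_t n = recv(sock, &reply, sizeof(reply), 0);
            if (n < 0)                    /* timeout: the answer window has closed */
                break;
            if (n == sizeof(reply) && reply > max_seen)
                max_seen = reply;
        }
        return max_seen + 1;              /* neither overlaps nor leaves a gap */
    }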

It is possible to lose a request when G is lost, but that request will not yet have been acknowledged.  Given the low latency characteristics of G, it is more likely that what we lose is customer traffic which has yet to be received by G on the front side.

Benefits


There are a number of benefits to this scheme:

  1. With low latency we are certain to reliably persist the customer's request
  2. No relational databases are disturbed which might have unpredictable latency profiles (adversely affecting performance)
  3. Replication scales well in a multi-cell sense (sharding) on the cluster
  4. Each node with an S running could have an API to walk the memory-mapped archive files independently (without disrupting S) and (possibly) do logarithmic (in space) searches on the data stored there (see the sketch after this list)
  5. Minimal indexing is maintained -- it's just a binary log of same-sized messages we are replicating -- so it does not affect performance.
  6. Less reliance on non-volatile storage and its failure characteristics ("spinning rust", as they call it), in favour of RAM-based storage
  7. Leveraging the efficiency of virtual memory to choose the moment and method to "sync" data to non-volatile storage.
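
As a sketch of the kind of search point 4 has in mind, assuming the struct slot layout from earlier and that reliable in-order delivery leaves the valid prefix of each archive sorted by sequence number, a reader process mapping the same file read-only could do an ordinary binary search:

    #include <stddef.h>
    #include <stdint.h>

    /* Binary search the first 'nvalid' slots of a mapped archive for a sequence
     * number.  Runs in O(log N) probes and never touches the S process itself. */
    static const struct slot *find_seqno(const struct slot *archive,
                                         uint32_t nvalid, uint64_t seqno)
    {
        uint32_t lo = 0, hi = nvalid;
        while (lo < hi) {
            uint32_t mid = lo + (hi - lo) / 2;
            if (archive[mid].seqno < seqno)
                lo = mid + 1;
            else
                hi = mid;
        }
        return (lo < nvalid && archive[lo].seqno == seqno) ? &archive[lo] : NULL;
    }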

TIPC's reliable multicast really shines for this problem, since it enables us to have services addressed at the cluster level (which node they run on is a detail the programmer does not need to delve into), and the ability to reliably send message-oriented packets over multicast to just the nodes that are listening is a big win.
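
For completeness, this is roughly what each S does to join the G-to-S stream: bind a reliable datagram socket to a service name with cluster scope, and TIPC delivers G's multicasts to every bound instance wherever it runs.  The LOG_SVC_TYPE value is an arbitrary choice of mine:

    #include <linux/tipc.h>
    #include <sys/socket.h>

    #define LOG_SVC_TYPE  18889u          /* TIPC service type of the G -> S stream (my choice) */

    /* Open a TIPC reliable datagram socket and bind it to the log service name
     * with cluster scope, so this S receives the multicast stream no matter
     * which node it happens to be running on. */
    static int open_log_listener(void)
    {
        int sock = socket(AF_TIPC, SOCK_RDM, 0);
        if (sock < 0)
            return -1;

        struct sockaddr_tipc addr = {
            .family       = AF_TIPC,
            .addrtype     = TIPC_ADDR_NAMESEQ,
            .scope        = TIPC_CLUSTER_SCOPE,
            .addr.nameseq = { .type = LOG_SVC_TYPE, .lower = 0, .upper = 0 },
        };
        if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return -1;
        return sock;
    }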
