Using the saga design pattern for microservices transactions

Dividing business actions into their smallest possible units through microservices sounds like a wonderful idea. However, in business-critical scenarios, the architecture must align and coordinate half a dozen different distributed services and systems. If a step fails, an entire business process may need to be rolled back and corrected, much like a failed database transaction.

Microservices, by default, have no coordination or control function: individual services should, by design, have no knowledge of the services around them. As such, maintaining smooth process flows between systems is much easier said than done without some sort of supervisory controller. Fortunately, we can achieve this through an architectural design known as the saga pattern.

What is the saga design pattern?

Imagine an Electronic Data Interchange (EDI) interface that takes orders from customers and passes them to a product ordering system. This order service then initiates a chain of service calls that alert downstream manufacturing and shipping systems to complete the transaction. The diagram below shows an example of this pattern, where services call each other in a “round robin” style. Once the whole transaction chain is finalized, the shipping service sends a message to the product ordering system to confirm its completion. This approach is known as service choreography.

Typically, the application will perform each of these actions one at a time. If the production order and the shipping order are fulfilled but the payment transaction fails, a team member or the system must alert the earlier services in the chain so that they can roll back their work.

Unfortunately, things get a bit more complex when large-scale business transactions span long periods of time. If one of these systems fails along the way or the order is canceled, we need a system that can perform a logical rollback and reset all of these systems and transactions. For example, a single payment transaction failure could very well force teams to roll back dozens of previous transactions that were made by dozens of separate systems.

Airline ticketing systems are a perfect example of this problem. An unexpected event can cause people to cancel a trip minutes before the plane takes off. That single cancellation requires ticketing systems to adjust seat availability, baggage systems to reroute luggage, and payment systems to issue the necessary refunds, and these are likely just some of the steps involved.

As you can imagine, this system needs a way to “clean up” after itself by reversing the effects of some of the previous messages between web services. Unfortunately, our transactions are too complex and long-running to simply call all the services in sequence and validate them. This calls for something more nuanced than a master program: a controller that can take ownership of the entire process. That is where the saga pattern comes in, delivering service orchestration.
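The rollback idea can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `SagaStep` and `run_saga` names are hypothetical, the steps are in-process functions rather than real web services, and each step is assumed to come with an undo action that logically reverses it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    """One step of the business process plus the action that undoes it."""
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # Logical rollback: most recent successful step is undone first.
            for done in reversed(completed):
                done.compensate()
            return False
    return True


# Hypothetical example: production and shipping succeed, payment fails.
log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("payment declined")


steps = [
    SagaStep("production",
             lambda: log.append("production order created"),
             lambda: log.append("production order cancelled")),
    SagaStep("shipping",
             lambda: log.append("shipping order created"),
             lambda: log.append("shipping order cancelled")),
    SagaStep("payment", fail_payment, lambda: log.append("payment refunded")),
]

succeeded = run_saga(steps)
```

Because the payment step never completed, its undo action is not called; only the shipping and production orders are cancelled, in that order.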

Saga implementation for web services

To illustrate the saga design pattern, imagine that your team implements an enterprise service bus that listens for particular transaction events and then passes messages to the relevant systems to set them in motion. Once the bus creates a message that represents an event, it sends that message to every service associated with the event. In this case, the controller is a web service that is triggered by the event. The controller then makes function calls to the next enterprise web service in the queue.

Note that this gives us two types of services:

  • Controllers, which receive events in the form of messages and then relay functional instructions to other services; and
  • Services that execute the actual business process that needs to happen and then communicate their completion to move the transaction forward.

[Diagram: round-robin style service choreography]

To implement these controller services, you can create an event handler for an event-driven application or introduce a finite state machine that models the sequential logic. This component takes each message, determines where it came from, checks the current state of the transaction, and then issues the next command. This can be accomplished with a simple switch statement, a set of nested if statements, or even a single database lookup.
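As a rough sketch of the lookup approach, the controller's logic can be expressed as a table that maps a state and an incoming event to the next state and the next command. The state names, event names, and commands below are all hypothetical, and an in-memory dictionary stands in for the database lookup.

```python
from typing import Dict, List, Optional, Tuple

# (current state, incoming event) -> (next state, command to send next)
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, Optional[str]]] = {
    ("NEW", "order_placed"):
        ("AWAITING_PRODUCTION", "create_production_order"),
    ("AWAITING_PRODUCTION", "production_completed"):
        ("AWAITING_SHIPPING", "create_shipping_order"),
    ("AWAITING_SHIPPING", "shipping_completed"):
        ("AWAITING_PAYMENT", "charge_payment"),
    ("AWAITING_PAYMENT", "payment_completed"):
        ("DONE", None),
    ("AWAITING_PAYMENT", "payment_failed"):
        ("ROLLING_BACK", "cancel_shipping_order"),
}


def handle_event(state: str, event: str) -> Tuple[str, Optional[str]]:
    """Return the next state and command; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, None))


# Walk one order through the table, ending in a failed payment.
state = "NEW"
commands: List[str] = []
for event in ["order_placed", "production_completed",
              "shipping_completed", "payment_failed"]:
    state, command = handle_event(state, event)
    if command is not None:
        commands.append(command)
```

Keeping the transitions in data rather than in nested conditionals makes it easier to see at a glance which events are legal in which state, at the cost of hiding any per-step logic behind the command names.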

Keep in mind that implementing this design can still be tricky if the overall application demands high reliability. For example, imagine that the controller service crashes after triggering an “order placed” event but before it can pass a “payment completed” event. When the service restarts, it will need to access some kind of transaction log, check for unprocessed transactions, resubmit the event(s), and mark each job as done. Even then a gap remains: the system could shut down after sending an event but before confirming the commit, so the same event may be sent again on restart. There are a number of architectural patterns that solve this specific problem, but the simplest by far is to allow redundant messages and design the services to ignore any they have already handled.
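One way to picture the "allow redundant messages, then ignore them" approach is a service that remembers which message IDs it has already handled. The sketch below is illustrative only: `PaymentService`, `resubmit_unprocessed`, and the log format are hypothetical, and the processed IDs live in memory, whereas a real service would need to store them durably alongside the work itself.

```python
from typing import Dict, List, Set


class PaymentService:
    """Hypothetical service that ignores messages it has already processed."""

    def __init__(self) -> None:
        self.processed_ids: Set[str] = set()
        self.charges: List[str] = []

    def handle(self, message_id: str, order_id: str) -> bool:
        """Charge the order once; return False for a duplicate message."""
        if message_id in self.processed_ids:
            return False  # redundant delivery, safely ignored
        self.charges.append(order_id)
        self.processed_ids.add(message_id)
        return True


def resubmit_unprocessed(transaction_log: List[Dict[str, str]],
                         service: PaymentService) -> int:
    """On restart, resend every logged event not marked done; return the count."""
    resent = 0
    for entry in transaction_log:
        if entry["status"] != "done":
            service.handle(entry["message_id"], entry["order_id"])
            entry["status"] = "done"
            resent += 1
    return resent


# The message was delivered before the crash, but the log was never updated.
service = PaymentService()
service.handle("msg-1", "order-42")
transaction_log = [
    {"message_id": "msg-1", "order_id": "order-42", "status": "pending"},
]
resent = resubmit_unprocessed(transaction_log, service)
```

The controller resends the event after restarting, but the order is still charged only once because the service recognizes the message ID.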

Do you need to implement the saga design pattern?

The saga design pattern exists primarily to give long-running, multi-system business processes the ability to recover from failures in an intelligent way. However, it adds more code, which brings new layers of complexity, harder debugging, and higher bandwidth and processing requirements.

Frankly speaking, an orchestration-driven saga will generally prove overkill for simple transactions that stay within a single application. Unless your organization struggles with managing large chains of business processes, the code complexity involved in a saga design pattern can cause more problems than it solves. But if long-running transactions keep you up at night, especially when it comes to dealing with failures, the saga pattern may be the answer you’ve been looking for.

Abdul J. Gaspar