Orchestration-based Saga on Serverless

Recently I came across a very good Azure sample on GitHub that implements an orchestration-based saga on serverless. It looks promising for serverless architectures that need to solve real-world business problems. I am sharing the details here, along with a link to the code repository, and I hope you find them useful.

Problem

Contoso Bank is building a new payment platform leveraging the development of microservices to rapidly offer new features in the market, where legacy and new applications coexist. Operations are now distributed across applications and databases, and Contoso needs a new architecture and implementation design to ensure data consistency on financial transactions.

The traditional ACID approach is no longer suited to Contoso Bank, because the data for a single operation now spans isolated databases. Instead of ACID transactions, a Saga addresses the challenge by coordinating a workflow through a message-driven sequence of local transactions to ensure data consistency.

Solution

The solution simulates a money transfer scenario, where an amount is transferred between bank accounts through credit/debit operations and an operation receipt is generated for the requester. It is a Saga pattern implementation reference through an orchestration approach in a serverless architecture on Azure. The solution leverages Azure Functions for the implementation of Saga participants, Azure Durable Functions for the implementation of the Saga orchestrator, Azure Event Hubs as the data streaming platform and Azure Cosmos DB as the database service.

The implementation reference addresses the following challenges and concerns from Contoso Bank:

Developer experience: A solution that allows developers to focus only on the business logic of the Saga participants and simplifies the implementation of stateful workflows on the Saga orchestrator. The proposed solution leverages the Azure Functions programming model, reducing the overhead of state management, checkpointing (the mechanism that updates the offset of a messaging partition when a consumer service processes a message) and restarts in case of failures.

Resiliency: A solution capable of handling a set of potential transient failures (e.g. operation retries on databases and message streaming platforms, timeout handling). The proposed solution applies a set of design patterns (e.g. Retry and Circuit Breaker) on operations with Event Hubs and Cosmos DB, as well as timeout handling on the production of commands and events.
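To make the Retry and Circuit Breaker patterns mentioned above more concrete, here is a minimal Python sketch. It is not code from the sample repository: TransientError, retry, CircuitBreaker and flaky_write are hypothetical names, and the breaker is simplified (a real one would normally also have a timed half-open state).

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # retries exhausted, let the caller decide
            sleep(base_delay * 2 ** attempt)


class CircuitBreaker:
    """Opens after `threshold` consecutive failures and then rejects calls."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.threshold

    def call(self, operation):
        if self.is_open:
            raise RuntimeError("circuit open: call rejected without trying")
        try:
            result = operation()
        except TransientError:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the failure count
        return result


# Simulated database write that is throttled twice before it succeeds.
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated throttling")
    return "written"


result = retry(flaky_write, sleep=lambda _: None)  # no real sleeping in the demo
```

Retry absorbs short glitches, while the breaker stops a participant from hammering a dependency that is clearly down.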

Idempotency: A solution where each Saga participant can execute multiple times and provide the same result to reduce side effects, as well as to ensure data consistency. The proposed solution relies on validations on Cosmos DB for idempotency, making sure there is no duplication on the transaction state and no duplication on the creation of events.
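A rough sketch of the idempotency idea, assuming an in-memory dictionary as a stand-in for the Cosmos DB validation. AccountParticipant, its fields and the event shape are hypothetical and only illustrate the technique: look up the transaction id before doing any work, and return the stored event on a duplicate delivery.

```python
class AccountParticipant:
    """Hypothetical Saga participant that credits an account at most once per transaction."""

    def __init__(self, balance):
        self.balance = balance
        self.processed = {}  # transaction_id -> event already produced

    def credit(self, transaction_id, amount):
        # Duplicate delivery: return the stored event, do not touch the balance.
        if transaction_id in self.processed:
            return self.processed[transaction_id]
        self.balance += amount
        event = {
            "transaction_id": transaction_id,
            "type": "Credited",
            "balance": self.balance,
        }
        self.processed[transaction_id] = event
        return event


account = AccountParticipant(balance=100)
first = account.credit("tx-1", 50)
second = account.credit("tx-1", 50)  # the same message delivered again
```

After both calls the balance has changed only once, and both calls return the same event.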

Observability: A solution that is capable of monitoring and tracking the Saga workflow states per transaction. The proposed solution leverages Cosmos DB collections that allow the workflow to be tracked with a single query.

Architecture

The repository documents the core components of the solution, the workflows, and the design decisions:

Source Code: https://github.com/Azure-Samples/saga-orchestration-serverless

I hope this will be helpful!

Distributed Transaction in Microservices using SAGA Pattern

Everyone today is thinking about and building microservices, myself included. A microservice architecture, by its core principles and in its true context, is a distributed system.

In a microservice architecture, a distributed transaction is an outdated approach that causes severe scalability issues. Modern patterns that rely on asynchronous data replication, or that model distributed write operations as orchestrated or choreographed sagas, avoid these problems. I will try to explain the orchestrated saga in detail in this article.

A distributed transaction is one that spans multiple databases across the network while preserving ACID properties. If a transaction requires services A and B to each write to their own database, and to roll back if either of them fails, then it is a distributed transaction.

Problem statement

To see why a distributed transaction is hard, let’s take a look at an extremely common real-life example: an e-commerce application.

Our e-commerce application contains different microservices. Each service has its own database. Some business transactions, however, span multiple services, so you need a mechanism to ensure data consistency across services.

Let’s say we have an order service, an inventory service and a payment service. The boundaries are clear: the order service takes the order, the inventory service allocates the stock, and the payment service deals only with payment and refund related issues.

A single order transaction = creating an order + reserving stock + taking payment, in any order. A failure at any point during the transaction should revert everything before it.

A payment failure should cause the inventory service to release the reserved stock, and the order service to cancel the order.

If our e-commerce application is designed as described above, with the services calling each other directly, the approach has some serious flaws:

  • The fallacy of the distributed system – Relies heavily on the stability of the network throughout the transaction.
  • Transactions could end up in an indeterminate state.
  • Fragile to topology changes – Each system has explicit knowledge of its dependency.

Imagine the payment service calls a third-party API such as PayPal or Stripe; that part of the transaction is effectively out of your control. What happens if the API is down or throttled? Or if there is a disruption along the network path? Or if one of the three services is down?

If the inventory service managed to reserve some stock, but the payment service timed out for whatever reason, we cannot say that the payment has failed.

If we treat the timeout as a failure, we roll back the stock reservation and cancel the order. But the payment may actually have gone through: perhaps the external payment API took longer than usual, or there was a network disruption, and we cut off the connection before the payment service had a chance to respond. Now the transaction is in the Paid and Stock Released states simultaneously.

This is really painful, isn’t it? Your production support team will be very busy handling failed-transaction tickets if your buyers frequently face such issues while placing an order. What if your buyers get fed up and order from a competitor instead? Such small incidents can lead to huge financial loss.

As a software architect, you must anticipate such problems and design your microservices and application in such a way that a transaction never leaves the data inconsistent.

Solution

We can overcome this problem of data consistency between databases by using the Saga pattern. It models the globally distributed transaction as a series of local ACID transactions, with compensation as the rollback mechanism. The global transaction moves between different defined states depending on the result of each local transaction.

There are two ways of coordinating sagas:

  • Choreography – each local transaction publishes domain events that trigger local transactions in other services
  • Orchestration – an orchestrator (object) tells the participants what local transactions to execute

The difference between them is the method of state transition. We will talk about orchestration in this post.

Orchestration Based Saga

An orchestration-based saga has an orchestrator that tells the saga’s participants what to do. The saga orchestrator communicates with the participants using request/asynchronous response-style interaction. To execute a saga step, it sends a command message to a participant telling it what operation to perform. After the saga participant has performed the operation, it sends a reply message to the orchestrator. The orchestrator then processes the reply message and determines which saga step to perform next.
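The command and reply loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: two in-memory queues stand in for the message channels, delivery is simulated synchronously, and STEPS, participant and run_saga are hypothetical names.

```python
from collections import deque

# Hypothetical saga definition: the commands to send, in order.
STEPS = ["CreateOrder", "ReserveStock", "Pay"]


def participant(command):
    """Hypothetical participant: performs the operation and sends a reply."""
    return {"command": command, "status": "succeeded"}


def run_saga(handle=participant):
    commands = deque()  # orchestrator -> participants
    replies = deque()   # participants -> orchestrator
    log = []
    step = 0
    commands.append(STEPS[step])
    while commands:
        command = commands.popleft()
        replies.append(handle(command))  # stands in for asynchronous delivery
        reply = replies.popleft()
        log.append((reply["command"], reply["status"]))
        if reply["status"] != "succeeded":
            return "failed", log  # compensation would start here
        step += 1
        if step < len(STEPS):
            commands.append(STEPS[step])  # the orchestrator picks the next step
    return "completed", log


outcome, log = run_saga()
```

Only the orchestrator knows the order of the steps; each participant sees a single command and answers with a single reply.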

This type of Saga is a natural evolution from the naive implementation because it can be incrementally adopted.

Orchestrator

The orchestrator, or transaction manager, is a coarse-grained service that exists only to facilitate the Saga. It is responsible for coordinating the global transaction flow, that is, communicating with the services involved in the transaction and orchestrating the necessary compensating actions. The orchestrator is aware of the globally distributed transaction, but the individual services are only aware of their local transactions.

Message broker

A service’s local ACID transaction should ideally consist of two steps:

  1. Local business logic
  2. Notify broker of its work done

Instead of calling another service in the middle of the transaction, let the service do its job within its own scope and publish its status through a message broker. That’s all. No long, synchronous, blocking call somewhere in the middle of the transaction. You can use any message broker (for example, Event Hubs or Kafka) depending on your needs and your cloud platform.

Event sourcing

To ensure that the two steps are in a single ACID transaction, we can make use of the Event sourcing pattern. When we write the result of the local transaction into the database, the work done message is included as part of the transaction as well, into an event store table.
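As a small sketch of writing the business result and the work-done message in one local transaction, the example below uses Python's built-in sqlite3 module as a stand-in database. The orders and events tables, the OrderCreated event type and the create_order function are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, type TEXT, payload TEXT)")


def create_order(conn, order_id, fail=False):
    """Insert the order and its event atomically: both rows or neither."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "created"))
        if fail:
            raise RuntimeError("simulated crash before the event is written")
        conn.execute(
            "INSERT INTO events (type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def count(conn, table):
    # `table` is one of the two fixed names above, never user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


create_order(conn, "order-1")
try:
    create_order(conn, "order-2", fail=True)
except RuntimeError:
    pass  # the order row for order-2 was rolled back together with its event
```

Because both inserts share one transaction, there is never an order without its event or an event without its order. A separate process can then read the events table and forward the rows to the broker.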

NOTE: Applications persist events in an event store, which is a database of events. The store has an API for adding and retrieving an entity’s events. The event store also behaves like a message broker. It provides an API that enables services to subscribe to events. When a service saves an event in the event store, it is delivered to all interested subscribers.

Compensation

Once a service has done its work, it publishes a message to the broker (this could be a success or a failure message). If the Payment service publishes a failure message, then the orchestrator must be able to “roll back” the actions done by the Order and Inventory services.

In this case, each service must implement its own compensating method. The Order service, which provides an OrderCreate method, must also provide an OrderCancel compensating method. The Inventory service, which provides a ReserveStock method, must also provide a ReleaseStock compensating method. The Payment service, which provides a Pay method, must also provide a Refund compensating method.

The orchestrator then listens for failure events and publishes the corresponding compensating events. The image above shows how the orchestrator publishes the respective compensation events and how each service rolls back its operation to compensate for the failed payment request.
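The compensation flow can be sketched as follows, using the method names from this section. The SAGA list, the run_with_compensation function and the execute callback are hypothetical; execute stands in for sending a command and waiting for the reply, and compensations are assumed to succeed.

```python
# Each step is paired with the compensating action that undoes it.
SAGA = [
    ("OrderCreate", "OrderCancel"),
    ("ReserveStock", "ReleaseStock"),
    ("Pay", "Refund"),
]


def run_with_compensation(execute):
    """Run the steps in order; on failure, undo the completed steps in reverse."""
    done = []     # compensations for the steps that succeeded so far
    history = []  # every action attempted, in order
    for action, compensation in SAGA:
        history.append(action)
        if execute(action):
            done.append(compensation)
        else:
            # The failed step changed nothing, so only earlier steps are undone.
            for undo in reversed(done):
                execute(undo)
                history.append(undo)
            return "compensated", history
    return "completed", history


# Simulate a payment failure: every action succeeds except Pay.
outcome, history = run_with_compensation(lambda action: action != "Pay")
```

With the simulated payment failure, the stock is released first and the order is cancelled last, the reverse of the order in which the steps were executed. Refund is never called because Pay did not succeed.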

Conclusion

The Saga pattern is not a way to apply a “traditional transaction” at the level of a distributed system. Rather, it models the transaction as a state machine, with each service’s local transaction acting as a state transition function.

It guarantees that the transaction is always in one of the many defined states. In the event of network disruption, you can always fix the problem and resume the transaction from the last known state.
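To illustrate the state machine view, here is a minimal sketch with a transition table. The state and event names are made up for the e-commerce example and are not taken from any library. The point is that a recorded list of events is enough to recompute the last known state and continue from there.

```python
# Hypothetical states and events for the order saga.
TRANSITIONS = {
    "Started": {"OrderCreated": "OrderPending"},
    "OrderPending": {
        "StockReserved": "StockReserved",
        "StockFailed": "OrderCancelled",
    },
    "StockReserved": {
        "PaymentSucceeded": "Completed",
        "PaymentFailed": "StockReleased",
    },
    "StockReleased": {"OrderCancelled": "OrderCancelled"},
}


def advance(state, event):
    """Return the next state, or reject events that are invalid in this state."""
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


def replay(events, state="Started"):
    """Rebuild the current state from the recorded events."""
    for event in events:
        state = advance(state, event)
    return state


# After a disruption: recover the last known state, then resume from it.
last_known = replay(["OrderCreated", "StockReserved"])
resumed = advance(last_known, "PaymentFailed")
```

Because every transition goes through the table, the saga can never end up in an undefined combination such as paid and stock released at the same time.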

I hope this will help!

NOTE: References taken from Microservices.io

Choose between Azure Event Grid and Event Hubs

This article gives a basic overview of these two services and helps us decide which one to choose for our application. Let us start with the basics of each.

Event Grid

Event Grid is a fully-managed event routing service and the first of its kind. Azure Event Grid greatly simplifies the development of event-based applications and simplifies the creation of serverless workflows. Using a single service, Azure Event Grid manages all routing of events from any source, to any destination, for any application.

It uses a publish-subscribe model. Publishers emit events, but have no expectation about how the events are handled. Subscribers decide which events they want to handle.
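The publish-subscribe model can be illustrated with a tiny in-memory sketch. This is not the Event Grid API: Topic, subscribe and publish are hypothetical names, and events are plain dictionaries with an assumed "type" field.

```python
class Topic:
    """Hypothetical in-memory topic with per-subscription event type filters."""

    def __init__(self):
        self.subscriptions = []  # (handler, accepted event types or None)

    def subscribe(self, handler, event_types=None):
        # event_types=None means the subscriber wants every event.
        self.subscriptions.append((handler, event_types))

    def publish(self, event):
        # The publisher does not know who handles the event, or whether anyone does.
        for handler, event_types in self.subscriptions:
            if event_types is None or event["type"] in event_types:
                handler(event)


shipping_inbox = []
audit_inbox = []

topic = Topic()
topic.subscribe(shipping_inbox.append, event_types={"ItemShipped"})
topic.subscribe(audit_inbox.append)  # no filter: receives everything

topic.publish({"type": "OrderPlaced", "order_id": "order-1"})
topic.publish({"type": "ItemShipped", "order_id": "order-1"})
```

The shipping subscriber only receives the ItemShipped event, while the audit subscriber receives both, and the publisher's code is the same in either case.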

Event Grid supports dead-lettering for events that aren’t delivered to an endpoint.

It has the following characteristics:

  • dynamically scalable
  • low cost
  • serverless
  • at least once delivery

Event Hubs

Event Hubs is a Big Data streaming platform and event ingestion service, capable of receiving and processing millions of events per second. Event Hubs can process and store events, data, or telemetry produced by distributed software and devices.

It facilitates the capture, retention, and replay of telemetry and event stream data. The data can come from many concurrent sources. Event Hubs allows telemetry and event data to be made available to a variety of stream-processing infrastructures and analytics services. It is available either as data streams or bundled event batches.

This service provides a single solution that enables rapid data retrieval for real-time processing as well as repeated replay of stored raw data. It can capture the streaming data into a file for processing and analysis.
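The retention and replay idea can be sketched with an append-only log and a consumer that keeps its own offset. This is a simplified in-memory model and not the Event Hubs SDK: PartitionLog and Consumer are hypothetical names, and a real partition also handles retention limits and persistent checkpoints.

```python
class PartitionLog:
    """Hypothetical append-only log: events are retained after being read."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset):
        return self._events[offset:]


class Consumer:
    """Tracks how far it has read; reading never removes events from the log."""

    def __init__(self, log):
        self.log = log
        self.checkpoint = 0  # offset of the next unread event

    def poll(self):
        events = self.log.read_from(self.checkpoint)
        self.checkpoint += len(events)  # checkpoint after processing
        return events


log = PartitionLog()
for reading in ("t=20.1", "t=20.4", "t=20.9"):
    log.append(reading)

realtime = Consumer(log)
first_batch = realtime.poll()   # all three readings
second_batch = realtime.poll()  # nothing new since the checkpoint

replayer = Consumer(log)        # a later consumer starts again from offset 0
replayed = replayer.poll()
```

Each consumer owns its position in the stream, which is what lets one reader process events in real time while another replays the same raw data later.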

It has the following characteristics:

  • low latency
  • capable of receiving and processing millions of events per second
  • at least once delivery

In some cases, we use the services side by side to fulfill distinct roles. For example, an e-commerce site can use Service Bus to process the order, Event Hubs to capture site telemetry, and Event Grid to respond to events such as an item being shipped.

Choose Event Grid when

  • Simplicity: It is straightforward to connect sources to subscribers in Event Grid.
  • Advanced filtering: Subscriptions have close control over the events they receive from a topic.
  • Fan-out: You can subscribe an unlimited number of endpoints to the same events and topics.
  • Reliability: Event Grid retries event delivery for up to 24 hours for each subscription.
  • Pay-per-event: Pay only for the number of events that you transmit.

Choose Event Hubs when

  • You need to support authenticating a large number of publishers.
  • You need to save a stream of events to Data Lake or Blob storage.
  • You need aggregation or analytics on your event stream.
  • You need reliable messaging or resiliency.

I hope this will help!

NOTE: Reference taken from Microsoft Learning Site