Event-based Monitoring and Reconciliation

In this article I will describe how we created our monitoring and reconciliation solution for a microservices-based billing system.

Problem

CloudBilling is a microservices-based system. That means it consists of many separate parts, each performing different operations on the data passed through them. And when you think not just about «blocks» but about «billing», it means you operate on large amounts of data that have to stay consistent, and their movement through the system has to be transactional.

In short, we cannot afford to lose any data between the start and the end of our billing chain. To make sure we have processed all of the data, or to find our weak spots, we need the ability to perform a reconciliation check. Moreover, since our system has to handle high load, we also need a way to monitor the volume of customer data being processed.

Searching for a solution

Let’s think about which ways we could go.

First, we could check the processed data in every service along the chain. It sounds logical, but imagine you have 5 or 10 million objects to process. With just 5 services in the chain, you would have to re-check about 25 million objects. Moreover, this approach means operating on customer data directly, which should only be touched by the target components. In fact, going this way, we would reinvent the billing itself.

Well, if we don’t want to build a huge solution, we need something lightweight and flexible. What if the check were performed not on the objects themselves, but on their footprints? Looking at these footprints, we can trace every object along its whole path. For that we need entries carrying some information: the service that operated on the object, the type and status of the operation, and the object’s IDs before and after the operation. It is also useful to know when the object entered our system, so we add a processing start timestamp as well. Essentially, these are trace events.
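As a rough sketch, such a trace event could be modeled like this (the field names and values here are illustrative, not our exact schema):

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class TraceEvent:
    """One footprint: which service touched the object, what happened, and when."""
    component: str            # service that performed the operation
    action: str               # type of the operation
    status: str               # status of the operation, e.g. "SUCCESS" or "FAILED"
    object_id_before: str     # object ID before the operation
    object_id_after: str      # object ID after the operation
    processing_start_ts: int  # when the object entered the system (epoch millis)

event = TraceEvent(
    component="rating-service",
    action="RATE",
    status="SUCCESS",
    object_id_before="obj-42",
    object_id_after="obj-42-rated",
    processing_start_ts=int(time.time() * 1000),
)
print(json.dumps(asdict(event)))
```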

Designing the events solution

As said before, we have strict requirements for transactions and consistency. That means a trace event must be created only when an operation has actually been performed, and it must be sent at the same moment the object moves to the next step. Luckily, our services communicate via Apache Kafka, so we can send the events (into a dedicated topic) and the objects in a single transaction.
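A minimal sketch of that atomic send, assuming a Kafka producer with the transactional API (for example `confluent_kafka.Producer` created with a `transactional.id` and initialized via `init_transactions()`); the topic names are illustrative:

```python
import json

def send_atomically(producer, obj: dict, trace_event: dict) -> None:
    """Send the billing object and its trace event in one Kafka transaction.

    `producer` is expected to expose the transactional API of a Kafka
    producer: either both messages are committed, or neither is.
    """
    producer.begin_transaction()
    try:
        # The object goes to the next service in the chain...
        producer.produce("billing-objects", json.dumps(obj).encode())
        # ...and its footprint goes to the dedicated trace topic.
        producer.produce("trace-events", json.dumps(trace_event).encode())
        producer.commit_transaction()
    except Exception:
        # On any failure, roll back both messages together.
        producer.abort_transaction()
        raise
```

Because the function only relies on the transactional producer interface, the same code works with any producer configured for exactly-once delivery.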

We also need reliable, mature software that can index our structured trace events with their many fields. Since the ELK stack is already used for log collection, it can index these events just as well.

To connect ELK and Kafka we need a configurable and reliable bridge, so we use Fluentd.
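The Fluentd pipeline could look roughly like this sketch, using the community Kafka and Elasticsearch plugins (topic, tag, host and index names here are illustrative, and most options are omitted):

```
<source>
  @type kafka_group            # fluent-plugin-kafka consumer
  brokers kafka:9092
  topics trace-events
  consumer_group fluentd-trace
  format json
</source>

<match trace-events>
  @type elasticsearch          # fluent-plugin-elasticsearch
  host elasticsearch
  port 9200
  index_name trace-events
</match>
```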

Combining Kafka, the ELK stack and Fluentd, we became able to look at our object footprints in Kibana (for internal purposes) and to get some statistics about them. Not bad for this stage.

Monitoring part

So, we have footprints that describe the movement of our objects through the whole system. That means that if we count these footprints by certain criteria, we can measure a component’s throughput in business entities.

Luckily, Elasticsearch has a REST API that includes count requests. Our criteria in this case are the component, the action (performed on the objects) and the timestamp interval. Collecting this data in cron mode over small time intervals, we get footprint counts for the selected point at every moment in time.
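For illustration, the body of such a count request (sent to Elasticsearch's `_count` endpoint) could be built like this; the field and index names are assumptions, not our exact mapping:

```python
import json

def build_count_query(component: str, action: str,
                      interval_start_ms: int, interval_end_ms: int) -> dict:
    """Build a body for POST /trace-events/_count: footprints of one
    component/action pair inside a half-open timestamp interval."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"component": component}},
                    {"term": {"action": action}},
                    {"range": {"processing_start_ts": {
                        "gte": interval_start_ms,
                        "lt": interval_end_ms,
                    }}},
                ]
            }
        }
    }

print(json.dumps(build_count_query("rating-service", "RATE", 0, 60_000)))
```

A cron job would call this for every configured point and a small sliding interval, storing the returned counts as time-series data.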

To make these counts more useful, we designed a UI and a REST API for them in our Monitoring service. Using the parameters received in a request, we can aggregate the data and provide meaningful information to users:

Reconciliation part

For reconciliation purposes we need a mechanism to check our footprints at every step of the chain. In this case the footprints share common fields: the entry ID and the processing start timestamp. If the number of footprints at the start point equals the sum over the end points, a report with a successful result is prepared. Otherwise, we collect and aggregate the data by the common fields, compare the number of checked points from our configuration with the end result, and finally find the failed point for every object. That failed point, together with the correlated ID, is added to the report.
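A simplified sketch of that check: group the footprints by the common fields (entry ID and processing start timestamp) and, for every object that never reached the end of the configured chain, report the last component it did reach. The data model here is illustrative:

```python
from collections import defaultdict

def reconcile(footprints: list, chain: list) -> list:
    """Return one report row {"id", "failed_at"} per object that did not
    reach the last component of the configured `chain`."""
    # (entry id, processing start ts) -> set of components the object reached
    seen = defaultdict(set)
    for fp in footprints:
        seen[(fp["id"], fp["processing_start_ts"])].add(fp["component"])

    report = []
    for (obj_id, _), components in seen.items():
        if chain[-1] not in components:
            # the failed point is the last chain step the object did reach
            reached = [c for c in chain if c in components]
            report.append({"id": obj_id, "failed_at": reached[-1]})
    return report
```

For example, an object with footprints from `ingest` and `rating` but not from the final `invoicing` step would be reported as failed at `rating`.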

Data is observed non-stop, but to perform reconciliation we need to check a specific time period (analyzed period + buffer period). That means we check every N-minute batch, but also give objects some time to be fully processed:
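The checked interval can be derived as in this small sketch: we reconcile the batch that ended one buffer period ago, so late objects still have time to finish (the parameter names are illustrative):

```python
from datetime import datetime, timedelta

def reconciliation_window(now: datetime,
                          analyzed: timedelta,
                          buffer: timedelta):
    """Return (start, end) of the batch to reconcile: the `analyzed`-long
    interval that ended `buffer` ago."""
    end = now - buffer
    start = end - analyzed
    return start, end
```

With a 10-minute analyzed period and a 5-minute buffer, a run at 12:00 would reconcile the footprints from 11:45 to 11:55.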

When all calculations are done, the reports are prepared. It’s better to see it in our UI:

If a user wants to see whether something went wrong, they can download a CSV report with the object ID and the failed component’s name:

Conclusion

Complex systems require lightweight but powerful solutions inside them. We designed and implemented an idea that not only satisfies our functional requirements, but also respects the customer data.

By using well-known technologies with accessible documentation, we reduced the risk of support problems. Moreover, since our solution depends not on the type of customer data but only on the footprints, it is flexible enough for different kinds of information systems.

P.S.

I also prepared a talk for a meet-up (in Russian), which is available on YouTube: