Business Problem

The client’s enterprise landscape comprised multiple upstream and downstream systems that together fulfilled a business function. Several issues hindered effective triaging across the cross-functional microservices teams:

  • No standard procedure for log entries and error reporting across the cross-functional teams
  • A lack of standardization in logs and ineffective use of log analytics tools such as Splunk meant the client’s teams spent significant time analyzing an issue before it was assigned to the right team. This delay often affected critical business activities
  • No unified way to track and visualize a request from the source service to the leaf node in the service hierarchy (L0 to L5)
  • No visibility into significant latency issues in the microservices

The Solutions

Our team thoroughly analyzed the logging system and implemented the following changes:

Enhanced Log Triage

Built effective triaging dashboards in Splunk using its built-in indexing and analytics features, fed by log entries captured through a common logging framework.
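
As an illustration only, a logging framework of this kind could emit every entry as a single JSON object so that a tool such as Splunk can index the fields and a dashboard can filter on them. The sketch below uses Python's standard logging module. The field names (correlation_id, service, owner_team, tier) and the helper names are assumptions made for the example, not the client's actual schema or code.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Hypothetical triage fields, supplied by callers via `extra=`.
            "correlation_id": getattr(record, "correlation_id", None),
            "service": getattr(record, "service", None),
            "owner_team": getattr(record, "owner_team", None),
            "tier": getattr(record, "tier", None),
        }
        return json.dumps(entry)


def build_logger(name):
    """Return a logger that writes JSON entries to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


def format_event(level, message, **fields):
    """Format one event without emitting it (handy for checks)."""
    record = logging.LogRecord("demo", level, "demo.py", 0, message, None, None)
    for key, value in fields.items():
        setattr(record, key, value)
    return JsonFormatter().format(record)


if __name__ == "__main__":
    log = build_logger("inventory-service")
    log.error(
        "Inventory lookup failed",
        extra={
            "correlation_id": "req-123",
            "service": "inventory-service",
            "owner_team": "inventory-team",
            "tier": "L2",
        },
    )
```

With entries shaped like this, a dashboard could group errors by owner_team to route an issue, or filter on correlation_id to follow one request across services.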

Latency Optimization

Identified high-latency downstream APIs and resolved latency issues by tracking the contribution of each service to the overall latency of its hierarchy.
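
One way to picture the contribution of each service is to subtract the time spent in its downstream calls from its own total duration. The sketch below does this for a single request. The record layout (service, parent, duration_ms), the service names, and the timings are hypothetical, and the calculation assumes each service appears once per request and calls its downstream services sequentially.

```python
from collections import defaultdict


def self_time_by_service(spans):
    """Return each service's own share of the latency, in milliseconds.

    Each span is a dict with hypothetical keys: `service`, `parent`
    (None for the L0 entry point), and `duration_ms`, the total time
    including downstream calls.
    """
    child_total = defaultdict(float)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["duration_ms"]
    # Own time = total time minus time spent waiting on direct children.
    return {
        span["service"]: span["duration_ms"] - child_total[span["service"]]
        for span in spans
    }


def slowest_service(spans):
    """Return the service with the largest own share of latency."""
    shares = self_time_by_service(spans)
    return max(shares, key=shares.get)


if __name__ == "__main__":
    # Made-up timings for one request flowing through four services.
    request_spans = [
        {"service": "gateway", "parent": None, "duration_ms": 900},
        {"service": "orders", "parent": "gateway", "duration_ms": 850},
        {"service": "inventory", "parent": "orders", "duration_ms": 700},
        {"service": "pricing", "parent": "orders", "duration_ms": 100},
    ]
    print(self_time_by_service(request_spans))
    print(slowest_service(request_spans))
```

In this made-up request the gateway and orders services each add little time of their own, so attention would go to the inventory service.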

Dynamic Event Sequencing

Built a dynamic sequence of events to support system design and architectural improvements, reducing the need to maintain sequence diagrams by hand.
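
A minimal sketch of the idea, assuming each log entry records which service called which: the entries for one request are selected by a shared identifier, ordered by time, and rendered as the lines of a sequence diagram. The field names (correlation_id, timestamp_ms, source, target, operation) and the sample events are hypothetical.

```python
def sequence_for_request(events, correlation_id):
    """Return the ordered call sequence for one request as text lines.

    Each event is a dict with hypothetical keys: `correlation_id`,
    `timestamp_ms`, `source`, `target`, and `operation`.
    """
    selected = [e for e in events if e["correlation_id"] == correlation_id]
    selected.sort(key=lambda e: e["timestamp_ms"])
    return [
        f'{e["source"]} -> {e["target"]}: {e["operation"]}' for e in selected
    ]


if __name__ == "__main__":
    # Made-up log events, deliberately out of order and mixed across requests.
    log_events = [
        {"correlation_id": "req-1", "timestamp_ms": 30,
         "source": "orders", "target": "inventory", "operation": "reserveStock"},
        {"correlation_id": "req-2", "timestamp_ms": 15,
         "source": "gateway", "target": "orders", "operation": "getOrder"},
        {"correlation_id": "req-1", "timestamp_ms": 10,
         "source": "gateway", "target": "orders", "operation": "createOrder"},
    ]
    for line in sequence_for_request(log_events, "req-1"):
        print(line)
```

Because the sequence is derived from the events themselves, it reflects the calls that actually happened rather than a diagram that may have gone stale.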

Value Delivered

Drill-down capability to track all logs. The cross-functional, geographically distributed teams found it very effective and were able to identify the team responsible for resolving each issue.

The turnaround time for issue resolution was reduced from multiple days to minutes.

The dynamic performance metrics helped identify and resolve bottlenecks with no manual effort spent on identifying them.