Challenge

The Center for Machine Learning supports several lines of business (LOBs) at an enterprise level. Some of these LOBs run critical systems that require very high availability, and that level of service brings its own challenges. The main ones were implementing more relevant service level indicators (SLIs) and service level objectives (SLOs) for all tenants of the platform, standardizing approaches to resolving incidents, and minimizing the downstream impact of incidents. Left unaddressed, these issues could lead to degraded service quality, inconsistent reporting, and a drop in productivity, all of which the team wanted to avoid. Proper documentation of critical systems is an especially pressing concern, because if a system goes down, a regulatory process may not run and key information may not reach the appropriate parties promptly. For all of these challenges, we provided support, suggested improvements, and drew up action plans to mitigate them.

Solution

Ippon provided the Center for Machine Learning with a "team in a box" composed of several site reliability engineers (SREs). These SREs took on the challenge of overseeing the adoption of SLI and SLO metrics for users of the platform, setting up dashboards that accurately reflect those metrics, redefining the incident support model, and gathering data about all critical services on the platform. As a result of the SREs' work, the Center for Machine Learning at a large financial institution has better metrics for tracking and reporting incidents.

Once an MVP was demoed and approved, engineers met with team leads, senior engineers, and POs to determine which data was relevant and critical enough to put in New Relic dashboards. Only key SLIs for each service were included so that senior leadership could understand the health of the platform at a glance. Metrics were collected from New Relic agents, CloudTrail, Snowflake, and ServiceNow.

Redefining the support model began with determining the scope of support. Ultimately, it was decided that support included dedicated Slack support channels, email notifications, PagerDuty alerts, proper ticketing, escalation handling, defined areas of responsibility, and post-mortems. All of these aspects had to be defined in a way that covered the majority of issues faced by the platform while leaving enough flexibility for engineers to pivot as needed when triaging incidents.

Benefits

As a result of the SRE work within the Center for Machine Learning, the support model was reworked to incorporate more LOBs on the platform. Once the New Relic dashboards were running in production, incident management gained higher visibility. Stakeholders can now track the SLOs for individual entities and use them to prioritize platform improvements.
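
To illustrate how SLO tracking can feed prioritization, here is a minimal sketch of an availability SLI and its remaining error budget. It is not the team's actual implementation and does not use the New Relic API; the `ServiceWindow` type, the function names, and the sample numbers are all hypothetical, and the request-based availability SLI is an assumed choice of indicator.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one service over one reporting window (hypothetical shape)."""
    name: str
    good_requests: int
    total_requests: int
    slo_target: float  # assumed availability target, e.g. 0.999


def availability_sli(w: ServiceWindow) -> float:
    """Fraction of requests that succeeded; an idle window counts as fully available."""
    if w.total_requests == 0:
        return 1.0
    return w.good_requests / w.total_requests


def error_budget_remaining(w: ServiceWindow) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    failures = w.total_requests - w.good_requests
    allowed_failures = (1.0 - w.slo_target) * w.total_requests
    if allowed_failures <= 0:
        # No budget to spend: either no traffic or a 100% target.
        return 1.0 if failures == 0 else 0.0
    return 1.0 - failures / allowed_failures


def rank_by_budget(windows: list[ServiceWindow]) -> list[ServiceWindow]:
    """Order services so those with the least remaining budget come first."""
    return sorted(windows, key=error_budget_remaining)
```

Under these assumptions, a service that has spent most of its error budget sorts to the front of the list, which is one simple way to turn SLO data into a prioritization signal for platform improvements.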

COMPANY DETAIL

This Top 10 U.S. bank is a leading player in banking and financial services, specializing in credit cards, loans, banking, and savings. Known for its innovative, technology-driven approach, the bank leverages data analytics and digital platforms to provide personalized solutions, simplifying banking and improving accessibility. Its commitment to the customer experience has made it a formidable force in modern finance.
