AI Observability For Hybrid Cloud

In the world of DevOps, Cloud, and Platform Engineering, the whole idea of AI is to take the low-hanging fruit off your plate. It's a very similar concept to when automation for Sysadmins and Infrastructure Engineers became more and more popular.

One piece of "low-hanging fruit" is manually observing environments.

In this blog post, you'll learn why you'd want to use AI for observability and about a tool/platform that can help you get the job done.

Why The Observability AI Push

Let's walk through a standard Site Reliability Engineering (SRE) scenario. You're on-call over the weekend, go to sleep around 9:00 PM, and all of a sudden get woken up at 2:00 AM to fix an issue. You see a few alerts come in, go through your monitoring and observability software, and form a few ideas about how to understand and potentially fix the problem.

💡
Please note: AI doesn't remove the need to troubleshoot. You still have to troubleshoot, but you can retrieve the data that you need to troubleshoot faster and more effectively.

By the time you find any data that'll actually help you (assuming that it's not a quick fix), you're a few hours in. You're tired, groggy, and not at 100%, so you can't fix the issue to the best of your abilities.

Fast forward to observability with AI. Spoiler alert: you still have to wake up at 2:00 AM (unless you have proper observability practices in place that perform specific actions against the traces, logs, and metrics you receive), but you're able to reach a conclusion about how to solve the problem much faster. The goal of a Model, in this instance, is to comb through the data far more quickly and efficiently than a human can. It's minutes compared to hours. Once you have the solution, you can go in and properly implement it (bonus points if you create an automated workload to solve the problem for you next time).

Enter Selector AI.

Selector AI Breakdown

AI, as mentioned in the opening of this blog post, is thought of as an "automation 2.0" experience in the world of Cloud/DevOps/Platform Engineering. Selector AI helps implement the monitoring and observability piece.

💡
Selector AI also has specific tooling around network and infrastructure observability.

The most helpful piece from an "automation 2.0" perspective is root cause analysis (RCA) of an incident. The goal of Selector AI is to give you far better insight into a particular problem instead of you having to fumble around in logs and traces manually to find it.

A few key features of Selector AI are:

  1. A digital twin that holds operational information about the system.
  2. Less need for the tens of dashboards found in standard monitoring and observability solutions.
  3. A meaningful answer for problems that are occurring in your environment. There should be no guessing.
  4. Copilot/chatbot-related work. For example, you can ask Selector AI "what are the top 5 issues that have happened in the past 12 hours".

💡
Selector AI is not doing APM. It's handling all observability at the network and systems layer. It can also show things like requests to systems, databases, Kubernetes, etc.
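
To make the "top 5 issues in the past 12 hours" question from the feature list concrete, here's a minimal sketch of the kind of aggregation such a question boils down to. This is not Selector AI's implementation or API. The alert records, the field names (`name`, `timestamp`), and the `top_issues` function are all hypothetical and exist only for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def top_issues(alerts, now, hours=12, limit=5):
    """Return the `limit` most frequent alert names seen in the last `hours` hours."""
    cutoff = now - timedelta(hours=hours)
    # Keep only alerts inside the time window, then count them by name.
    recent = (a["name"] for a in alerts if cutoff <= a["timestamp"] <= now)
    return Counter(recent).most_common(limit)


# Hypothetical alert records; a real platform would pull these from ingested data.
now = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)
alerts = [
    {"name": "bgp_session_down", "timestamp": now - timedelta(hours=1)},
    {"name": "high_cpu", "timestamp": now - timedelta(hours=2)},
    {"name": "bgp_session_down", "timestamp": now - timedelta(hours=3)},
    {"name": "disk_full", "timestamp": now - timedelta(hours=20)},  # outside the window
]

print(top_issues(alerts, now))
```

The value a chatbot adds on top of this is that you ask the question in plain language instead of writing the query yourself.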

The primary method to get these actionable insights with Selector AI is to utilize the forecasting feature. By training Models against your environment, you get a forecast that helps you understand the issues that could potentially occur or that have already occurred. The forecasting is driven by the Models that Selector AI uses on the backend. A good way to think about it is that Selector AI sits on top of your data.
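
If forecasting feels abstract, here's a deliberately simple sketch of the idea: fit a trend to a metric and project it forward to see whether it's heading toward a threshold. Selector AI's actual Models aren't described in this post, so this uses a plain least-squares line as a stand-in. The metric values, the 80% threshold, and the `linear_forecast` function are assumptions made up for the example.

```python
def linear_forecast(values, steps_ahead):
    """Fit a least-squares line to evenly spaced samples and extrapolate it."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    # Slope = covariance(x, y) / variance(x), with x = 0, 1, ..., n - 1.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    # Project `steps_ahead` samples past the last observed point.
    return intercept + slope * (n - 1 + steps_ahead)


# Hypothetical hourly disk usage percentages.
disk_used_pct = [60.0, 62.0, 64.0, 66.0, 68.0]
predicted = linear_forecast(disk_used_pct, steps_ahead=6)
will_breach = predicted >= 80.0  # assumed alerting threshold

print(f"Forecast in 6 hours: {predicted:.1f}% (breach expected: {will_breach})")
```

A real forecasting Model would account for seasonality, noise, and many signals at once, but the goal is the same: flag the problem before it wakes you up.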

Selector AI has two forms of observing an environment:

  • Pull monitoring and observability data from existing tools.
  • Use Selector AI without other monitoring and observability tools.

To wrap up this section, the biggest component to point out is the chatbot feature. It's a big part of what Selector AI is doing. The goal is that you go into Slack, ask a particular question about your network, and get an answer.

Selector AI Implementation

Now that you know a bit about Selector AI from a theoretical perspective and the "why" behind it, let's talk about what the implementation will look like.

The first step is to integrate Selector AI with your Slack organization. This is where the ChatOps integration comes into play. It's like getting an alert within your Slack channel(s). From there, you can click on the alert and you'll see a graph automatically generated for you based on the data that was ingested for the particular problem that occurred.
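
To give a feel for what an alert landing in a Slack channel looks like under the hood, here's a minimal sketch that uses Slack's generic incoming-webhook mechanism. This is not how Selector AI's own Slack integration is configured, since that isn't covered in this post. The alert fields, the function names, and the webhook URL you'd pass in are all assumptions.

```python
import json
import urllib.request


def build_alert_message(title, severity, summary):
    """Build a Slack incoming-webhook payload for a single alert."""
    return {
        # Plain-text fallback used in notifications.
        "text": f"[{severity.upper()}] {title}",
        # A Block Kit section that renders the alert with basic formatting.
        "blocks": [
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": f"*{title}*\n{summary}"},
            }
        ],
    }


def send_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook (defined but not called here)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical alert; the device name and numbers are made up.
message = build_alert_message(
    title="Packet loss on core-router-1",
    severity="critical",
    summary="Loss above 5% for 10 minutes.",
)
print(json.dumps(message, indent=2))
```

To actually post the message, you'd call `send_to_slack` with the webhook URL that Slack generates for your workspace.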

When it comes to data ingestion, you have two options for the assets that you want to monitor and observe: pull their data in from your existing observability or data analytics tools, or use Selector AI itself as the monitoring and observability solution.

Once the data is ingested, an existing Model is trained on it to give you actionable insights.