Chaos Theory and Observability – Gigaom

Can observability deal with the IT chaos facing so many enterprises today? It’s a question worth digging into.

IT Chaos (Monitoring, Observability, and Intelligence)

IT chaos is a function of monitoring, observability, and intelligence. Yes, I added intelligence, but I’m not talking about artificial intelligence (AI), at least not yet. Just as monitoring has generated more data than humans can consume, observability can produce more observations than anyone can understand. The overload is particularly acute when multiple observation tools come into play.

Machine learning can help, but the questions we want to answer are changing. Once, we wanted to know if services in a public cloud worked and how to merge that data with the on-premises noise. Now, the questions have changed to what to do about the observations. Automation allows restarting poorly performing items and expanding memory or computing power on demand, but you have to store the data somewhere, and storage is not free. Leading observability solutions now include real-time cost comparisons between cloud vendors. The best observability tools have financial operations (FinOps) abilities to find underused, overused, and abandoned resources in clouds (public or private).

Observability tooling has enough data to predict future states. Unfortunately, chaos theory does not help. Data at the element level does not exist at the observability level. Regression analysis, least-squares fits, and more complicated algorithms allow the prediction of chaos. The more data available, the more accurate the predictions, but storing data is costly. Vendors are addressing the issues with consumption-based licensing, lower-cost storage tiers, and other methods to deal with the wave of data needed for observability.
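As a rough illustration of the least-squares idea, here is a minimal sketch in plain Python that fits a straight line to evenly spaced metric samples and extrapolates a few steps ahead. The function names and the disk-usage numbers are hypothetical, and real observability tools use far richer models than a single trend line.

```python
def least_squares_line(ys: list[float]) -> tuple[float, float]:
    """Fit y = slope * x + intercept to samples taken at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard closed-form least-squares estimates for a straight line.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(ys: list[float], steps_ahead: int) -> float:
    """Extrapolate the fitted line steps_ahead samples past the last one."""
    slope, intercept = least_squares_line(ys)
    return slope * (len(ys) - 1 + steps_ahead) + intercept


if __name__ == "__main__":
    # Hypothetical daily disk-usage percentages, not real measurements.
    disk_usage = [50.0, 52.0, 54.0, 56.0]
    print(f"projected usage in 2 days: {forecast(disk_usage, 2):.1f}%")
```

A straight-line fit like this only captures a steady trend. More history usually tightens the fit, which is exactly where the storage cost discussed above comes in.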

IT chaos will never end, but at least we can try to manage it. The new hope is generative AI (GenAI)—maybe.

Chaos, Observability, and Artificial Intelligence

The chaos function contains the steps from monitoring to observability to intelligence and requires new approaches to answer questions. Monitoring tells us the state of items, observability can create relationships and provide a meta view of the elements, and intelligent questions are possible with the help of GenAI.

Ask an observability tool when the next outage will occur, and you may get an answer. Ask it to automate a known failure mode, and it performs a perfect dance. Ask an observability tool if the enterprise is OK, and you get nothing. The question is beyond its capabilities. Observability tools as they exist today focus on IT, including developers in DevOps pipelines, operations management team members working to keep the lights on, and the newly coined (by my more than 40-year standard) site reliability engineers (SREs). Observability explains the data from monitoring.

Enter GenAI, the big rock in the pond creating its version of chaos. In chaos theory, a single element can tip an entire system over the edge. The math makes this abundantly clear (I’ll get to that in a moment). So, what happens next?

GenAI is already improving IT, from better chatbots to consuming all the data and providing remarkable insights. Yet GenAI is brand new and disruptive. Few observability vendors are using it to significant effect now, and a smaller number can predict the impacts in 24 to 26 months.

Observability can slow the devolution into chaos, pointing to a calmer IT environment with GenAI somewhere in the future. Actual intelligence for the enterprise comes when GenAI consumes data from every source in the company, allowing unthinkable questions and a future where the tsunami of GenAI-created change does not disrupt the company.

Chaos Theory: What Is It?

I’ve mentioned chaos theory a few times. Let’s look into what it is. Chaos theory is a popular trope that allows writers to invent seemingly impossible situations the protagonists must overcome or to base an entire story concept on moving a single item. If any large-scale, easily conceived system can be said to embody chaos, then information technology stands out. Chaos is the normal state of IT, particularly in large enterprises. I’m going to lay out the math for you.

Hold on. Why am I writing about mathematics in an IT blog?

I’m a physicist, and though I’ve been doing IT for over 40 years, I rely on my education for even the most mundane things. Observability and chaos theory are related—the how and why are essential when we look at the entire enterprise. I could have used entropy, but chaos theory is sexier and closer to the reality of an IT ecosystem. Now, to the esoteric math discussion.

Chaos theory has equations that help mathematicians and physicists analyze the systems under study. In 1975, Robert May created a model, known as the logistic map, to demonstrate the chaotic behavior of dynamic systems. I have modified May’s model for incidents:

I_{n+1} = r • I_n • (1 – I_n)

- I_n
  - The proportion of the system’s capacity affected by incidents at a given time. It can reflect the number of incidents, their severity, or the total impact on the system, with the value ranging from zero (no impact) to one (full impact or system-wide failure).
  - In a perfect world, this is always zero, but this is about IT, where the value is never zero. Oh, but we do try hard. NASA has some of the best methods and processes anywhere, but the first place they looked after the Challenger explosion was the range safety code, which can blow up the shuttle. It was deemed perfect after a multimillion-dollar, line-by-line examination.
- r
  - This represents the rate of incident generation and resolution, influenced by factors such as system complexity, change frequency, and the effectiveness of incident management processes. High values indicate a system where incidents are rapidly generated or poorly resolved, leading to a more chaotic system. Lower values suggest a stable system where incidents are effectively managed or are infrequent.
  - In another perfect world, perhaps in the multiverse, this would be equal to or less than one. In this same universe, pigs fly, and nothing ever breaks. I’m sure other strange things happen in this utopia to take the shine off the whole perfection thing.
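
To make the equation concrete, here is a minimal Python sketch that iterates the incident model for a few values of r. The starting value of 0.1, the step count, and the r values are illustrative assumptions, not measurements from any real system.

```python
def logistic_step(i_n: float, r: float) -> float:
    """One step of the incident model: I_{n+1} = r * I_n * (1 - I_n)."""
    return r * i_n * (1.0 - i_n)


def simulate(i_0: float, r: float, steps: int) -> list[float]:
    """Return the trajectory [I_0, I_1, ..., I_steps]."""
    trajectory = [i_0]
    for _ in range(steps):
        trajectory.append(logistic_step(trajectory[-1], r))
    return trajectory


if __name__ == "__main__":
    # Illustrative r values only; print the last four points of each run.
    for r in (0.9, 2.5, 3.2, 3.9):
        tail = simulate(i_0=0.1, r=r, steps=200)[-4:]
        print(f"r = {r}: " + ", ".join(f"{x:.3f}" for x in tail))
```

With r at or below one, the incident level decays toward zero. At r = 2.5 it settles at a steady 1 - 1/r = 0.6. At r = 3.2 it flips between two values, and at r = 3.9 it wanders with no repeating pattern, which is the chaotic regime.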

In another version of Earth, I can simulate every IT element to identify systems and processes on the precipice of chaos and magically heal them. IT does not create dinosaurs, except in the form of mainframe computers running COBOL.

OK, that isn’t happening, but I can monitor all those elements and gather state information (on or off), metrics (memory usage, CPU performance), and more. Then I can send all that information to a team to determine the system’s chaos level and respond accordingly.

Oops, BAM! We have another data glut (monitoring often accounts for 25% of network traffic in a large enterprise).

Observability strives to infer a system’s internal state from its external outputs. We have scads of data but no idea what it means. Observability tooling, whether specifically for public and private clouds, networks, storage, or applications, is a view into the chaos.

The Intersection of May’s Equation and Observability

May’s equation and observability intersect. Here’s how:

- Understanding system behavior: Both observability and May’s equation aim to enhance understanding of complex systems. Observability allows for real-time monitoring and knowledge of a system’s state based on outputs, while May’s equation shows how system behavior can change dramatically with slight parameter shifts.
- Predictability and stability: May’s equation highlights the limits of predictability in complex systems due to their sensitivity to initial conditions. Observability, in contrast, is a tool for gaining insight into the system. It increases predictability by allowing for early detection of minor issues before they escalate into significant problems. Thus, it helps keep the value of “r” above low enough that our system does not explode into chaos.
- Adapting to change: The logistic map in May’s equation shows how systems can transition from stable to chaotic regimes with a single parameter change. Observability provides the means to detect and respond to these transitions, offering a method to help manage and mitigate the risks of entering chaotic states.
- Feedback loops: Observability can act as a feedback mechanism in complex IT systems, identifying when a system is approaching a chaotic regime. This feedback can inform adjustments to system parameters to maintain desired performance and stability levels.
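
The feedback idea in the list above can be sketched in a few lines of Python: estimate r from consecutive observations of I_n, then report which regime that value falls in. The function names and the sample readings are hypothetical. The thresholds are the textbook ones for the logistic map (a steady state up to r = 3, oscillation above it, and chaos setting in near r ≈ 3.57), and real incident data would be far noisier than this model assumes.

```python
def estimate_r(observations: list[float]) -> float:
    """Average the r implied by each consecutive pair:
    r = I_{n+1} / (I_n * (1 - I_n))."""
    estimates = []
    for current, following in zip(observations, observations[1:]):
        denominator = current * (1.0 - current)
        if denominator > 0.0:  # skip I_n = 0 or 1, where r cannot be recovered
            estimates.append(following / denominator)
    if not estimates:
        raise ValueError("need at least one pair with 0 < I_n < 1")
    return sum(estimates) / len(estimates)


def regime(r: float) -> str:
    """Classify r using approximate logistic-map thresholds."""
    if r <= 1.0:
        return "incidents die out"
    if r <= 3.0:
        return "stable steady state"
    if r < 3.57:  # approximate onset of chaos
        return "oscillating"
    return "chaotic"


if __name__ == "__main__":
    # Hypothetical I_n readings, generated from the model with r = 3.2.
    observed = [0.2, 0.512, 0.7995392]
    r_hat = estimate_r(observed)
    print(f"estimated r = {r_hat:.2f}, regime: {regime(r_hat)}")
```

In practice, the estimate would be recomputed over a sliding window so that a drift toward the higher regimes shows up before the system gets there.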

Technology impacts us almost everywhere—doctor visits, the news, social media, refrigerators, and even our cars (including gas-powered vehicles). The change in a single parameter can bring a company to its knees. Ask AT&T about a simple configuration change that brought their entire network down. Look into how British Airways had to cancel hundreds of flights because a software component failed after a simple change.

IT systems are always on the precipice of chaos. Observability tools are one way to examine every IT enterprise’s chaotic state.

Next Steps

To learn more, take a look at GigaOm’s cloud observability Key Criteria and Radar reports. These reports provide a comprehensive overview of the market, outline the criteria you’ll want to consider in a purchase decision, and evaluate how a number of vendors perform against those decision criteria.

If you’re not yet a GigaOm subscriber, you can access the research using a free trial.
