Tuesday 15 August 2023

A Prescriptive Path for Digital Resilience

The four-step process shown in this picture was used throughout Splunk’s recent .conf23 user conference. It appeared with different titles, different messages alongside the upward-swooshing arrow, and different elements of the product suite linked to the segments. The objective is to show a journey of continual improvement through enhancing an organisation’s ability to respond to problems.

Capability Maturity Models

A prescriptive path is essentially a Maturity Model, a mechanism defined in 1986 to measure an organisation’s ability to develop software reliably. The original model was developed at Carnegie Mellon for the US military as a means of assessing software suppliers. This classic model has five levels:

  1. Initial (sometimes called disorganised or chaos)
  2. Managed
  3. Defined
  4. Quantitatively Managed
  5. Optimizing

Most organisations were at level 1 — and most of them still are. The objective was to ensure timely delivery of large-scale software projects that met their requirements. Despite many people building complex processes, the sad reality is that building large software projects is genuinely difficult. Even when processes are applied rigorously, software grows organically and, despite everyone’s best efforts, results in a Big Ball of Mud. Generally these failures are project management antipatterns, though there is also a parody “immaturity” model which rings all too true.

A Maturity Model to Digital Resilience

The original maturity models were driven by the desire to deliver hugely complex software systems from sprawling specifications: the classic waterfall model. As such they are largely based on managing projects, things that have beginnings, middles and ends. A maturity model for digital resilience, however, has to be designed for continuous operation and to have objectives that are as clear as delivering functionality on time.

While the proposed journey reads well and ties into the company’s range of products, underneath it lacks the logical progression of the typical maturity model. Admittedly the term maturity model isn’t used, but the intent is clear. The four steps in the prescriptive path are:

  1. Foundational Visibility
  2. Prioritized Actions
  3. Proactive Response
  4. Optimized Experiences

The first three make sense, but the leap to optimization is premature, and different versions of the chart offer different forms of optimization. For the model to work across the whole product range, it needs to encompass the goals of all of them. It also needs to bridge the gap to optimization; while active response is important, it doesn’t in itself cover the full gamut of resilience.

Building a true Digital Resilience Maturity Model

As I have already mentioned, all organisations need to consider their digital resilience. A maturity model is a great way of providing a structure to assess and achieve it, and could even be used as a form of certification. As I’ve been considering Splunk’s prescriptive path over the last few weeks, I’ve been thinking about what the ultimate form of this model would be and to whom it would apply. Ideas are forming, but they are not yet ready for publication.

Of Course There Was AI at Splunk .conf23

Every product announcement in 2023 is required to include AI. It’s the law. However, most of them are vague, fluffy, and there just to keep the investors happy; only a few offer meaningful benefits. Splunk’s new offerings are in the latter category, eschewing wild claims and big noise while respecting the importance of humans and the real issues that SREs face in their daily working lives. And, refreshingly, they are not all generative AI.

Not AI Newbies

The new AI capabilities were presented by Min Wang, who became Splunk CTO in April 2023 after five years as an ED at Google, where she worked on the Google Assistant. While that consumer-focused product may seem only adjacent to SIEM (security information and event management), that’s not the point: it’s the experience of rapidly and easily assisting people that counts, augmenting their productivity.

This is also not Splunk’s first sortie into machine learning and AI, having launched Splunk ML in 2015. The company not only has deep AI-related experience, it already has a mass of data on which to draw, whether for threat and attack detection, automation or analysis. As a result, Splunk has domain-specific models built from a solid base, unlike a generic tool such as ChatGPT, which lacks any contextual knowledge.

Responsible AI, Really

The current boom in AI has led to a crazy amount of AI washing, irresponsible releases and endlessly exaggerated claims. There are a number of basic rules that should be applied when launching any AI product, and Splunk is one of the few organisations actually following them.

The first rule of responsible AI is to always keep a human in the loop. All too often, cost-fixated management try to use AI as a means of eliminating staff, often with catastrophic results. Since security and incident management are costs that don’t directly drive revenue, all too many executives consider them a grudge purchase.

This means that the Splunk AI tools do not take action directly, always confirming via a human what action, if any, should be taken. This approach maximizes the likelihood of the correct action being taken by combining human insight with the scale of machine detection, while minimizing the risk of humans missing something and machines jumping to the wrong conclusion.

Explain Yourself, Machine

Another rule that should be respected by AI systems, but all too often is not even considered, is auditability: the ability of the machine to explain why it did something or recommended a particular course of action. Just as with a human, you can ask the reasons behind its choices. This is particularly important when there is a chance of legal exposure or an insurance claim, where every little detail of an incident will be dissected and examined.

The Splunk AI Assistant is exemplary in this respect. Its main purpose seems to be helping SREs write custom filtering code in SPL2. Most AI-enabled tools simply belch out a bunch of stuff and expect the human to check whether it is correct.

The Splunk AI Assistant uses a conversational approach to go way beyond this. The user can ask it to create a query in plain language (presumably only English at present), and the tool returns the code with a line-by-line explanation of what it does and why. This serves three vital functions:

  • Providing an audit history that explains why the actions were taken.
  • Allowing experienced engineers to spot errors.
  • Training junior engineers in developing SPL2 scripts.
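To make this concrete, here is a sketch of the kind of exchange described above. The prompt, index and field names are entirely hypothetical, and the SPL2 is an illustrative sketch of the idiom rather than verbatim Assistant output:

```spl2
// Hypothetical prompt: "show me failed logins by user over the last hour"
// A response in the spirit of the Assistant's line-by-line explanations:
$failed_logins = from auth_events            // read from a (hypothetical) auth index
    | where action == "failure"              // keep only failed login events
    | stats count() AS failures BY user      // count failures per user
    | sort -failures                         // put the most-targeted accounts first
```

The value lies less in the query itself than in the commentary: an experienced engineer can verify each step, while a junior one learns the idiom.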

Most of us would say that handling an outage of any kind is a learning experience; with the AI Assistant, Splunk can turn any operation, routine or emergency, into an educational opportunity. This helps those who may be experienced operators but lack programming skills to develop an understanding of scripting. And given the shortage of staff in the area, it is a useful tool for skills transfer and for onboarding new SREs.

Revenge of the Command Line

I found it amusing that, amongst all the gorgeously designed, highly visual user experience tools offered by Splunk, the latest and greatest tool was essentially a return to the command line. A chat interface, however, turns the traditional command line on its head: the AI Assistant learns what the user wants and needs, not the reverse. No more figuring out the right command and its options, just use plain language.

AI and ML Transfusion

ML and AI capabilities are being rolled out across Splunk’s security, observability and platform products. Given the company’s existing ML heritage, this is no sudden move to keep the vultures on Wall Street happy, but part of a longer roadmap to AI-enable a broad range of capabilities. This careful approach enhances the product line where AI and ML will add the most benefit to security and reliability professionals.

Big Picture Thinking at Splunk .conf23

What technologies and capabilities, especially new ones, are behind Splunk’s anchor message Building Digital Resilience at their recent user conference, .conf23? The good news for Splunk and its customers is that there was plenty to back up the message. Even better, the new capabilities were wrapped in a message on tool consolidation and collaboration within teams. I’ll call that big picture thinking because it embraces a whole raft of different roles and circumstances where the product capabilities can be deployed.

Bringing IT Together

There were numerous small features announced that the Site Reliability Engineers (SREs) in the audience greeted with whoops of delight. One, met with applause and shouts, was unified identity for accessing both Splunk Cloud and Splunk Observability Cloud data. While you would probably have expected this to exist already, it demonstrates that the products are coming together to form a single suite.

Unifying SecOps on a single work surface is the objective of the latest iteration of Splunk Mission Control, and we saw some very nice demos. The user experience is delightfully consumer grade, which is important for many reasons, not just aesthetics:

  • It helps bring new people into the world of SRE by reducing the learning curve, a theme that applies to several of the new features.
  • It reduces error rates by making processes and information easy to follow and key data immediately visible.
  • It eliminates switching between multiple different experiences, shortening the time to find the trouble.

There really is no need for enterprise software to be dull and grey any more. Progress. Mission Control brings together Splunk Enterprise Security, Splunk Attack Analyzer, and Splunk SOAR. Combining these critical services reduces stress during incidents and allows teams to work together more easily the rest of the time.

Reflecting Hybrid Reality

The reality of digital infrastructure is that it is fragmented, poorly understood, and barely documented. Most organisations rely on a mix of on-prem systems, potentially in multiple locations, multiple cloud providers, and a range of SaaS products. This fragmentation adds to the complexity of running the estate, as well as creating many more edges than were previously present, with the potential to increase the attack surface.

Two quotes from Splunk CEO Gary Steele’s keynote stick in my mind: “you can’t secure what you can’t see” and “you can’t operate what you don’t know exists.” Documentation, where it exists, is usually wrong, sometimes dangerously so. The extension of technology into every branch of business means that things are constantly being changed, with the decentralization of IT budgets resulting in a proliferation of applications, devices and vendors.

This organic growth is why observability is so important, however the long and passive-sounding word observability itself doesn’t help. All too many IT people are reluctant to admit to not knowing what is on their network and most management are clueless about the whole thing. Observability deserves a dynamic upgrade with a more active, even aggressive, title.

Owning the Edge

Edges, furthest from central control, are all too often where things go wrong. As mentioned above, current fragmented, federated IT estates introduce many more edges over which data has to flow. Responding to this need, Splunk has launched two products.

The first is Edge Processor, a software appliance that implements data transformation pipelines for ingesting information. It uses the second generation of the Search Processing Language (SPL2), which provides continuity across the platform and lets teams reuse the same language and skills wherever data is processed.
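As a sketch of what such a pipeline might look like, here is a minimal SPL2-style transformation; the $source and $destination placeholders and the field names are assumptions for illustration, not taken from Splunk’s demos:

```spl2
// Hypothetical Edge Processor pipeline: drop noisy debug events and
// route the remainder to an index before they leave the edge.
$pipeline = | from $source                   // data arriving at the edge
    | where log_level != "DEBUG"             // filter noise close to the source
    | eval index = "edge_ops"                // tag events with a destination index
    | into $destination;                     // forward to Splunk for analysis
```

Filtering at the edge like this reduces both network traffic and downstream ingestion volume, which is much of the appeal of processing data where it is produced.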

The second is Edge Hub, a hardware appliance. Yes, hardware. Splunk is working with partners to deploy this device. It’s small but heavy, with a chunky heatsink on the back and a surprisingly bright touch-sensitive display panel on the front. This is not your classic hardware interface, but very much a consumer-grade experience. I am curious to see how it will go down with gnarly industrial engineers.

The device is a veritable Rosetta Stone of industrial protocols, as well as having built-in detectors for temperature, vibration, sound and video. These are designed to support a wide range of operational applications, from monitoring cabinets in a data center to the full industrial spectrum of conveyors, pumps and similar equipment.

The objective is to connect currently unconnected devices, essentially bridging the OT and IT worlds and allowing OT data to be added to the overall pool for analysis. This has the potential to surface all sorts of new trends and opportunities for optimization. It will be very interesting to see what comes from this; I can see immense possibilities for sustainability initiatives.

On the other hand, I can see the old antipathy between OT and IT resulting in disagreements and less than optimal implementations. This is where the partners come in. Working through partners is a wise move by Splunk: the selected partners already have the trust of the OT world, along with serious operational credentials that Splunk lacks.

Monday 7 August 2023

Building Digital Resilience

Digital resilience is a critical topic about which I've already written and spoken. Digital resilience should be a concern for every business as everything in every business now depends on digital technology. Unfortunately most business management are ignorant of their total dependency on something they fundamentally don't understand. It's just something that IT does, right? No, wrong. Completely and dangerously wrong.

I was recently a guest at Splunk's .conf23 user conference, a fabulous confection of announcements, technical sessions, customer stories, fez-wearing MVPs, and a giant inflatable pony called Buttercup. Splunk has grown from log analysis into what is now called SIEM, allowing technical teams to find what has gone, or is going, wrong in their networks, especially during security incidents.

While Splunk is a leader in SIEM tools, a growing part of its toolset is observability, where its leadership has also been recognised. These tools help businesses know what devices and applications are on their network, what data is flowing where, potential vulnerabilities, and other information that a naive business person might think was well known. The sad reality is that most businesses of any size have only a rough idea of what systems they have, how they are connected, and which parts are most important. Of course there's documentation, but it's out of date and mostly wrong. Some key individuals know in what way the documentation is wrong, but don't have the time to correct it; after all, they are key individuals. Sometimes they don't have the inclination either, preferring to keep that knowledge to themselves.

Clearly SIEM and observability belong together, with observability providing essential context for security and other incidents. I was, therefore, delighted to find that the big theme of Splunk .conf23 was how the combination was helping Build Digital Resilience. I will be writing about some of the new announcements in subsequent posts, but I was especially pleased to find the message of Digital Resilience at the core of the vision presented by President & CEO Gary Steele. This was reinforced by Gretchen O'Hara, VP of Partnerships and Alliances, who called digital resilience a board-level topic.

Unfortunately, all too few companies share her enlightenment. While legal and financial non-executive directors are the norm for many boards, there are precious few that see the importance of having independent technical expertise available at this level. The rash of ransomware and other cybersecurity incidents demonstrates the critical nature of our digital infrastructure. It's time for the importance of digital resilience to be fully recognised with a seat on the board.