New Best Practices in Virtual and Cloud Management: Performance, Capacity, Compliance, and Workload Automation

New Best Practices in Performance Management

A statement worth remembering. Also a statement too often ignored by many IT practitioners. In the giddy rush to implement virtualization for its benefits, many forget its hidden challenges. These challenges exist across the landscape of daily activities and are often so intertwined with the benefits of virtualization that the costs are easily missed.

Performance management, capacity management, compliance management, and workload automation—listen too carefully to the pundits and vendors, and you'll quickly believe virtualization by itself brings huge operational improvements to each of these activities.

Impressively, it does, but not necessarily all by itself. Virtualization offers the potential to improve these activities. It's the next step that most people forget: Translating that potential into actual improvement. Reaching that goal sometimes requires a little extra help.

The mission of this guide is to assist you in translating potential into concrete improvement. With virtualization having become so popular—some analysts suggest that more than 50 percent of all IT workloads are now virtual—it's time for another look at the best practices in virtual and cloud management. With an eye towards virtualization's original value propositions, this guide intends to illuminate the industry's new best practices, deconstructed into four fundamental activities: performance management, capacity management, compliance management, and workload automation.

Most importantly, this guide will help you recognize where virtualization's complexities go beyond the limits of human attention. To deliver on its promise of optimization, virtualization has to complicate a few things. It's managing the balance between complication and optimization that's become the newest task.

PerfMon Might Be a Joke…

In my role as an IT industry author and presenter, I get the opportunity to stand up in front of a lot of people. In the past 10 years, I've presented to countless thousands of IT professionals. The experience of these individuals spans organizations from enterprise to small and midsize business (SMB) and everything in‐between. These speaking opportunities present me with a lot of time to fill, so I often use that chance to poll audiences everywhere on the big questions I find personally enthralling.

"How do people manage system performance?" is one of those questions. I'm routinely amazed by the response. In those presentations, I often ask, "How many people here have turned on PerfMon on your Windows servers?" The answer, every time: No one. Or, on rare occasion, some statistically‐insignificant number of individuals that's very close to zero.

The audience response surprises me every time, so much so that the question's become a regular joke in such presentations. Its punch line: "So, when someone calls in and says, 'Hey, the mail server is slow today!' What do you tell them?"

Invariably, someone in the back quips, "Have you tried rebooting?"

…But It's Indicative of a Larger Problem

My story is intended to be humorous, but it's also intended to highlight a key failure that's endemic to many of our data centers: Without some measure of baseline monitoring, how can you tell what's different between today's behaviors and those from last week, last month, or last year? Simply put, you can't.
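To make the baseline idea concrete, here's a minimal sketch in Python. It assumes you've already been collecting periodic samples of some counter. The sample values and the 25 percent tolerance are invented for illustration, and the function names are my own rather than anything PerfMon provides.

```python
from statistics import mean

def percent_change(baseline_samples, current_samples):
    """Return how far the current average sits from the baseline average, in percent."""
    baseline = mean(baseline_samples)
    current = mean(current_samples)
    return (current - baseline) / baseline * 100.0

def differs_from_baseline(baseline_samples, current_samples, tolerance_pct=25.0):
    """True when the current average deviates from the baseline by more than the tolerance."""
    return abs(percent_change(baseline_samples, current_samples)) > tolerance_pct

# Hypothetical CPU utilization samples (percent), invented for illustration.
last_month = [20, 22, 18, 20]
today = [30, 32, 28, 30]

change = percent_change(last_month, today)          # 50.0
is_different = differs_from_baseline(last_month, today)
```

Without the `last_month` samples there is nothing to compare `today` against, which is exactly the point of the story.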

And yet out of the hundreds of audiences and thousands of people who've laughed at the punch line, still 10 years later almost nobody gets the joke. And they wonder why their virtual environments aren't performing to expectations.

I'll admit, PerfMon isn't a great tool for across‐the‐data center performance management nor is it even really the right tool. That said, I see its limited use as indicative of a much larger problem: We as an industry aren't practicing performance management.

That problem is perhaps a result of our industry's hardware successes. Our hardware improves so fast that we've defined a law, Moore's Law, which is still relevant 50 years past its creation. The hardware improvements that are observed by the law are undoubtedly great for our workloads, but they come with consequences that become greatly apparent as those workloads get virtualized. See if you agree with this assertion: Today's class of IT professionals has "grown up" during a period where computing resources were virtually limitless.

Take a look at the performance statistics on just about any desktop or server and you'll surely agree. Figure 1.1 shows a representative screenshot from a sample machine's Windows Task Manager. That machine's eight processor cores and 16GB of memory are all but unused in the processing of its workload.

Figure 1.1: Windows Task Manager.

In fact, some analysts suggest that average processor utilization across all IT workloads, across all industries, lies somewhere between 5 and 10 percent. This news shouldn't be earthshattering. IT's embrace of virtualization is a direct result of the desire to consolidate these low‐consumption workloads. By co‐locating many workloads atop a smaller number of physical hosts, virtualization aims to eliminate exactly these inefficiencies in resource consumption.
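The arithmetic behind consolidation can be roughed out in a few lines. This is only a back-of-the-envelope sketch: it assumes like-sized workloads and hosts, looks at processor utilization alone, and ignores peaks, memory, and hypervisor overhead. The 70 percent target and the count of 100 workloads are assumptions I've picked for illustration.

```python
import math

def workloads_per_host(avg_utilization_pct, target_host_utilization_pct):
    """Estimate how many like-sized workloads fit on one host at a target utilization."""
    return math.floor(target_host_utilization_pct / avg_utilization_pct)

def hosts_needed(workload_count, per_host):
    """Number of hosts required to carry all workloads at the given density."""
    return math.ceil(workload_count / per_host)

# Workloads averaging 10 percent utilization, hosts targeted at 70 percent (assumed).
per_host = workloads_per_host(10, 70)   # 7
hosts = hosts_needed(100, per_host)     # 15
```

Even this crude estimate shows why consolidation is attractive, and why the remaining headroom on each host shrinks as the target climbs.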

But remember, virtualization complicates, even as it optimizes. It is immediately after the initial consolidation that too many data centers stop, subconsciously ignoring the downstream effects that are a direct result of virtualization's goal. Through virtualization, a data center seeks to squeeze useful work out of every resource unit. Greater optimization means greater resource sharing, right on up to the point where resource demands are perfectly balanced with those in supply. And then, all too often, right on past that point.

Performance Management Requires Visibility

Effective performance management first requires paying attention to the behaviors going on inside a system. A running workload requires processor attention, memory for execution space, some quantity of storage, and a bit of network connectivity for communicating with clients and other servers.

Metrics associated with the consumption of these resources can be measured with tools such as PerfMon. From a position inside a computer instance, these tools convert the behaviors they see into numbers. Those numbers can then be compared with known thresholds to identify when a workload is attempting to do too much, or its hardware resources are in too short a supply. Figure 1.2 shows the perspective PerfMon (and others like it) have when they're watching workloads on a physical machine.

Figure 1.2: PerfMon's view into measuring system performance.

Locating the eyeball where it sits in Figure 1.2 works because of the relatively simple nature of physical machine resources. Processor, memory, storage, and networking resources in this situation are dedicated for use by a single workload. Those the workload doesn't use sit idle.

The apparent simplicity here is perhaps another reason more IT professionals ignore performance management. In an all‐physical data center, where each successive hardware generation is more powerful than the last, performance management is sometimes considered an unnecessary activity. Resources are generally in such great supply that a shortage is rarely a problem. Now add a hypervisor, and suddenly the picture grows far more complex.

Figure 1.3 documents the limitations of the eyeball placement in Figure 1.2. It shows a graphical representation of what a system looks like after it's been virtualized and colocated with another. Determining now where resources are being consumed is a bit more difficult. Processor cycles not consumed by the first virtual machine might be used by the second, or they might be completely unused. The same holds true for memory, storage, and networking resources.

Figure 1.3: The single‐system view is insufficient in a virtual environment.

"A‐ha!" says the well‐meaning virtual administrator, "PerfMon (and its ilk) has no value in virtual environments! It can't measure hypervisor activities! That's why I use my hypervisor management platform to monitor performance metrics across every virtual machine at once! By monitoring from the hypervisor's perspective, I can measure behaviors across every virtual machine."

Indeed, you can, and in fact, the vast majority of data centers employ a hypervisor management solution (most typically from their hypervisor's platform vendor) for managing configurations and monitoring performance metrics. What they get from such solutions tends to look like Figure 1.4.

Figure 1.4: A hypervisor platform's "PerfMon."

Do you recognize this graph, at least in concept? You've probably seen something similar in your hypervisor management solution. What information can you glean from its half‐dozen overlapping lines? What behaviors does it illuminate? Or, more specifically, exactly what action should an administrator take to resolve whatever behavior is being illustrated here?

Not entirely easy to answer these questions, is it? These are just another bunch of colorful lines. In fact, the data being communicated in Figure 1.4 looks eerily similar to another joke tool discussed earlier in this chapter. What do you do when someone calls in and says, "Hey, the virtual host is slow today!" What do you tell them?

Performance Management Requires Situational Awareness

It should be obvious at this point that the casual monitoring of raw metric data very quickly grows futile as an environment's interdependencies increase. At this point, we know that measuring performance from an individual virtual machine's perspective illuminates only a part of the story.

We are also beginning to recognize that moving the focus onto the hypervisor merely presents a second perspective (see Figure 1.5). The hypervisor isn't all‐knowing or all‐seeing. With the eyeball pointing at a virtual machine's operating system (OS) and its hypervisor, that view still misses a few key components of the overall virtual platform.

Figure 1.5: Merely a second perspective.

Storage and outside‐the‐hypervisor networking are two components being missed, as are the behaviors going on among hypervisors. Mission‐critical virtualization requires high availability and load balancing. It requires the elimination of single points of failure. It demands redundancy at every level to ensure component failure doesn't mean system failure. Each of these interconnections on their own can be a contributor to performance problems, and each requires independent management and monitoring.

Each interconnection also introduces yet another perspective on the resources that contribute to virtual machine demands. Storage, networking, hardware, the interconnecting fabric—consider how Figure 1.5's "other" layers can impact each other:

  • The storage might experience a resource shortfall that contributes to some upstream problem "felt" by the users of a virtual machine
  • The hypervisor might balloon out memory that's being actively used by a needy virtual machine process
  • The backplane of the switch being used for networking might become oversubscribed by storage traffic, reducing throughput for production networking

It stands to reason, then, that improving situational awareness for performance management tools requires also paying attention to behaviors at the other layers.

That's a lot of eyeballs, but the concept isn't new. As Figure 1.6 suggests, virtually every data center‐class piece of hardware exposes collectable metrics. So do hypervisors and hypervisor toolsets installed into each virtual machine. Storage reports metrics, as do network components. Heck, even the servers themselves expose environmental and other at‐the‐chassis data that can be merged into a more situationally‐aware view.

Figure 1.6: Monitors, everywhere.

Now capture these metrics with some unifying solution. Give that solution the task of collecting behaviors from every perspective, then crunching the numbers. Align the metrics by time, and suddenly that solution sees everything, everywhere, at once. A problem in the network that causes a problem in the storage, which in turn causes a downstream problem with a database server's performance, can be better correlated. As you can see, this eyeballs‐everywhere approach is better able to identify how behaviors at one level impact operations at another.
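Aligning metrics by time is the step that makes correlation possible, so it's worth a small sketch. The Python below assumes each source reports (timestamp, value) samples on its own schedule. It groups them into shared one-minute buckets and then computes a plain Pearson correlation. The source names and numbers are hypothetical, and a real solution would do far more than this.

```python
from collections import defaultdict
from math import sqrt

def align_by_time(series, bucket_seconds=60):
    """Group samples from several sources into shared time buckets.

    `series` maps a source name to a list of (unix_timestamp, value) pairs.
    Returns {bucket_start: {source: average value}}, keeping only buckets
    in which every source reported at least one sample.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for source, samples in series.items():
        for timestamp, value in samples:
            bucket = int(timestamp // bucket_seconds) * bucket_seconds
            buckets[bucket][source].append(value)
    aligned = {}
    for bucket in sorted(buckets):
        per_source = buckets[bucket]
        if len(per_source) == len(series):
            aligned[bucket] = {s: sum(v) / len(v) for s, v in per_source.items()}
    return aligned

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

# Hypothetical samples: the two sources report at slightly different moments.
series = {
    "network": [(0, 1), (65, 2), (125, 3)],
    "storage": [(10, 10), (70, 20), (130, 30)],
}
aligned = align_by_time(series)
r = pearson([row["network"] for row in aligned.values()],
            [row["storage"] for row in aligned.values()])
```

A high coefficient only says the two metrics moved together. Deciding which one caused the other still takes knowledge of how the components depend on each other.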

No Human Alive

Let's swing back to reality, for just a moment. There is a lot of data, and there remains a danger in simply collecting more metrics. The human challenge of having to correlate ever more metrics is what got IT in its PerfMon‐phobic situation today. With that veritable fire hose of measurements coming in at every moment, there comes a point where no human alive can divine meaning out of the numbers. But where humans fail, algorithms succeed.

Here's the part we IT pros often forget: Computer systems, even the highly‐interdependent ones driving our virtual environments, are by nature deterministic. Every behavior that can be measured exists for a reason. That reason is by definition predictable, if one has the necessary algorithms in place. Or, to put it in different terms, what we humans perceive as chaotic is in fact just a system with a lot of variables.

We can't divine meaning, but mathematics can. By replacing constant human attention with a kind of algorithmic black box, we gain a helping hand in processing those metrics and deconstructing the fire hose into something more manageable. What goes in Figure 1.7's black box are all the metrics from your virtual environment's layers; what comes out is a kind of actionable intelligence, or essentially "suggested actions based on actual data."

Figure 1.7: An algorithmic black box.

You cannot easily build this yourself. You can find proof of this in the multiple solutions on the market today that follow the black box approach, each with its own spin on the concept. Notwithstanding their differences, nearly all of them subscribe to the notion that monitoring a virtual platform has to happen at a perspective outside that virtual platform.

What's Inside the Box?

Although the exact detail behind each solution involves some measure of secret sauce, the overarching concept is what's important. In order to understand what's inside the black box, you must first understand the nature of the metrics themselves.

I wrote another book titled The Definitive Guide to Application Performance Management (Realtime Publishers) that discusses in greater detail the framework for this black box monitoring approach. At around 200 pages in length, it provides a deeper‐level discussion on this topic.

The real world can involve hundreds (if not thousands) of metrics across a wide range of potential components. To help illustrate, let's simplify things and play pretend for a minute. In our imaginary data center, we'll ignore all the actual metrics and their names. We'll throw away our preconceived notions of IOPS and CPU Latency, Memory Pressure and Network Packets Received, et al. In their place we'll focus on an imaginary metric: Jeejaws. It's a silly name, but that's my point. It's important here to separate the metrics from what the metrics intend to do.

This separation of metrics from the intent of metrics becomes even more important in Chapter 2 as we extend performance management into capacity management.

In this imaginary world, we can assume that components reporting higher numbers of Jeejaws are simply doing more. A virtual machine, for example, that's reporting 450 Jeejaws is performing more work than one with half that number. The same holds true for the other components.

We've hooked up monitors in this world to all the appreciable places that could impact performance. These correspond to the virtual environment components discussed throughout this chapter. Figure 1.8 shows the monitoring solution in place, where metrics are being collected from the virtual environment's most important locations.

Figure 1.8: Measuring Jeejaws.

Then, one day, we get a phone call: "The Exchange server is slow!" Now what? Suddenly, the performance troubleshooting process becomes quite a bit more scientific.

Well, virtual machine #4 over there is running at 450 Jeejaws today, but hypervisor #2 is showing double the Jeejaws of hypervisor #1. And, check out that storage metric! When was the last time it ran above 600 Jeejaws?

Replacing the prototypical administrator's gut feeling approach is a quantitative measurement that can relate actual performance to numerical values. Quantifying performance then involves identifying what's acceptable, and recognizing when a component's activities have gone past thresholds.
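Here's what that threshold check might look like as a sketch in Python. The 450 and 600 Jeejaw readings echo the phone-call example. The warning and critical limits, the hypervisor reading, and the component names are all invented for illustration.

```python
def stoplight(value, warn_at, critical_at):
    """Map a reading onto a green / yellow / red status using two thresholds."""
    if value >= critical_at:
        return "red"
    if value >= warn_at:
        return "yellow"
    return "green"

# Hypothetical (warn, critical) limits in Jeejaws for each component.
thresholds = {
    "vm4": (400, 500),
    "hypervisor2": (800, 1000),
    "storage": (500, 600),
}
# Hypothetical current readings in Jeejaws.
readings = {"vm4": 450, "hypervisor2": 700, "storage": 620}

status = {name: stoplight(readings[name], *thresholds[name]) for name in readings}
```

In this made-up case the storage component is the one past its limit, which gives the troubleshooter somewhere specific to start.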

There's one piece remaining: dependencies. As you can imagine, every component in a system has dependencies on other components. Remote storage, for example, won't work well atop a poorly performing network. The final step in this process involves creating a kind of service model, a hierarchy that defines the traceability between dependencies.

Figure 1.9 shows a simplistic example of what a service model might look like. In it, you can see how the Exchange application's performance relies on an OS. That OS relies on resources from the Virtual Machine object. From there, behaviors within the Hypervisor and Hardware can create an impact, as can Storage and Networking.

Figure 1.9: Applying thresholds to metrics.

Now, admittedly, a production virtual environment's model would look far more complex. A fully‐defined model could involve an impressive branching path of dependencies with virtual machine activities impacting each other, all atop a mesh of hardware interdependencies. This example is intended merely to get you started.

With that model in place, it becomes much easier to see how the activities of one component impact another. Figure 1.9 also displays a stoplight icon on each component. With that visible notification, a troubleshooting administrator can quickly identify that a problem is occurring, then trace that problem to its contributing behaviors. That's the actionable intelligence the black box intends to deliver: Here's where to look to fix this problem.
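A service model like this is, at heart, a dependency graph with a status attached to each node. The sketch below is one possible reading of the simple Exchange example, written in Python. The exact shape of the hierarchy and the statuses are my assumptions for illustration, not a reproduction of any product's model.

```python
# Each component lists the components it directly depends on (assumed layout).
depends_on = {
    "Exchange": ["OS"],
    "OS": ["Virtual Machine"],
    "Virtual Machine": ["Hypervisor", "Storage", "Networking"],
    "Hypervisor": ["Hardware"],
    "Hardware": [],
    "Storage": [],
    "Networking": [],
}

def contributing_problems(component, depends_on, status):
    """Walk the dependencies depth-first and return every non-green component found."""
    found, seen, stack = [], set(), [component]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if status.get(current, "green") != "green":
            found.append(current)
        # Reverse so dependencies are visited in the order they are listed.
        stack.extend(reversed(depends_on.get(current, [])))
    return found

# Hypothetical stoplight states; anything not listed is treated as green.
status = {"Exchange": "yellow", "Storage": "red"}
suspects = contributing_problems("Exchange", depends_on, status)
```

Starting from the slow Exchange service, the walk surfaces Storage as the contributing component: here's where to look.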

Step back just a bit from the fantasy, and you can begin to add actual counters that might be valuable in a virtual environment. Knowing the number of storage utilization Jeejaws might not be enough to identify a performance problem's root cause, but the metrics that feed into that value might. Figure 1.10 shows how this could occur for the four sub‐counters that reflect more detail about what's going on with the storage components.

Figure 1.10: Drilling down.

That drill down shows that the performance threshold being crossed relates to storage commands per second and not command latency. These specifics further assist in finding the right action to fix the problem.
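The drill-down itself can be sketched very simply. This Python example uses only the two sub-counters named above. The values and limits are invented, and the figure's other two sub-counters are left out rather than guessed at.

```python
def crossed_thresholds(sub_counters):
    """Return the names of sub-counters whose value meets or exceeds its limit."""
    return [name for name, (value, limit) in sub_counters.items() if value >= limit]

# Hypothetical (current value, limit) pairs for the storage component.
storage_sub_counters = {
    "commands_per_second": (5200, 5000),
    "command_latency_ms": (8, 20),
}

offenders = crossed_thresholds(storage_sub_counters)
```

With these numbers only the commands-per-second counter is over its limit, which points toward load rather than slow response as the thing to fix.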

A Note on Tuning

You might be thinking at this point, "Well, that's great, but every environment is different, as is every business cycle. Tuning these models can be a nightmare of effort."

It's a valid concern, as are the false positives a poorly‐tuned black box can create. Your measurement of what's acceptable is surely different than the next person's.

Different solutions take different spins on this modeling approach. One key difference is in how the metrics are tuned—manually, automatically, or a combination of both—over time. The solution you want gives you the flexibility to tune your metrics while automatically making adjustments to fit your IT services and your business cycle.
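To give a feel for what "a combination of both" could mean, here's a hedged sketch in Python. It derives a threshold from recent history (the mean plus three standard deviations, a common statistical rule of thumb that I'm assuming here) and lets an administrator pin a manual value instead. This is not how any particular product tunes itself.

```python
from statistics import mean, pstdev

def tuned_threshold(history, k=3.0, manual_override=None):
    """Return a threshold: the manual value if given, else mean + k standard deviations."""
    if manual_override is not None:
        return manual_override
    return mean(history) + k * pstdev(history)

# Hypothetical recent readings, in Jeejaws.
history = [100, 110, 90, 100]

automatic = tuned_threshold(history)                   # adapts as history changes
pinned = tuned_threshold(history, manual_override=150)  # administrator's choice wins
```

Feeding the function a history window that matches your business cycle (a week, a month, a quarter) is one way the automatic side could follow seasonal behavior.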

Extending into the Cloud

Not surprisingly, this top‐down approach to performance management continues to serve a purpose as a virtual environment extends into the cloud. Its integration of component‐specific metrics into an overall service model is extremely useful when managing assets atop hardware you might not own.

Figure 1.11 shows graphically one way those metrics can be exposed. Virtual machines in the hosted model tend to be driven by the same hypervisor technologies that power the on‐premise Private Cloud. Being the same, those technologies expose similar APIs and/or other endpoints that can be consumed by an on‐premise monitor.

Figure 1.11: Local monitors; remote metrics.

As before, using a unifying solution to consolidate metrics across multiple components (and, in this case, multiple locations) enables correlating behaviors across an entire distributed system. In Figure 1.11, metrics in each virtual machine, at the hypervisor, and within the provider's storage and networking are collected and processed by the local performance management solution.

Performance Management Requires a Solution

Performance counters are great. They give you an idea of what's going on inside a computer system. But all by themselves, performance counters are an absolute fire hose of information. Unless you're watching them constantly, and making sense of how each impacts the others, it becomes easy to get overwhelmed at the data they present.

That systemic overload can be a key reason many IT professionals today aren't doing performance management (either effectively or at all) in their data centers. The activity is simply something no human alive can accomplish without assistance.

In a world where unlimited resource supply is now considered a waste of hardware investment, today's virtual environments are striving to make best use of every dollar spent. That desire for optimization, as you've learned in this chapter, mandates a change to our old ways of thinking. That change arrives in the solutions that now exist to automate much of the number crunching. Implementing such a solution for performance management in your virtual environment is the new best practice.

Quantifying the activities of performance management isn't the only way these solutions bring an important assist to today's virtual infrastructure. They're also extremely handy at measuring capacity. Although performance management and capacity management are very different activities, the types of new best practice solutions you'll want to implement are absolutely critical for accomplishing both tasks. The next chapter will focus the discussion on capacity management. You might be surprised at how the right focus simplifies answering the question: What do you have versus what do you need?

New Best Practices in Capacity Management

There exists an economy of resources in a virtual environment. Hardware contributes to resource supply, while resources are demanded by needy virtual machines.

It was years ago that I obtained my bachelor's degree, an achievement that required no small amount of study in the field of economics. I counted myself among the few who loved the topic, and for a very specific reason: I had the privilege of learning from one Professor Fred M. Gottheil at the University of Illinois Department of Economics. Dr. Gottheil was a consummate presenter, consistently delivering memorable lectures to his massive auditorium of students.

It wasn't just his presentation prowess that's kept Doc Gottheil in the back of my mind; it was his grasp of the "real world" of economics. He professed that although economics at face value concerns itself with dollars, dig shallowly beneath the surface and you'll find economics in everything. The doc believed, as I do now, that there's supply and demand everywhere: How the number of lanes on a highway impacts traffic flow. How the price of a relationship is governed by emotion. Heck, how even the Golden Rule can arguably be described within the lens of economic theory. Little did I know that these rules would later apply to IT as well, and most specifically to the approaches we use in managing our virtual and cloud environments.

Private Clouds, Resource Pools, and the Supply and Demand of Resources

I begin with this story because IT data centers perhaps unknowingly follow the rules of supply and demand, or at least the good ones do. Today's ever‐increasing embrace of virtualization and cloud computing creates the situation where direct measurement of supply and demand has become the new best practice.

At the center of it all is the resource pool.

Explaining this assertion requires first stepping back from the capacity management activity and focusing instead on what we're asked to manage. In recent years, we've been told that the most efficient approach to delivering virtual environment resources involves thinking like a Private Cloud.

To think like a Private Cloud, one can assume you first need to have one. Yet the question seemingly on everyone's minds is, "What makes a Private Cloud?" Study the IT press and the vendors' marketing glossies and you'll learn that a Private Cloud today enables

  • Availability for individual IT services
  • Flexibility in managing services as well as in deploying new services
  • Scalability when physical resources run out
  • Hardware resource optimization, to ensure that you're getting the most out of your investment
  • Resiliency to protect against large‐scale incidents
  • Globalization capacity, enabling the IT infrastructure to be distributed wherever it is needed

At first blush, this list makes sense in terms of IT needs. We need high availability for our IT services. We want flexibility in managing and deploying new services, as well as scalability to ensure existing ones can expand when necessary. As the first chapter argues, our costs also demand optimization on that investment to ensure we're squeezing out every dollar of benefit. Our businesses lay resiliency and geo‐location requirements on us to protect and distribute assets wherever they're needed.

However, although these marketing‐friendly terms define what a Private Cloud strives to accomplish, they don't say much about what it really is. In fact, our industry's disagreement on a commonly‐accepted Private Cloud definition might just be at the center of our confusion on how to construct one. Maybe we need to simplify. Here's a definition I've used before:

Although virtual machines are the mechanism through which IT services are provided, the Private Cloud infrastructure is the platform that enables those virtual machines to be created and managed at the speed of business.

This definition argues that the hypervisor constitutes a layer of data center resource abstraction. That hypervisor abstracts physical resources to virtual machines, enabling virtualization—and, thus, virtual machines—to perform their tasks. Supporting that hypervisor, then, is an entire infrastructure of hardware and processes, the collection of which enables the hypervisor to do its job. In this description the Private Cloud is that infrastructure.

Yet although it sounds good, this definition doesn't describe well, in IT terms, what makes up that Private Cloud. It defines the infrastructure but not the resources that make up that infrastructure. As a result, I've found myself using a second definition:

A Private Cloud at its core is little more than a virtualization technology, some really good management tools, and those tools' integration with business processes.

This second definition suggests that a Private Cloud is perhaps something far simpler than what we assume, and acknowledges that a Private Cloud requires a hypervisor as well as a set of tools to manage that hypervisor; those tools' functions align with business needs. This second definition argues for simplicity. Even so, I've found it doesn't resonate well with IT practitioners.

I talk about these two definitions first because they introduce my third (and so far best) definition. To me, the following definition best explains the Private Cloud concept in IT terms. It does so by focusing on the resources we're responsible for managing:

A Private Cloud is a host cluster with high availability and disaster recovery services turned on, plus a little bit.

And that's it. A Private Cloud, at its most fundamental, is a resource pool (see Figure 2.1). It is a logical collection of all the little data center resources that IT services—and virtual machines—require to accomplish their mission. Facilitated by the scheduling function of a distributed hypervisor, resources such as processor cores and threads get aggregated to fill the pool. The same holds true with memory, disk, and network resources as well.

Figure 2.1: A Private Cloud is a resource pool.

It's the screen in Figure 2.2 that helped me realize this third definition. This interface is borrowed from one of the major virtual platform players, but the other vendors have similar solutions.

Figure 2.2: Resource allocation in a host cluster.

In this screen, you see the resource allocation for a host cluster as well as the total CPU capacity for the cluster as a whole. That capacity is represented as:

It further shows the memory capacity of the cluster, represented as:

These formulas might seem excessively simple when considered within the greater framework of a dynamic virtual environment. But they're valid, and they constitute the foundation of capacity management: How many resources do you have?

Recognize also that those values are but half of the equation. Paired up with them are the values for reserved capacity and overhead utilization as well as the quantity of resources that are currently being consumed. These, at a very high level, constitute capacity management's other half: How many resources are you using?
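The two halves of that equation can be sketched in Python. To be clear, this is not the formula from the vendor's screen. It assumes the simplest possible accounting: supply is the sum of each host's capacity, and demand is the sum of each virtual machine's current consumption. Host names, VM names, and GHz figures are invented, and reservations and overhead are left out.

```python
def pool_supply(hosts):
    """Total capacity of the pool: how many resources do you have?"""
    return sum(hosts.values())

def pool_demand(vms):
    """Total current consumption: how many resources are you using?"""
    return sum(vms.values())

def utilization_pct(supply, demand):
    """Demand expressed as a percentage of supply."""
    return demand / supply * 100.0

# Hypothetical CPU figures in GHz.
hosts_ghz = {"host1": 20.0, "host2": 20.0}
vms_ghz = {"vm1": 4.0, "vm2": 6.0, "vm3": 10.0}

supply = pool_supply(hosts_ghz)           # 40.0
demand = pool_demand(vms_ghz)             # 20.0
used = utilization_pct(supply, demand)    # 50.0
```

The same shape works for memory or any other pooled resource. Only the units change.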

Or, as Dr. Gottheil might say, these two values identify your Private Cloud's supply of resources in relation to its virtual machines' demand. As you should quickly see, there truly is economics in everything.

Abstracting and Simplifying Capacity Management

Chapter 1 discusses how virtual and cloud environments are complex interconnections of hardware, software, and services. Their functionality requires a careful orchestration of components, even as each is managed independently. In the prototypical virtual and/or cloud environment, networks tend to be managed by one set of tools and administrators, storage by a second set, and servers and the virtual environment by a third. Doing so facilitates the separation of duties as well as the separation of administrative domains.

Chapter 1 further argues that measuring performance in such an environment can only happen by watching metrics at every layer, all at once. I reintroduce a figure from Chapter 1 as Figure 2.3 to reinforce those areas where monitors might get targeted.

Figure 2.3: Multiple components require multiple monitors.

Important to recognize at this point is that performance and capacity management are very different activities. The former concerns itself with the experience in using the system, "Is it fast enough today? If not, why?" The latter focuses on what amounts to a single equation, "How many resources do I have, and how many will my workloads need?"

Where the two activities get commonly confused is in the analytics used to answer their questions. You can use the same kinds of monitors to measure capacity as you can for measuring performance. The difference is in the questions you're attempting to answer.

Reintroducing the Jeejaw

To illustrate this activity, let's bring back the nonsense metric first introduced in the first chapter: the Jeejaw. Just like before, the Jeejaw measures some aspect of the various components that constitute our Private Cloud environment. Different here are the questions we'll use the Jeejaw data to answer.

As any virtual administrator knows, virtualization primarily concerns itself with The Core Four: processing, memory, storage, and networking. It's the hypervisor's job to abstract these core four resources and make them available for co‐located virtual machines. Each virtual machine demands a specific quantity of each resource at every point in time: A heavily‐taxed database might need more, while a lightly‐used IT file server might need much less, and so on.

Also important is the recognition that virtual environment resources are shared resources. These resources are highly dynamic, which makes them difficult to measure unaided.

Capacity management (see Figure 2.4) at its most elemental is concerned with ensuring enough resources are available (the supply) to meet the current and future needs of these workloads (the demand). This activity is made challenging by the "messiness" that's intrinsic to virtualization: resources are used dynamically, virtual machines can be relocated anywhere, stuff is constantly being powered on and off. These combine to make the capacity management activity just as difficult as performance management when one has no tools to assist.

Figure 2.4: Abstracting Core Four resources into integer values.

Simplifying by Abstracting

An important new best practice focuses on tools that simplify capacity management by abstracting the data. Figure 2.4 shows a representation of how this might look. In it, a virtual environment's innumerable metrics have been replaced by representative values for each of the Core Four resources. For each value, there is an assertion of the capacity of that resource in contrast with how much is currently being used.

Armed with this information, a virtual administrator can, with a passing glance, get an immediate feel for where resources are getting low. In the example in Figure 2.4, processor, storage, and network resources are sufficient to meet virtual machine demands. Memory resources, however, appear to be running out.
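To make the at-a-glance idea concrete, here is a minimal sketch of how a tool might reduce each Core Four resource to a supply figure, a demand figure, and a stoplight color. Everything in it is an assumption for illustration: the `Resource` class, the 75 and 90 percent thresholds, and the sample numbers are invented and are not taken from Figure 2.4 or from any product.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    capacity: int  # abstract units of supply
    used: int      # abstract units of current demand


def status(resource, yellow=0.75, red=0.90):
    """Map utilization to a stoplight color (thresholds are assumed)."""
    utilization = resource.used / resource.capacity
    if utilization >= red:
        return "red"
    if utilization >= yellow:
        return "yellow"
    return "green"


# Hypothetical values; none of these numbers come from the guide's figures.
core_four = [
    Resource("processing", 100, 40),
    Resource("memory", 100, 93),
    Resource("storage", 100, 55),
    Resource("networking", 100, 30),
]

for r in core_four:
    print(f"{r.name:<11} {r.used:>3}/{r.capacity:<3} {status(r)}")
```

The point of the abstraction is that the administrator reads four colors rather than hundreds of raw counters; where the thresholds sit is a policy decision for each environment.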

Admittedly, these numbers should be of only limited value in their absolute form. One can argue that a well‐managed Private Cloud will never see a green capacity light go yellow or red. Its proper planning involves acting before resources get low by ensuring more will arrive before they run out.

The problem here is again virtualization's "messiness": its incredible complexity that goes beyond the analytical limits of the unaided human brain. As a consequence, it is exactly this kind of planning that is incredibly difficult to accomplish without tools to assist. Figure 2.5 shows one such visualization, which charts a virtual environment's memory consumption over time.

Figure 2.5: Trendlining memory consumption over time.

Graphs like these are necessary to show consumption trends over time. More important than the actual values is the graph's red trendline. That trendline points to some future date when memory consumption can be expected to exceed available capacity. Your job in the capacity management activity is all about ensuring resources are always available before that day comes.
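As a rough illustration of what such a trendline does, the sketch below fits a straight line to weekly memory samples and extrapolates to the point where it crosses capacity. The sample data, the 640 GB capacity, and the function names are assumptions made for this example; a real tool would work from far richer data.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_exhausted(used_by_week, capacity):
    """Weeks from the latest sample until the trendline reaches capacity."""
    xs = list(range(len(used_by_week)))
    slope, intercept = linear_fit(xs, used_by_week)
    if slope <= 0:
        return None  # a flat or falling trend never reaches capacity
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - xs[-1])


# Hypothetical weekly memory consumption in GB, against 640 GB of capacity.
memory_gb = [410, 422, 431, 445, 452, 466, 470, 485]
remaining = weeks_until_exhausted(memory_gb, capacity=640)
print("No exhaustion on the current trend" if remaining is None
      else f"Trendline reaches capacity in about {remaining:.1f} weeks")
```

The number that matters is not the current consumption but the remaining lead time, because that is what has to exceed your purchasing and deployment cycle.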

Whither the Trendline?

That stated, one must be careful with simplicity alone. Trendlines can be insidious, and any IT professional with a copy of Microsoft Excel and a passing grade in statistics can generate them. Like statistics, poorly‐constructed trendlines can lie. A couple of clicks inside Excel, and Figure 2.5's first‐order trendline can be easily converted into a second‐, third‐, or greater‐order slope, any of which can greatly shift the graph's end date forwards or backwards.
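The effect of changing a trendline's order is easy to demonstrate. The sketch below fits both a first-order and a second-order polynomial to the same invented samples and reports the week at which each crosses capacity. The data, the 600-unit capacity, and the helper names `polyfit` and `first_week_over` are all assumptions for illustration; the fit is a plain least-squares solve, not any spreadsheet's internal method.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest order first."""
    size = degree + 1
    # Normal equations (A^T A) c = A^T y, where A[i][j] = xs[i] ** j.
    matrix = [[sum(x ** (r + c) for x in xs) for c in range(size)]
              for r in range(size)]
    vector = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(matrix[r][col]))
        matrix[col], matrix[pivot] = matrix[pivot], matrix[col]
        vector[col], vector[pivot] = vector[pivot], vector[col]
        for row in range(col + 1, size):
            factor = matrix[row][col] / matrix[col][col]
            for k in range(col, size):
                matrix[row][k] -= factor * matrix[col][k]
            vector[row] -= factor * vector[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(matrix[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (vector[row] - tail) / matrix[row][row]
    return coeffs


def first_week_over(coeffs, capacity, start, horizon=520):
    """First whole week at or after `start` where the curve reaches capacity."""
    for week in range(start, start + horizon):
        if sum(c * week ** p for p, c in enumerate(coeffs)) >= capacity:
            return week
    return None


# Hypothetical samples with a slight upward curve; capacity is 600 units.
weeks = list(range(10))
usage = [400 + 5 * w + 0.5 * w * w for w in weeks]

linear_week = first_week_over(polyfit(weeks, usage, 1), 600, weeks[-1])
quadratic_week = first_week_over(polyfit(weeks, usage, 2), 600, weeks[-1])
print(f"first-order trendline crosses capacity at week {linear_week}")
print(f"second-order trendline crosses capacity at week {quadratic_week}")
```

With these made-up numbers, the straight line puts the crossing at week 22 while the second-order curve puts it at week 16. The data are identical and the answers are six weeks apart, which is exactly why the choice of model deserves as much scrutiny as the data.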

The data feeding into Figure 2.5's memory consumption prediction can also lie. Getting good data in a virtual environment can be very challenging. Tools or manual methods that don't include abnormal workloads, consumption peaks and valleys, availability reserve, seasonal and cyclical trends, and virtualization overhead in their equations will generate less‐than‐trustworthy results.

A final challenge is in divining actionable intelligence out of the data you've collected. Consider as an example how the aforementioned Excel spreadsheet might fail when its metrics can't be forecasted to meet the range of possible future situations. Figure 2.6 shows a representation of this, where three possible future states are predicted: The first shows anticipated memory consumption when future business is not expected to change, the second predicts consumption should the market drop, while the third predicts consumption for the case where the business gets acquired.

Figure 2.6: Trendlining across situations.

These kinds of What‐If Modeling tools are useful for predicting resource demand and supply and are invaluable in the highly‐dynamic virtual environments that follow Private Cloud thinking. The result of their modeling directly impacts the relationship between IT and a business's supply chain, and goes far in ensuring that IT resources can always meet business demands.
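A what-if model can be as simple as running the same projection under different growth assumptions. The sketch below does that for the three situations described above. The growth rates, the starting consumption, and the capacity are invented, and modeling each scenario as a constant compound weekly rate is a simplifying assumption; an acquisition, for instance, may look more like a one-time step than a rate.

```python
def project(current, weekly_growth, weeks):
    """Compound a starting value by a fixed weekly growth rate."""
    return current * (1 + weekly_growth) ** weeks


def weeks_of_headroom(current, capacity, weekly_growth, horizon=520):
    """Whole weeks until projected demand reaches capacity, or None."""
    if weekly_growth <= 0:
        return None  # demand is flat or shrinking
    for week in range(horizon + 1):
        if project(current, weekly_growth, week) >= capacity:
            return week
    return None


# Hypothetical scenarios: name -> assumed weekly growth in memory demand.
scenarios = {
    "business unchanged": 0.010,
    "market drops": -0.005,
    "business acquired": 0.040,
}

for name, growth in scenarios.items():
    weeks = weeks_of_headroom(current=480, capacity=640, weekly_growth=growth)
    outcome = "no shortfall projected" if weeks is None else f"{weeks} weeks of headroom"
    print(f"{name:<20} {outcome}")
```

Laying the scenarios side by side turns the forecast into a conversation with the business: the question stops being "when do we run out?" and becomes "which of these futures are we planning for?"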

Capacity Management, Converged Infrastructure, and Hardware "Designed with Private Cloud in Mind"

Admittedly, the activities in capacity management are all academic without real‐world actions that are the result of their effort. At the end of the day, capacity management is very much a function of purchasing. One performs its activities to ensure that just enough resources are always available—never too much, and definitely never too little. Buying hardware is often the real‐world action that results.

There is, interestingly enough, a new best practice associated with the kinds of hardware one associates with Private Clouds. The current buzzword for this class of hardware is Converged Infrastructure, although it might better be described as "hardware that's designed with Private Cloud in mind."

I've found that describing this new, somewhat‐specialized hardware to experienced IT professionals is something of a challenge. The prototypical IT professional has gotten used to focusing on hardware and its management as a primary function of their job. Even in today's virtual world, we still think of servers as "servers," even when they're virtual hosts. A server, and the hardware chassis that encapsulates that concept, is in many ways the unit of management for the average IT practitioner.

Capacity Management: Not Necessarily Always More

Actually, that last statement isn't entirely true. It might be true in today's completely on‐premises environments. In those, there's generally always a need to "keep up with the demands of business." More business ventures usually mean more resources that need to be brought to bear. Such internal services are notorious for rarely being decommissioned, even with IT's usual due diligence in seeing legacy equipment out the door.

Everything changes, however, when businesses begin to extend their Private Cloud environments into the public cloud. There, services are priced not as hardware but on a consumption model. That model naturally incentivizes decommissioning services the moment they're no longer useful, or alternatively, bringing those services in‐house when they've been deemed a long‐term asset worth managing.

Although much of the talk in this chapter discusses how capacity management tends to directly impact new purchases, it is important to recognize that that needn't necessarily be the case in certain Public and Hybrid Cloud situations.

In fact, that recognition might just be one good reason to consider extending into Public Cloud services. Doing so leverages your capacity management tools and experience for cost containment instead of merely always buying more equipment.

White Boxing, Generation I

The hardware that defines Converged Infrastructure strives to evolve that preconceived notion of "server." It does so, at least as I like to describe it, by eliminating our industry's second generation of white boxing.

Bear with me now, because the story makes a lot of sense: Many years ago in the time before vendor‐engineered server hardware, one of IT's tasks involved constructing "servers." In those days, server hardware followed the same approach as do some types of consumer hardware today: You buy a motherboard from one vendor, a case from another, and memory and hard drives perhaps from a third. The process of "building a server" required actually constructing that equipment out of whole cloth, assembling all its individual pieces to create a white box.

This white boxing practice worked well enough in the days before vendor engineering created today's notion of a server. The early practice created downstream problems, notably that every white box was a little bit different from the one before. Eventually, the practice was abandoned as we realized the configuration control and stability benefits in buying vendor‐engineered servers over assembling our own.

White Boxing, Generation II

At some point, virtualization became the new best practice, and virtual servers began to outnumber physical servers in data centers everywhere. What many IT professionals don't recognize, however, is that the embrace of virtualization inadvertently created a second generation of white boxing. But this time, we're white boxing our entire data centers.

As a consequence, succeeding generations of virtual environment hardware—whether purchased ad hoc or as the result of capacity management activities—began to accumulate (see Figure 2.7). You see evidence of these activities in data centers everywhere: Virtual hosts that were purchased in groups inadvertently create islands of compatibility for our selected hypervisor. This situation creates big pains: Virtual machines can't migrate across different hosts, and so a single, unified Private Cloud is forced to become a collection of smaller ones. Just like before, our actions have inadvertently created more work for ourselves.

Figure 2.7: Compatibility within purchasing generations, but incompatible between generations.

Virtualizing I/O

Overcoming incompatibility is only one of Converged Infrastructure's goals. Improving the activities in capacity management has also become a new best practice. Converged Infrastructure's hardware aims to accomplish this goal by increasing the level of commonality among physical hardware, while adding Private Cloud‐aware enhancements to the hardware itself.

Many of those enhancements center on reducing the complexity of interconnections between components. Leaning on ever‐faster technologies in networking, these interconnections evolve from being predominantly physical to almost completely logical. As logical connections, their behavioral patterns are far more easily monitored and profiled by a performance and/or capacity management solution.

That's a good thing because it is the interconnections between components that touch everything in a Private Cloud environment. As Figure 2.8 shows, the interconnections exist at the hardware layer, the hypervisor, and even into virtual machines' interaction with storage. Keeping a very robust eye on that network's behaviors goes far towards delivering the kind of data that a capacity management solution requires.

Figure 2.8: Networking touches everything.

Perhaps Not Chargeback But "Showback"

The final topic worth discussing in this capacity management exploration deals with the business' desire for IT alignment. "Aligning IT with the business" has over the years incurred an almost‐humorous series of missteps:

  • There were the "make IT a profit center" campaigns a number of years ago. Many weren't entirely successful.
  • Others attempted an "IT as a business‐within‐a‐business" approach, whereby services were charged back to those requesting them.
  • Even others embraced the outsourcing model, which traded reductions in unexpected costs for a rigidly inflexible service delivery model.

One can argue that all of these missteps attempted to accomplish one thing: Bring business relevance to IT costs.

The tools that facilitate capacity management add another new spin on IT‐business alignment. One such approach eschews the challenging‐to‐implement chargeback approach for a simpler‐but‐no‐less‐effective model called showback.

Recall my earlier ("third") definition of Private Cloud: "A Private Cloud is a host cluster with high availability and disaster recovery services turned on, plus a little bit." I've purposely held off discussing the "little bit" until this point in the story.

Many IT pundits believe that one facet of that little bit centers on self‐service. In a self‐service environment, entitled users are given the ability to generate their own IT services at will. Generally, as long as they meet certain requirements (resource use, business need, and so on), such users are free to create, manage, and decommission whatever services they need. It becomes IT's job to manage the templates, ensure everyone plays nicely with each other, and maintain appropriate resource reserves.

As you can imagine, implementing self‐service without capacity management is a recipe for chaos. Lacking capacity management, self‐serving users tend to consume resources until those resources are exhausted. That's not proactive management.

Conversely, a resource pool that is capacity managed has the ability to set hard limits on how many resources each consumer gets to work with. Figure 2.9 shows how the total resource pool in a Private Cloud can be broken down into sub‐pools by project, with business rules defining the percentage each receives.

Figure 2.9: Dividing the resource pool by consumer.
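The sub-pool idea can be sketched in a few lines: business rules assign each project a percentage of the total pool, and a self-service request is accepted only if it fits inside that project's share. The project names, percentages, and unit counts below are invented for illustration, and the functions are hypothetical rather than any hypervisor's actual API.

```python
def allocate(total_units, shares):
    """Split a pool into hard per-project limits from percentage shares."""
    if sum(shares.values()) > 100:
        raise ValueError("shares exceed 100 percent of the pool")
    return {project: total_units * pct // 100 for project, pct in shares.items()}


def can_provision(limits, in_use, project, requested_units):
    """True when the request fits inside the project's sub-pool."""
    return in_use.get(project, 0) + requested_units <= limits[project]


# Hypothetical business rules: percentage of the pool each project receives.
shares = {"web_store": 50, "analytics": 30, "reserve": 20}
limits = allocate(total_units=1000, shares=shares)
in_use = {"web_store": 460, "analytics": 120}

print(limits)
print(can_provision(limits, in_use, "web_store", 30))  # fits within 500 units
print(can_provision(limits, in_use, "web_store", 50))  # would exceed the limit
```

Keeping an explicit reserve share in the rules is one way to honor the "appropriate resource reserves" that IT remains responsible for in a self-service model.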

Show Me the Money

Creating resource pools and subdivided resource pools is a feature commonly found in hypervisor platforms. Such pools work for some organizations but not all and not in every situation. What they're missing is that important linkage between IT and the business: the dollars.

The showback model facilitates assigning a dollar value for IT services, then showing that value back to the consumer. Unlike chargeback, where costs are actually charged back to the person requesting the service, the showback model simply brings real‐world dollars‐and‐cents valuations to IT services (see Figure 2.10): Need more disk space? That'll cost you ten bucks. Another processor? Fifteen. New server? That'll be a hundred and fifty.

Figure 2.10: Assigning costs to services.
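A showback report is little more than a rate card multiplied by consumption. The sketch below uses the dollar figures quoted above (ten for disk space, fifteen for a processor, a hundred and fifty for a server); the unit each price applies to, the team names, and the consumption quantities are assumptions added for illustration.

```python
# Rate card in dollars, using the prices quoted in the text.
# What counts as one "disk_increment" is an assumption of this sketch.
RATES = {"disk_increment": 10, "processor": 15, "server": 150}


def showback(consumption):
    """Total the dollar value of each consumer's services; nothing is billed."""
    return {
        consumer: sum(RATES[item] * quantity for item, quantity in items.items())
        for consumer, items in consumption.items()
    }


# Hypothetical consumption by team for one reporting period.
consumption = {
    "marketing": {"server": 2, "disk_increment": 4},
    "finance": {"server": 1, "processor": 2},
}

for consumer, dollars in showback(consumption).items():
    print(f"{consumer:<10} ${dollars}")
```

Nothing here moves money between budgets. The report only makes the cost of each request visible, which is the entire difference between showback and chargeback.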

It is in this space where a significant amount of IT leadership is being seen in today's businesses. An IT organization that knows the costs of its services greatly aids business decision making. Budgeting for new projects becomes a science rather than an art. Forecasting becomes as much a budgetary activity as a technology activity. Although no one directly gets penalized when they over‐consume, as is the case in the chargeback model, service consumers are given the business‐relevant data they need to be more successful with their decisions.

Everything (Virtual) Is Economics

Dr. Gottheil was indeed right. Economics are to be found in everything. You can find the principles of economics in the ways IT delivers services to its users. Many of us for a long time haven't applied those principles consciously, perhaps limited by the resource challenges in our early physical environments. Make those environments virtual and begin applying Private Cloud thinking, and suddenly that invisible hand becomes far more recognizable.

Coming up in the next chapter, I'll be changing course to focus on another of the new best practices in virtual and cloud computing. Chapter 3 will leverage the same kinds of data that are collected for performance and capacity management. This time, however, that data gets used for managing compliance and configuration control. You'll find that these three activities—performance, capacity, and configuration management—are more tied together than you'd think.

New Best Practices in Configuration & Compliance Management

I love that quote, and not just because I'm the one who dreamed it up. If you've ever wondered how a system driven by 1s and 0s can be so irritatingly unpredictable, just take a look at its users. The machine merely does what's asked of it; it's the users whose activities are usually the source of chaotic behavior.

Users in this case needn't necessarily be limited to just regular, non‐administrative users. We in IT sometimes forget that we're users too. And, in fact, our activities are often the most impactful on a system.

There's a reason why our activities tend to be more problematic: While regular users interact via published and highly‐controlled application interfaces, we administrators have carte blanche. We can perform any action we want, many of which involve heavy resource consumers. As a result, our administrative actions tend to be more difficult to profile.

This scenario impacts our virtual environments all the time. You know the story: One day you hear whispers that an IT service isn't behaving well. A cursory check of its resources shows nothing of interest. It isn't until much later that you learn some other administrator has been executing a dozen other tasks on the same host. Externalities in a shared virtual environment often cause more impact than the systems themselves.

That said, we must never forget that computers are deterministic, so there's an argument that unpredictable behaviors are merely behaviors you're just not monitoring. As the previous chapter argues, the seemingly chaotic virtual environment is really just one with a lot of variables. Sometimes, as Figure 3.1 suggests, those variables are the actions being performed by that system's various caretakers.

Figure 3.1: Independent actions beget unpredictability.

Think for a minute about the actions that can happen—often simultaneously—in a system of shared resources. One administrator makes a configuration change. Another provisions a new virtual machine or application. At the same time, a third troubleshoots some issue while a fourth is auditing the system for exactly those configurations being changed. Chaos, indeed.

The Black Box Approach to Configuration Management

Bringing order to the chaos requires implementing yet another new best practice: configuration management. Not just any configuration management; rather, a very special kind that fits the unique characteristics of a shared virtual environment.

That "special kind" of configuration management ties directly into the concepts and functionalities discussed to this point. Chapters 1 and 2 discuss at length the need for assistive tools in managing performance and capacity. You should, at this point, recognize that there are limits to how many variables your unaided brain can monitor and correlate. Chapter 2's conclusion further foreshadows the notion that the performance, capacity, and configuration management activities are more tied together than you'd think. When people enact change without coordination, the result is unexpected behavior.

So, what's the solution? As you might expect, the right solution follows the same black box approach discussed earlier. At this point, your black box is already collecting data and generating recommendations. All it needs are a few extra features that correlate what's being changed with what's being monitored.

Figure 3.2 gives you an idea of these new features. By continuously monitoring your environment's inventory, you gather a reference catalog of each configuration item. Centralizing execution ensures that every change is always logged to a central location. Integrating provisioning tasks adds intelligence about why and where resources are being consumed. These three facilitate the fourth new feature, automating remediation, which enables rolling back an environment to a previous state should the need arise.

Figure 3.2: Integrating configuration management into the black box.

As you already know from previous chapters, the primary goal of the black box is to distill the fire hose of metrics data that is constantly being captured. That distillation converts raw metrics into suggested actions that define what you might do next. Although previous efforts focused on the performance and capacity suggestions, that data can also drive a kind of feedback loop that takes into account each configuration change:

  • You make a change
  • The change alters the environment's behavior
  • A further change is suggested
  • Repeat

The benefit of this feedback is in being able to correlate the environment's behaviors with the execution of a change. Doing so, in a way, makes your users as deterministic as the systems they manage.
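One simple way to picture that correlation: when the metrics engine flags unusual behavior, look back through the centralized change log for anything executed shortly beforehand. The sketch below assumes a log of dictionaries with `time`, `who`, and `what` keys and a 30-minute look-back window; the structure, the window, and the sample entries are all invented for illustration.

```python
from datetime import datetime, timedelta


def changes_before(anomaly_time, change_log, window=timedelta(minutes=30)):
    """Return logged changes made within `window` before the anomaly."""
    return [
        change for change in change_log
        if timedelta(0) <= anomaly_time - change["time"] <= window
    ]


# Hypothetical centralized change log; names and times are invented.
change_log = [
    {"time": datetime(2024, 1, 8, 9, 5), "who": "admin_a", "what": "resized vm-db01 memory"},
    {"time": datetime(2024, 1, 8, 13, 40), "who": "admin_b", "what": "moved vm-web02 to host-3"},
    {"time": datetime(2024, 1, 8, 13, 55), "who": "admin_c", "what": "patched vm-app04"},
]

anomaly = datetime(2024, 1, 8, 14, 0)  # when a slowdown was flagged
for change in changes_before(anomaly, change_log):
    print(f'{change["time"]:%H:%M} {change["who"]}: {change["what"]}')
```

Proximity in time is only a hint, not proof of cause, but it narrows the search from "anything anyone did" to a short list worth investigating.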

Enacting Change, on the Virtual Machine and in the Virtual Machine

You must first recognize that changes can happen on any virtual environment element. Virtual machines are assigned varying resource levels. Virtual (and physical) network switches are reconfigured to meet updated requirements. Additional storage is routinely provisioned. And, most notably, configurations inside each virtual machine are modified to solve problems, tweak performance, or enhance security.

Various change management solutions have existed for years that facilitate the process.

Historically, these solutions have tended to focus on what's inside the operating system (OS). They haven't really cared whether the computer is physical or virtual. That disinterest works when change management alone is your goal; it doesn't work when your needs include a change's impact on shared resources.

The historical solutions are also limited by their visibility. They simply aren't designed to interact with a virtual environment's hypervisor or host cluster resources. As a consequence, the new best practice integrates change management activities into the virtual platform's management tools.

Figure 3.3 shows one approach to this design. In it, the same virtual platform tools that facilitate performance and capacity management are used to enact change on hypervisor objects as well as inside the virtual machine. Agents installed into each guest machine facilitate the installations, updates, or other configuration changes that source from the virtual platform manager. These agents are necessary because they are installed into the virtual machine's OS. This installation allows them to execute actions in the correct administrative context.

Figure 3.3: Virtual platform agents…of change.

Not Necessarily in the Box

This added functionality isn't necessarily going to be part of the core tools a virtual platform offers. Those core tools facilitate performing actions on virtual environment objects; however, they tend not to add all the extra features this chapter is referring to. These features usually cost extra.

Wielding the Vast Power of Undo

Unifying configuration management with the additional best practices suggested in this guide isn't the only benefit of this approach to configuration management. When your virtual environment enjoys continuous inventory collection and centralized change execution, you gain a very key additional benefit: the universal undo.

Recognize that this isn't just any old undo. Rather, it is one that's backed by all the feature sets intrinsic to virtualization. The right solution will have the ability to back out changes to hypervisor objects, returning them to previous configurations. Figure 3.4 shows a report on a change, detailing what happened, where it happened, and when it was detected. With the right solution, undoing these changes can happen via a combination of real‐time detection, identification, and remediation of the offending configuration. This series of events ensures that an errant change can quickly be removed with minimal impact.

Figure 3.4: Who did what, when?

Certain changes can also be rolled back with the assistance of virtual machine snapshots. These snapshots are built into every major virtual platform today, and provide a way to return a virtual machine back to the snapshot's point in time. With them, you should easily see the possibilities in orchestrating universal undo:

  • You instruct the solution to execute a change
  • The solution snapshots the virtual machine prior to execution
  • That change later needs to be removed
  • The solution reverts the virtual machine back to the snapshot

The experienced virtual administrator should at this point recognize that this functionality can be a double‐edged sword. Snapshots are absolutely useful for reverting virtual machines, but keeping too many snapshots lying around can be a bad thing. Every linked snapshot impacts performance as well as consumes storage space. Snapshots have also been notorious for creating problems when they're allowed to exist for too long.

For these reasons, you should expect that additional logic built into the solution will later consolidate snapshots after a predetermined period. Doing so ensures that snapshots don't remain resident for extended periods, essentially offering the best of both worlds. The right solution balances both requirements, delivering a reduction in change‐related risk while limiting the risks inherent to the snapshots themselves.
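The snapshot-then-change sequence, together with the age limit on snapshots, can be modeled in a few lines. This is a toy in-memory sketch, not a hypervisor API: the `VirtualMachine` class is hypothetical, a "snapshot" is just a copy of a settings dictionary, times are plain hour counts, and "consolidation" is modeled as simply discarding snapshots past an assumed 48-hour limit.

```python
import copy


class VirtualMachine:
    """Toy model: configuration is a dict, snapshots are copies of it."""

    def __init__(self, config):
        self.config = dict(config)
        self.snapshots = []  # list of (taken_at_hour, saved_config)

    def apply_change(self, now, key, value):
        """Snapshot first, then execute the change."""
        self.snapshots.append((now, copy.deepcopy(self.config)))
        self.config[key] = value

    def undo_last_change(self):
        """Revert to the most recent snapshot and discard it."""
        _, saved = self.snapshots.pop()
        self.config = saved

    def consolidate(self, now, max_age):
        """Drop snapshots older than max_age so none linger indefinitely."""
        self.snapshots = [(t, c) for t, c in self.snapshots if now - t <= max_age]


vm = VirtualMachine({"cpus": 2, "memory_gb": 8})
vm.apply_change(now=0, key="cpus", value=4)         # change executed at hour 0
vm.apply_change(now=50, key="memory_gb", value=16)  # second change at hour 50
vm.undo_last_change()                               # memory change backed out
vm.consolidate(now=72, max_age=48)                  # hour-0 snapshot has aged out
print(vm.config, len(vm.snapshots))
```

The trade-off is visible in the last step: once a snapshot has been consolidated away, the change it protected can no longer be undone this way, so the retention period is effectively the length of your undo window.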

Useful for Anticipated; Really Useful for Unanticipated

I can't overstate the benefit in offering global undo functionality. Not only can anticipated changes be rolled back when they're found to be ineffective or problematic, but so can unanticipated changes. Remember that the defining point of this solution is to centralize the dissemination of changes. When that solution also supports rollback, administrators gain the ability to click once and back out any change they find.

Configuration Management Feeds Compliance Management

You can't talk about configuration management these days without a nod to maintaining compliance. Compliance management in this discussion refers not only to external guidelines handed down from regulation and/or security policies but also internal guidelines that ensure systems are configured to meet best practices. As you can probably guess, the activities in compliance management fit perfectly within our black box approach:

  • Services feeding data into your black box are constantly capturing inventory information
  • That inventory information contains auditable configuration items on hypervisor objects as well as inside virtual machines
  • Centralizing provisioning and the execution of changes through the same system ensures configuration updates are similarly captured
  • Feedback keeps everyone honest

Masking a few of these out for a minute (Figure 3.5) allows us to focus on two capabilities that are central to compliance management's new best practices. One of those, compliance templates, is new to our black box diagram. Compliance templates are exactly as they seem: lists of configurations that define specific settings required to meet an established baseline.

Figure 3.5: Templates and remediation.

These baselines can be anything. One baseline might identify settings that define a corporate security policy. Another might outline requirements handed down by an external regulatory agency. A third may include settings that follow performance best practices.

Your limit is only your imagination.

What Makes a Template?

Notwithstanding what its baseline attempts to accomplish, a prototypical template generally comprises three elements:

  • The configuration item
  • The compliance rule
  • The validation code used to verify the setting

That third element, the validation code, is where compliance templates offer their greatest utility. It is at the same time the least likely to be directly managed by a virtual environment's administrators (but more on that in a moment).

The job of the validation code is to automatically validate compliance settings on a monitored object. Recall from Chapter 1 that virtually every data center‐class piece of hardware exposes collectable metrics (see Figure 3.6). So too does every hypervisor object as well as most configuration items within an OS and its installed applications. Augmented with its configuration management functionality, our black box now possesses the ability to execute code against those objects.

Figure 3.6: Monitors, (still) everywhere.

So, why not use those features to automate each object's validation?

You can do so with the right solution. In fact, that automation enables the compliance management activity to occur constantly and generally without administrator intervention. Such a solution can evaluate compliance templates with each inventory collection pass, and then again when changes are either invoked through the system or identified as having occurred.
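To show how the three template elements fit together, here is a minimal sketch in which each template entry carries a configuration item, a human-readable rule, and a small piece of validation code. The `TemplateEntry` class, the two sample settings, and the inventory dictionary are all hypothetical; real templates and the interfaces they query will look different in any given product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TemplateEntry:
    item: str                           # the configuration item
    rule: str                           # the compliance rule, in words
    validate: Callable[[object], bool]  # the validation code


# Hypothetical baseline; the settings and thresholds are invented.
baseline = [
    TemplateEntry("ssh_root_login", "must be disabled",
                  lambda value: value is False),
    TemplateEntry("password_min_length", "at least 12 characters",
                  lambda value: isinstance(value, int) and value >= 12),
]


def evaluate(inventory, template):
    """Return (item, rule) for every entry the inventoried object fails."""
    return [
        (entry.item, entry.rule)
        for entry in template
        if not entry.validate(inventory.get(entry.item))
    ]


# Settings as they might be captured by an inventory collection pass.
inventory = {"ssh_root_login": True, "password_min_length": 14}
for item, rule in evaluate(inventory, baseline):
    print(f"NON-COMPLIANT {item}: {rule}")
```

Because a setting missing from the inventory evaluates as a failure here, an object cannot pass simply by not reporting a value. That is a design choice of this sketch, and a sensible default for audit purposes.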

What results is an easy‐to‐read report that outlines the settings on each object that aren't meeting the baseline. Figure 3.7 illustrates a representation. Reports like these ease the job of auditing for auditors and security officers. These reports also ease troubleshooting, giving administrators a heads‐up warning when an errant change has taken an object out of compliance. Similar to Chapter 2's capacity management stoplight charts, these reports enable you to drill down from these high‐level monitors to expose additional detail.

Figure 3.7: Automatically validating compliance.

Centralizing change execution needn't necessarily stop with running validation scripts. The contents of these reports can further drive another capability: automated remediation. That remediation process leverages aspects of the template's validation code, often in cooperation with administrator input or established policies, to automatically correct any inappropriate settings the moment they're deemed out of compliance. The activity then simplifies to four principal steps: inventory, validate, alert, and remediate.
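The four steps can be sketched as a single pass over the environment. In this simplified version, the desired state is a flat dictionary of settings, an "alert" is a tuple added to a list, and remediation just writes the desired value back; the function name, the sample objects, and the `auto_remediate` policy flag are assumptions made for illustration.

```python
def compliance_pass(objects, desired, auto_remediate=True):
    """Walk every object: inventory, validate, alert, then remediate."""
    alerts = []
    for name, settings in objects.items():               # inventory
        for key, wanted in desired.items():
            found = settings.get(key)
            if found == wanted:                           # validate
                continue
            alerts.append((name, key, found, wanted))     # alert
            if auto_remediate:
                settings[key] = wanted                    # remediate
    return alerts


# Hypothetical environment and baseline.
objects = {
    "vm-web01": {"firewall": "on", "ntp_server": "time.example.internal"},
    "vm-db01": {"firewall": "off", "ntp_server": "time.example.internal"},
}
desired = {"firewall": "on", "ntp_server": "time.example.internal"}

for name, key, found, wanted in compliance_pass(objects, desired):
    print(f"{name}: {key} was {found!r}, corrected to {wanted!r}")
print(compliance_pass(objects, desired))  # a second pass finds nothing to fix
```

Setting `auto_remediate=False` gives the report-only behavior, which is where administrator input or policy would decide whether a correction is applied.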

Overcoming the Achilles Heel: Who Authors the Templates?

All of these templates and automations offer incredible benefit until you realize a fairly sizeable Achilles Heel: They can be cumbersome to construct. Their challenge is two‐fold. The first part is in translating a regulation's requirements into specific settings that require monitoring. Try this one, as an exercise:

PCI‐DSS Requirement 6.5.9 requires the implementation of controls that inhibit Cross‐Site Request Forgery (CSRF) attacks. It specifies a testing procedure that states, "Do not rely on authorization credentials and tokens automatically submitted by browsers."

Which of your virtual environment's objects need monitoring to meet this requirement? What interfaces expose the necessary information? Exactly what code needs to be written to verify their configuration and remediate it when it's found to be incorrect? The answers to all of these questions are complex, and likely too complex for many administrators to answer comfortably.

The problem's second part is arguably more insidious because it pits your opinion against the opinion of the person performing the audit: Assuming you determine an answer to the first part, how do you know yours is the correct answer?

All by yourself, you don't. Or, more specifically, you don't without testing your assertion against those of your auditors. Unlike the other activities discussed in this guide, compliance management is above all a test of interpretation. Most compliance mandates (internal or external) are purposely written to be open to interpretation. Their purposeful vagueness can create big challenges when opinions on implementation disagree.

Federating Interpretation

One solution that is swiftly becoming the new best practice involves mutually trusting the efforts of others. The concept works much like identity federation, where the authorization to use a service requires trusting the successful authentication from an outside authority. In the case of compliance management, the authority is an external agency while the service is the verification of compliance. These external agencies have gone through the effort to research and publish their interpretations. These are security best practices like those developed by the Defense Information Systems Agency (DISA), the National Institute of Standards and Technology (NIST), and the Center for Internet Security (CIS). They can be vendor best practices such as VMware's and Microsoft's Hardening Guidelines. They can also be industry and regulatory mandates such as the Sarbanes‐Oxley (SOX) Act, Payment Card Industry (PCI) standards, Health Insurance Portability and Accountability Act (HIPAA), and Federal Information Security Management Act (FISMA).

The compliance template in this case is the literal interpretation of the regulation's mandates. When that template is made available by the regulatory agency that owns the mandate, you can reasonably assume that meeting their template's guidelines goes far towards meeting their regulation's guidelines as well.

An important disclaimer: Even with a set of mutually‐trusted templates, this activity is still subjective. Your templates might not cover every aspect of your entire solution. They might relate only to those aspects that the virtual platform touches. That said, using templates that your auditors already trust goes far towards helping them fulfill their due diligence.

One last point: Even externally‐trusted templates won't meet all your baseline needs. A good solution will provide a mechanism for ingesting existing templates as well as constructing your own for any environment‐specific requirements that aren't captured elsewhere.
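To make the idea of combining templates concrete, here is a minimal Python sketch. It assumes a template is simply a set of required settings. The setting names, the values, and the rule that a custom entry may not silently override a trusted one are all illustrative assumptions, not features of any particular product.

```python
# Hypothetical example: setting names and values are illustrative only.
def merge_templates(trusted, custom):
    """Combine an externally trusted template with site-specific rules.

    Custom rules are layered on top. A custom rule that contradicts a
    trusted rule raises an error so a person can resolve the conflict.
    """
    merged = dict(trusted)
    for setting, required in custom.items():
        if setting in merged and merged[setting] != required:
            raise ValueError(f"custom rule conflicts with trusted template: {setting}")
        merged[setting] = required
    return merged


def find_violations(config, template):
    """Return {setting: (required, actual)} for every rule the config fails."""
    return {
        setting: (required, config.get(setting))
        for setting, required in template.items()
        if config.get(setting) != required
    }


trusted_template = {"ssh_enabled": False, "lockdown_mode": True}
custom_rules = {"ntp_server": "ntp.example.internal"}
baseline = merge_templates(trusted_template, custom_rules)

host_config = {"ssh_enabled": True, "lockdown_mode": True}
violations = find_violations(host_config, baseline)
```

In this sketch, the host fails one trusted rule and one environment‐specific rule, and both appear in the same report. That single report is the practical benefit of ingesting outside templates and your own additions into one baseline.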

Activity Consolidation: The New Best Practice

By this point, you've probably recognized the central theme behind this guide's new best practice assertions: activity consolidation. By following the black box approach and layering each activity's features on top of the other, a virtual environment gains a single pane of glass for managing every behavior. With a well‐constructed metrics engine, performance and capacity can be managed via the same platform that handles configurations and compliance.

The next and final chapter concludes this conversation by adding one more major activity to the model: workload automation. In Chapter 4, you'll learn how the bundling of configurations into runbooks and policies goes far towards automating large‐scale and complex activities. As before, all these best practices fit together to create a cohesive management experience.

Workload Automation in Virtualized and Cloud Environments

There's no question that automation has become increasingly important in recent years. Much of that focus is a direct result of the sheer workload IT is expected to handle. There are simply too many servers, clients, and resources these days for each to be managed manually and individually.

You can argue that the increasing workload is a product of prior successes. As virtualization began enabling ever‐faster creation of ever‐more machines, IT excitement quickly turned to fear. That (incredibly warranted) panic centered on the effects of virtual machine sprawl. Suddenly awash in a sea of new computer instances, IT staff realized in hindsight that, "When a thing becomes easy to do, we do that thing."

Complicating those fears is the unbounded promise offered by public cloud computing providers. They tease, "Come to us when you have needs. Our resources are (effectively) unlimited." Happy words like burstability and elasticity get casually thrown around, while just under the surface lie others that are more concerning: consumption‐based pricing is one; pay‐as‐you‐go is another.

The cloud indeed offers limitless resources, priced by the hour—or whatever pricing scheme you negotiate. But using those resources smartly requires some up‐front intelligence. Are they even a good idea? When are they affordable? When will they break the bank?

Metrics are obviously the answer. Gathering those metrics is the job of the very monitors we've discussed throughout the entirety of this book. Those metrics and monitors bring quantification to the variety of behaviors in a virtual environment. They illuminate the intelligent decisions one should make when pondering what to do with all these ever‐increasing workloads.

That said, metrics and monitors are but the intelligence. They provide the information. Actually making change in a measured and predictable manner requires a fourth new best practice: workload automation. Herein lies the key to bringing everything in your data center back under control: Not just any automation will do. The smart data center recognizes that intelligence and automation go together—intelligent automation. Asserting that control begins by letting the computers manage themselves.

Why to Automate

Few organizations consider virtualization without realizing they'll need some measure of automation to make it manageable and reach a greater level of efficiency and agility. That's an easy assumption, but it's only the first step. Look past mere virtualization and towards cloud management, and you'll quickly realize that automated management isn't the only driver.

Each chapter has introduced new management activities that tie into a black box approach (Figure 4.1). The first three chapters argue that performance, capacity, and configuration/compliance management benefit from the logic built into the box. The approach should now make sense. After all, the more your technology knows about itself, the better it is at managing itself. That's a fact that becomes increasingly important as a data center scales to the very large.

Figure 4.1: Intelligent workload automation.

This fourth and final chapter concludes the conversation by tying workload automation into the operations management experience. Critically important to recognize here is the intelligence the black box can add into each automation.

As organizations consider the tasks they've traditionally regarded as overhead, the term automation begins to take on broader meaning. In addition to day‐to‐day management, organizations begin to realize automated operations might be an option for other tasks they may have never considered. Here are three important terms: provisioning, reprovisioning, and deprovisioning.


Automations in the provisioning activity are perhaps the easiest to comprehend. After all, the point of a cloud management infrastructure is to accelerate the creation of new servers and services. What's important to recognize, however, is that provisioning in this context needn't be driven only by "creating new virtual machines." It can and often does also reference the ability to provision entire services on demand.

Doing so requires automations. But, as you've already learned, this method requires being smart about how those automations perform their tasks.

One critically important measurement is the impact of potential new servers and services on those already existing. The metrics captured inside the black box can ensure that creating a new service won't disrupt what is already running. You don't want to spin up 50 new high‐load virtual machines in the middle of the day on a host running critical workloads. Automation enables a kind of situational awareness in places that are too complex to manage on your own.
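A provisioning automation can consult collected metrics before it acts, as the following rough Python sketch illustrates. The host names, the CPU figures, and the headroom percentages are assumptions chosen for illustration. A real platform would weigh memory, storage, and network as well.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    cpu_capacity_mhz: int
    cpu_used_mhz: int
    runs_critical: bool


def can_place(host, vm_count, mhz_per_vm, headroom=0.20, critical_headroom=0.40):
    """Return True if the host can absorb the new VMs and keep its reserve.

    Hosts running critical workloads keep a larger reserve (assumed values).
    """
    reserve = critical_headroom if host.runs_critical else headroom
    projected = host.cpu_used_mhz + vm_count * mhz_per_vm
    return projected <= host.cpu_capacity_mhz * (1 - reserve)


def pick_hosts(hosts, vm_count, mhz_per_vm):
    """List the names of hosts that could take the whole request."""
    return [h.name for h in hosts if can_place(h, vm_count, mhz_per_vm)]


hosts = [
    Host("esx01", cpu_capacity_mhz=40000, cpu_used_mhz=22000, runs_critical=True),
    Host("esx02", cpu_capacity_mhz=40000, cpu_used_mhz=8000, runs_critical=False),
]
candidates = pick_hosts(hosts, vm_count=10, mhz_per_vm=2000)
```

Here the host running critical workloads is excluded because the request would eat into its reserve, while the lightly loaded host qualifies. A request for 50 such machines would return no candidates at all, which is exactly the midday mistake the automation is meant to prevent.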


The reprovisioning activity can be a bit less obvious, but is no less important.

Reprovisioning focuses on the delivery of wholesale configuration changes that are the result of some other activity. For example, imagine your business suddenly falls under a new legal or regulatory requirement. That requirement must be immediately implemented across a large group of servers. Reprovisioning goes about updating those servers with the new requirement in a measurable and predictable way.

Workload Automation and Configuration Management Are Linked Activities

You might be asking yourself, "Isn't this merely configuration management with a different buzzword?" In a way, it is. Automating the delivery of new servers and services and later reprovisioning updates all involve changing an environment's configuration. Thus, configuration management is indeed involved.

That said, the act of actually delivering each change can be accomplished through a variety of means. You can click buttons manually. You can script the change. You can leverage tools that automate change delivery for you.

Workload automation strives to accomplish that delivery without requiring cumbersome and manual effort. As you'll learn in a minute, the new best practice leans on tools to assist.


A third activity is deprovisioning, which focuses on the activities involved in removing servers and services from an environment. This task might seem simple, but deprovisioning correctly can require far more planning than one might think. Indeed, you must occasionally delete a virtual machine when it is no longer relevant. Workload automation becomes critical when that decommissioning needs to happen automatically, either based on predetermined environmental conditions or as a matter of life cycle management.

An example can help here. Consider the situation where some IT service requires one or more Web servers. A single Web server might be necessary when user load is nominal. More than one Web server might become necessary when user activity exceeds a certain threshold. Later on, when activity returns to nominal, those extra Web servers are no longer necessary.

Workload automation enables IT to provision preconfigured Web servers to meet the increasing user load situation. It further enables the deprovisioning of those servers the moment they become unnecessary. This job is a configuration task, but it is also a monitoring task.

The key capability driving these decisions is the intelligence built into the black box. It monitors for performance, so it is aware of available capacity. That gives a well‐constructed automation the intelligence it needs to deploy (and later decommission) the necessary resources.
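The Web server scenario above can be sketched in a few lines of Python. The figure of 500 users per server and the minimum and maximum server counts are assumptions for illustration only. Real thresholds would come from the performance data the black box collects.

```python
def desired_web_servers(active_users, users_per_server=500, minimum=1, maximum=8):
    """Return how many Web servers the current user load calls for."""
    needed = -(-active_users // users_per_server)  # ceiling division
    return max(minimum, min(maximum, needed))


def plan_change(running, active_users):
    """Return an (action, count) pair describing what the automation should do."""
    target = desired_web_servers(active_users)
    if target > running:
        return ("provision", target - running)
    if target < running:
        return ("deprovision", running - target)
    return ("none", 0)


busy = plan_change(running=1, active_users=1800)
quiet = plan_change(running=4, active_users=300)
idle = plan_change(running=1, active_users=0)
```

With one server running and 1,800 active users, the plan is to provision three more. When load drops back to 300 users, the plan is to deprovision the three extras. A production automation would likely add a delay before scaling in so that brief dips in activity don't cause servers to be removed and recreated repeatedly.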

It is worth restating here, "When a thing becomes easy to do, we do that thing." Workload automation in today's virtual and cloud environments has evolved past merely speeding up patch deployment or updating a few user accounts. The new best practice seeks to impact every aspect of a workload's life cycle, from creation through ongoing management and all the way to end‐of‐life decommissioning.

How Does One Automate?

You might now be asking yourself, "I get it. How then do I automate?" The answer depends on your needs and the solution you're using.

Part 1: Scripting

Scripting and the use of scripts have long been the go‐to approach for automation. In recent years, Windows PowerShell has evolved to become a primary script environment. VMware's vSphere offers robust scripting support via its PowerShell‐based PowerCLI toolkit. Microsoft's Hyper‐V uses it, as does Citrix's XenServer to a lesser extent.

Kits such as these enable administrators to quickly create scripts that perform a variety of tasks. Those scripts take time to create, debug, and perfect, but that time is commonly viewed as an investment. The scripter expects that the task once automated will take less subsequent manual effort to perform: pay now, benefit later.

At the end of the day, scripts are code. They're vastly extensible, but they can be a pain in the neck to generate. The code that follows (see Figure 4.2), for example, is a small portion of a much larger PowerShell script. In it, the New-VM cmdlet creates a virtual machine. The configuration of that virtual machine is based on the list of parameters that are supplied. Once the virtual machine is created, the script's next cmdlet, Start-VM, powers it on.

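# Create the virtual machine from a template, using values assigned earlier in the script
New-VM -Name $VMName -OSCustomizationSpec $Customization -Template $Template -VMHost $VMHost -Datastore $Datastore -RunAsync
# Power on the newly created virtual machine
Start-VM -VM $VMName -RunAsync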

Figure 4.2: An excerpt from a PowerShell script.

There's incredible power here, but there can also be incredible problems. Getting value out of scripts first requires a significant investment in scripting. Administrators with the right expertise are also required, and not every administrator has the aptitude or the interest.

Pressing on through and learning from failures is a further requirement. Administrators without formal software development experience can generate code that is less maintainable, less robust, and/or less reliable than it needs to be. Their efforts are also less reusable and less modular, which reduces the return on the investment in creating scripts.

Scripts have another downside, too, in that they don't tend to be designed for accessibility by anyone other than a skilled administrator. Often, they're meant for use exclusively by the person who wrote them in the first place. As a consequence, they're often written with unavoidable quirks or compromises that limit their effectiveness in true automation. Poorly‐constructed scripts might have to be manually run and monitored by that skilled individual or are not parameterized to a level where their reuse is cost effective.

Essentially, a poorly‐conceived script isn't "shrink‐wrapped," in that others can't just pick it up off a shelf and use it elsewhere.

Most importantly, most scripts also lack intelligence. Although it's easy with a modern hypervisor to script the creation of a new virtual machine, there's more to automation than merely performing the task. When should the task be performed? On what hosts? For what reason? What happens next? A script can certainly be programmed to answer these questions (reaching out and checking schedules, verifying capacity, measuring current workload, and so forth), but the sheer number of decision and data points can rapidly turn a simple script into a major development effort.

Scripting's Biggest Problem: The Scripter?

A decade ago, I was a systems administrator for a large defense contractor, and was known for finding automation solutions for many of my job's manual tasks. The scripts I created worked great while I was responsible for them. They created problems after I left that employer. Many "automated" tasks were in fact completed by a small piece of code that no one else knew existed—until one day it ceased to function. I still get calls every so often when one of my little buried automations stops working.

In the end, I might have created more problems over time than I solved. Luckily, my employer still knows how to find me. Not every business is so lucky.

Part 2: Objects and Runbooks

So if scripts are good but scripts are bad, then what's an automation‐seeking virtual environment to do? One approach is to create an "overarching management solution" that focuses its energies on reusable management objects. Although each object is really a script at heart, its creation as an object needs to ensure that it possesses the necessary input, output, and processing components that facilitate its use within a greater framework.

Figure 4.3 shows an example of how three objects can be collected to enact a change. These objects don't necessarily eliminate the scripts themselves. Rather, they encapsulate them into specific units of functionality within the context of an overarching management solution. The Measure‐Performance and Validate‐Capacity functions in Figure 4.3 represent those functionality units.

Figure 4.3: Wrapping scripts into management objects.

This approach aids in making scripts more production‐friendly and more reusable. It lets organizations treat scripting automation as a solution rather than a series of break‐fixes. The scripts themselves gain as well. Those generated within a larger framework tend to be more structured, more reliable, and more maintainable over time.
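One way to picture a management object is as a script wrapped with declared inputs and outputs so that a framework can chain it with others. The Python sketch below is an assumption about how such a wrapper might look, not a description of any specific product. The two object names echo those mentioned for Figure 4.3, while the field names and the percentage values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ManagementObject:
    """A script wrapped with declared inputs so a framework can chain it."""
    name: str
    required_inputs: tuple
    action: Callable[[dict], dict]

    def run(self, context):
        missing = [key for key in self.required_inputs if key not in context]
        if missing:
            raise KeyError(f"{self.name} is missing inputs: {missing}")
        # The object's output is merged into the context for the next object.
        return {**context, **self.action(context)}


def measure_performance(ctx):
    return {"cpu_free_pct": 100 - ctx["cpu_used_pct"]}


def validate_capacity(ctx):
    return {"capacity_ok": ctx["cpu_free_pct"] >= ctx["cpu_needed_pct"]}


workflow = [
    ManagementObject("Measure-Performance", ("cpu_used_pct",), measure_performance),
    ManagementObject("Validate-Capacity", ("cpu_free_pct", "cpu_needed_pct"), validate_capacity),
]

context = {"cpu_used_pct": 70, "cpu_needed_pct": 20}
for obj in workflow:
    context = obj.run(context)
```

Because each object states what it needs and returns what it produces, the framework can check a workflow before running it and can reuse the same object in many workflows. A standalone script rarely offers that kind of structure.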

The framework has a name as well: runbook automation, or RBA. You can think of RBA as the environment in which all these automations are orchestrated (e.g., constructed), scheduled, and tested. The results of their actions can be reported on. Most importantly, the overarching RBA solution can take cues from the black box intelligence to help make more‐informed decisions.

That situational awareness lets RBA tools be used as remediation systems for specific problems. Once tied into monitoring (Figure 4.4), a poor performance condition can trigger an RBA action to remediate the problem and alert the administrator.

Figure 4.4: Integrating monitors into automations.
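The link between a monitor and a runbook can be approximated as a lookup from a detected condition to a remediation routine, followed by an alert. The Python sketch below is a simplified assumption. The condition names, target names, and runbook functions are hypothetical, and the functions only return a description rather than changing anything.

```python
def restart_service(event):
    return f"restarted service on {event['target']}"


def migrate_vm(event):
    return f"migrated {event['target']} to a less busy host"


# Monitored condition -> runbook. Condition names are hypothetical.
RUNBOOKS = {
    "service_unresponsive": restart_service,
    "host_cpu_saturated": migrate_vm,
}


def handle_event(event, notify):
    """Run the matching runbook, then tell the administrator what happened."""
    runbook = RUNBOOKS.get(event["condition"])
    if runbook is None:
        notify(f"no runbook for {event['condition']} on {event['target']}; manual action needed")
        return None
    result = runbook(event)
    notify(f"auto-remediated {event['condition']}: {result}")
    return result


alerts = []
handle_event({"condition": "host_cpu_saturated", "target": "vm-web-03"}, alerts.append)
handle_event({"condition": "disk_failure", "target": "esx02"}, alerts.append)
```

In this sketch the administrator is alerted either way. What changes is whether the alert reports a problem already handled or a problem that still needs a person.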

A key benefit of an RBA framework is that it needn't require specific functionality built into the technology it manages. For example, as long as the resource you're attempting to automate has scripting exposure of some form, the resource can be acted upon by objects in the RBA solution. Less‐mature resources can be provisioned, managed, and deprovisioned through RBA tools, without needing any specialized automation "hooks" inside the resource itself.

RBA can very obviously facilitate back‐end administrative activities, but smartly designed RBA solutions can also expose those activities to front‐end users. Herein lies the advantage in self‐service. With enough intelligence built into the system, the provisioning, reprovisioning, and deprovisioning tasks become exceptionally well‐suited for RBA (and, thus, self‐service) execution. You can see why self‐service, with the right user in mind, is quickly becoming the new best practice.

An Aside: Private Cloud to Hybrid Cloud

It is worth pausing this conversation for a quick tip‐of‐the‐hat towards automation's role in managing resources that sit outside the local data center. An automation framework is useful for automating workload activities in a private cloud, but it is arguably more important when considered in the context of a hybrid cloud.

A key characteristic of hybrid cloud computing is burstability, which is the ability to augment and extend services wherever and whenever necessary. The public cloud portion of hybrid cloud is by definition a (virtually) unlimited pool of resources. Your cloud provider relationship makes available a reasonably unending capacity of resources that you can consume when your needs require.

Those resources don't come inexpensively, nor are they priced in ways that are familiar to traditional IT. Hybrid cloud computing's central hurdle is arguably its pay‐as‐you‐use costing model. This model does wonders for eliminating capital expenditures and reducing the impact of unused resources, but it can be painful on the monthly bill when eagerness exceeds actual usage.

The problem is that data centers tend to always need more: more resources, more machines, and more services. Left unchecked, this tendency towards always more can quickly create a cost problem in the pay‐as‐you‐use public cloud. One counters that problem with the resource usage quantifying assistance of the black box, or, in plainer English, "You've got to know what you're using if you're to know what it costs."
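A small worked example shows why usage data matters under pay‐as‐you‐use pricing. The hourly rate and instance counts in this Python sketch are invented for illustration and do not reflect any provider's actual prices.

```python
def monthly_cost(instances, hourly_rate, hours_per_day, days=30):
    """Estimated monthly charge for instances billed by the hour."""
    return instances * hourly_rate * hours_per_day * days


# Ten instances left running around the clock (hypothetical rate of 0.10 per hour).
always_on = monthly_cost(instances=10, hourly_rate=0.10, hours_per_day=24)

# The same ten instances running only during a four-hour daily peak.
burst_only = monthly_cost(instances=10, hourly_rate=0.10, hours_per_day=4)

wasted = always_on - burst_only
```

Under these assumed numbers, leaving the instances on costs 720 per month versus 120 when they run only during the peak. The difference is money spent on capacity nobody used, and it only becomes visible once actual usage is measured.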

Another key aspect of a hybrid cloud environment is the seamless flexibility that exists between local and non‐local resources. Figure 4.5 shows a representation of various workloads that could be burst into a connected provider's public cloud.

Figure 4.5: Workload automations extend into the hybrid cloud.

These workloads include activities such as:

  • Live migrating virtual workloads to free local resources
  • Spinning up additional resources when demand dictates
  • Powering down resources when demand decreases
  • Relocating workloads when their usage patterns make more sense in the public cloud space

Important to recognize here is the connection between the private and public cloud entities. That connection facilitates the backward‐and‐forward flow of workload processing as demands change. A virtual machine that starts its life in the private cloud might later be better positioned in the public cloud (or vice versa) due to the conditions of the day. Alternatively, an IT service designed for the private cloud can burst onto public cloud resources when its user load increases.

Part 3: Policies and Autonomous Management

Policy‐based administration is generally held to be more automated, more proactive, and more desirable than standalone scripts or even RBA tools. In a typical policy‐based administration setup, administrators define groups of desired configuration settings. The technology being administered—a hypervisor, for example—is configured by its vendor to read and understand these policies, and to configure itself to ensure their compliance.

This model is significantly different than scripting and RBA. Rather than giving the hypervisor the steps to reconfigure itself, you simply configure it with the end configuration state you desire. Armed with that information, it is then enabled to perform whatever is necessary under the hood to meet the requirement.

The policy‐based approach offers certain distinct advantages, particularly in large and well‐controlled environments. In these situations, when a desired configuration changes (because of some new operational requirement, for example), one simply alters the top‐level policies. Your resources can then adjust themselves accordingly. When paired with an enterprise‐ready solution, the approach can scale to the near‐infinite levels one expects in a public and/or hybrid cloud environment.

Important to recognize is that policies aren't scripts, nor are they configuration objects like those used in RBA. With policies, you're not running a script against 10,000 machines. Rather, you're communicating a new configuration target to those machines, then letting them intelligently reconfigure themselves.
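The difference between sending steps and sending a target state can be illustrated with a short Python sketch. It assumes each machine can compare its own settings with the policy and apply only the differences. The host names and settings are hypothetical.

```python
def reconcile(current, desired):
    """Return only the changes a node must make to reach the desired state."""
    return {
        key: value
        for key, value in desired.items()
        if current.get(key) != value
    }


# The policy states the end configuration, not the steps to reach it.
policy = {"ssh_enabled": False, "log_retention_days": 90}

nodes = {
    "esx01": {"ssh_enabled": True, "log_retention_days": 90},
    "esx02": {"ssh_enabled": False, "log_retention_days": 30},
    "esx03": {"ssh_enabled": False, "log_retention_days": 90},
}

pending = {}
for name, state in nodes.items():
    pending[name] = reconcile(state, policy)
    state.update(pending[name])  # each node applies only what it needs
```

Each node ends up making a different change, or none at all, from the same policy. Running the comparison a second time produces no further changes, which is what makes the approach safe to repeat across thousands of machines.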

Policy‐based administration works well for static configuration items such as security settings or broad operational parameters. Policies can be used for automation, as well. For example, a policy‐enabled hypervisor can work with policies that define contingent actions, such as what to do in various failure scenarios. They can also define instructions for when a host's resource consumption exceeds its capacity. Policies enable a virtual platform to automatically respond to operational conditions without having to wait for monitoring to notice the problem and alert an administrator who then runs a script or runbook to remediate the situation.

As you can imagine, a policy‐ready hypervisor has to be a smarter hypervisor. That hypervisor must understand the impact of workloads, their resource usage, and their importance to your business operations. That hypervisor must understand something of a workload's history as well as its configuration. Armed with this intelligence, the underlying system can know what the workload is intended to do. It can then make smarter decisions—guided by policies—at every point in time. What this setup creates is an environment of autonomous management, which is arguably the ultimate form of automation.

Autonomous Management: An Example

Autonomous management might sound like a far‐fetched concept, but its foundations are merely another approach in combining intelligence with actions. For example, suppose a virtual host recognizes that it is reaching capacity. It knows this because it has been configured to monitor for this exact situation. It also knows to analyze its list of policies to match what it sees with what it should do.

Your policies might prioritize demand so that highly critical workloads are left alone if at all possible. Other resources may be migrated to another host. Very low‐priority workloads might be suspended entirely until the situation has resolved.

All of these actions can be initiated automatically because they are triggered by recognizable behaviors. By merely classifying workloads, one can implement policies to manage those workloads more effectively. Workloads with changing priorities can also be easily reclassified. Has a certain workload become more mission critical than it was in the past? No need to reprogram a bunch of scripts. Simply reclassify the workload and let policies handle the rest.
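The classification idea can be sketched as a simple table that maps a workload's class to an action. In the Python example below, the class names, workload names, and actions are assumptions made for illustration.

```python
# Action to take for each workload class when a host nears capacity (assumed mapping).
ACTIONS_BY_CLASS = {
    "critical": "leave",
    "standard": "migrate",
    "low": "suspend",
}


def plan_relief(workloads, classes, host_at_capacity):
    """Map each workload to the action its classification calls for."""
    if not host_at_capacity:
        return {name: "leave" for name in workloads}
    return {name: ACTIONS_BY_CLASS[classes[name]] for name in workloads}


workloads = ["billing-db", "intranet-web", "test-build"]
classes = {"billing-db": "critical", "intranet-web": "standard", "test-build": "low"}
before = plan_relief(workloads, classes, host_at_capacity=True)

# The business decides the intranet site is now mission critical.
classes["intranet-web"] = "critical"
after = plan_relief(workloads, classes, host_at_capacity=True)
```

Reclassifying the intranet workload changes its outcome from "migrate" to "leave" without touching the decision logic. That one-line change stands in for the script rewrite that would otherwise be needed.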

Part 4: Self‐Service

The entirety of this conversation is designed to lead towards a fourth and final automation: self‐service. The notion of self‐service involves moving certain tasks from the responsibility of the virtual administrator onto the requesting individual. Self‐service has gotten significant attention lately due in part to the increasing complexity and scale of many virtual and cloud environments. Simply put, IT environments are becoming so large and so well‐automated that it can begin to make sense to let users manage their own resources.

Self‐service needn't focus exclusively on servers. With the right automations in place, one can offer self‐service for entire services that users need (Figure 4.7). Such self‐service needn't necessarily expose the entire gamut of administrative controls, merely those that make sense for the self‐serving user.

Figure 4.7: Self‐service can deliver servers, but it can also deliver entire services.

Who Is the Self‐Server?

For many in IT, the notion of self‐service raises the hairs on the back of the neck. To this group, self‐service seems to run counter to the charter of the IT organization: managing computing assets.

What's interesting is that these people might be incorrectly considering self‐service's end user. In many cases, that end user won't be the regular, run‐of‐the‐mill person who happens to need computer resources to accomplish a job. It can be, but the everyday user today still relies on IT to actually manage the services they consume.

The intended user of self‐service is usually someone else in IT. That's because self‐service's automations directly align with a growing trend in IT.

Consider for a minute your IT organization. You probably have people who manage the data center infrastructure. Others manage the applications atop that infrastructure. Self‐service enables the first group to automate the delivery of computing resources to the second. It is a natural extension of what most IT organizations already do today: Someone needs a resource; a different person gives them that resource. Self‐service merely eliminates extra work for the middleman.

Key to this offering is the creation of a bounded experience. One obviously can't allow (even trusted IT) individuals to start spinning up virtual machines whenever they need. Their experience requires controls: Control over when that happens, where the workload is hosted, how many resources they're allowed to consume, and so on.

These controls become the logical extension of self‐service's separation of duties. The virtual administrator maintains environment performance and capacity; the requestor works within the boundaries they're given. Maintaining those boundaries happens via the same data being collected by the black box. With the right tools in place, the black box and the infrastructure itself help to provide those boundaries. And, again with the right tools in place, freeing designated users to manage their own resources is absolutely becoming the new best practice.
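A bounded self‐service experience can be pictured as a request that is checked against limits before anything is created. The following Python sketch is illustrative only. The quota fields, the cluster names, and the numbers are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    max_vms: int
    max_vcpus: int
    allowed_clusters: tuple


def review_request(quota, in_use_vms, in_use_vcpus, vms, vcpus_each, cluster):
    """Return (approved, reason) for a self-service request."""
    if cluster not in quota.allowed_clusters:
        return (False, f"cluster {cluster} is outside this requestor's boundary")
    if in_use_vms + vms > quota.max_vms:
        return (False, "VM quota exceeded")
    if in_use_vcpus + vms * vcpus_each > quota.max_vcpus:
        return (False, "vCPU quota exceeded")
    return (True, "approved")


# Boundaries the virtual administrator sets for an application team (assumed values).
app_team = Quota(max_vms=10, max_vcpus=32, allowed_clusters=("dev", "test"))

small = review_request(app_team, in_use_vms=6, in_use_vcpus=20, vms=2, vcpus_each=4, cluster="dev")
large = review_request(app_team, in_use_vms=6, in_use_vcpus=20, vms=4, vcpus_each=4, cluster="dev")
wrong_place = review_request(app_team, in_use_vms=6, in_use_vcpus=20, vms=1, vcpus_each=2, cluster="prod")
```

The requestor gets an immediate answer, and the administrator's effort goes into setting the boundaries once rather than reviewing each request. In practice, the in-use figures would come from the same metrics the black box already collects.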

The Right Tools

You can probably surmise by this point that the thesis behind each of these new best practices is that you can't effectively do this unaided. Even with all their benefits, today's virtual environments have become just too complex to realize their greatest benefit without assistance. Odds are good that you've already implemented your hypervisor of choice along with its management platform; odds are also good that the management platform isn't enough. Also needed are the additional services one gets from the extra tools that integrate performance, capacity, and configuration and compliance management with workload automation.

Further, if you've struggled with understanding your virtual environment's fit into the greater cloud story, your challenges might be directly related to exactly those services you're missing. Effective tools like those you've learned about in this book are becoming ever‐more necessary to first help you understand that fit, then take advantage of everything the cloud has to offer.