Operational efficiency and effectiveness in the business environment require a keen understanding of the complex nature of, and interactions among, software, hardware, and personnel. It is also necessary to automate as many processes as possible that have a definable, repeatable pattern of execution or flow. Today's complex IT systems contain more constants and variables than any single human mind can process and comprehend in a meaningful way, simply because humans are ill-equipped to handle so much information at once.
Within modern IT operations, there is tremendous unharnessed potential. There are also immense gaps among diverse, multi-disciplinary architectures, infrastructures, and platforms in use. To compound these problems, IT must cope with rapid proliferation of issues related to absorbing existing business infrastructures, acquiring new equipment, dealing with human error, and working from an often incomplete view of centralized management and task automation across heterogeneous environments.
ITPA represents the convergence between routine tasks and best practice initiatives addressed through software-centric approaches. Traditionally, automated interactions have been handled using batch scripting and on-site custom programming tools and techniques. Modern computing and telecommunications enterprises process millions of transactions through diverse organizational infrastructures and various systems that execute thousands of different tasks, many of which are similar in nature, though individual execution details may differ. Establishing cohesive operation and centralized software management across an entire IT infrastructure and all its various components and elements presents significant challenges, particularly during routine system maintenance, and can drain resources and time. In fact, task automation's biggest appeal is the time and effort it frees so that IT staff can do more than put out fires or deal with the crisis du jour.
Indeed, certain kinds of IT systems come equipped to support task automation. Thus, for example, systems management products (backup applications, monitoring services, and Help desk platforms) often include prepared scripts intended for site-specific customization to automate peripheral tasks once such products are installed. But the success of such automation is often limited when seeking to automate incident response or problem management procedures. That's because scripted automation suffices for most basic tasks but fails to scale well across multiple products and platforms. This goes double in environments in which operational parameters experience changes to processes and platforms depending on local execution environments.
Today's enterprise network landscape incorporates numerous discrete but interrelated infrastructure elements—applications, databases, services, and hardware—and encompasses a variety of management disciplines, interfaces, tools, and dashboards. Typically, these elements are lashed together with chewing gum and baling wire. Nevertheless, the expectation is that such a patchwork assemblage will work cohesively, even though in practice the cohesion among such diverse sets of components is seldom transparent and never seamless.
Figure 1.1: Modern enterprises must often rely on multiple, separate, and incompatible monitoring and reporting tools with piecemeal script-based automation.
Run Book Automation (RBA), as coined by Gartner research, represents an emerging technology space architected around various sets of standards. Early adopters turn to RBA to address basic enterprise needs for coherent, end-to-end task automation across the IT landscape. These early adopters also typically seek to fill the gaps using turnkey solutions instead of patching site- or system-specific solutions together.
Run Book Automation (RBA) is a term coined by Gartner research and is interchangeable with IT Process Automation (ITPA) as used by other research firms. Throughout, this text uses both terms and abbreviations to sidestep any potential confusion between them because they describe the same principles and properties in IT management. A more thorough definition of RBA also appears later in this chapter.
Increasing application complexity and platform interdependencies only add to the difficulties that IT operations managers and administrative support personnel experience on a daily basis. Creating a cooperative environment among various separate and unrelated applications, processes, and people is a daunting task all by itself. It's an effort that can stretch the imagination and exhaust employees as they attempt to keep pace with a dynamic and ever-changing enterprise IT landscape. Also consider that many IT environments inherit other platforms and platform-specific issues when businesses merge, with no clear or concise way to unify hitherto separate and unrelated management and maintenance processes.
Furthermore, the rising costs of associated labor are both undeniable and exponential. For every dollar spent on hardware, approximately $4 is spent on the human resources needed to govern it. A major contributor to this problem is the direct connection between personnel and platforms, which translates to an overburdening of personal responsibility when it comes to maintaining and managing routine operational aspects of an IT infrastructure. The increasing share of resources devoted exclusively to repetitive manual tasks is both costly and concerning: nearly 60% of IT staff resources are devoted to handling daily routines and activities within enterprise networks.
To better understand the concepts, capabilities, and objects related to RBA, one must first understand the key elements that drive and influence ITPA within a large-scale data center environment. In turn, this means understanding workflow and issues that pertain to scripted automation, as discussed in the following sections.
As any computer network administrator will tell you, there is a surplus of redundancy and repetition in the daily routine of manually administering, managing, and maintaining the enterprise network infrastructure. Many of these tasks negatively impact operational efficiency, resolution timeframes, personnel response times, and so forth. This is particularly problematic where a diverse set of systems and hardware platforms (for example, AIX, BSD, HP-UX, Linux, UNIX, and Windows) often work alongside one another, each with its own operational peculiarities and particular management needs.
Automation is a key aspect of operational efficiency in practically every large-scale computing-based scenario. The ability to maintain a state of self-servicing equipment and software leverages time and resources in a manner that lends greater flexibility to often-overworked administrative personnel, freeing them to engage in other necessary work-related tasks. That said, establishing automation on a broad scale requires the ability to define and prioritize the activities, events, processes, procedures, and tasks related to the operational aspects of the IT infrastructure.
Figure 1.2: Workflow includes all aspects of enterprise infrastructure and is indifferent to platform, service, and application differences and incompatibilities.
A majority of the repeated actions used to maintain and manage functional computer networks can be defined as a set of procedures and processes. In turn, these may then be translated into fully automated, self-administered routines. The term workflow describes any predictable, repeated pattern of activity provided through systematic organization of resources, roles, and information into a documented and iterative work process.
Typically, a workflow will be designed to attain processing goals: service provisioning, information processing, or other tasks related to enterprise IT infrastructure management. A workflow consists of multiple discrete processes—best understood as narrowly focused IT tasks, defined by their inputs, outputs, and purposes—that can be planned and scheduled as a logically, partially ordered set of activities intended to accomplish a specified goal.
A process is any set of individual tasks coordinated into a series or sequence of logical events, which can encompass many applications, data sources and tool sets. A process is the basic atomic unit of any workflow procedure.
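The idea of a workflow as a partially ordered set of processes, each defined by its inputs, outputs, and purpose, can be sketched in a few lines of Python. This is an illustration only, not how any particular RBA product models workflows: the `Process` class, the process names, and the inputs and outputs are all hypothetical. The sketch uses the standard library's `graphlib` module (Python 3.9 or later) to derive one valid execution order from the declared dependencies.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Process:
    """A narrowly focused IT task, described by its purpose, inputs, and outputs."""
    name: str
    purpose: str
    inputs: tuple = ()
    outputs: tuple = ()


# Hypothetical processes for a nightly backup workflow.
PROCESSES = [
    Process("mount_volume", "Mount the archival storage volume", (), ("volume",)),
    Process("stop_service", "Quiesce the application", (), ("quiet_app",)),
    Process("run_backup", "Copy data to the volume", ("volume", "quiet_app"), ("archive",)),
    Process("start_service", "Resume the application", ("archive",), ("running_app",)),
    Process("email_report", "Report the outcome", ("archive",), ()),
]


def execution_order(processes):
    """Return one valid ordering in which every input is produced before it is consumed."""
    producers = {out: p.name for p in processes for out in p.outputs}
    sorter = TopologicalSorter()
    for p in processes:
        # A process depends on whichever process produces each of its inputs.
        sorter.add(p.name, *(producers[i] for i in p.inputs))
    return list(sorter.static_order())


order = execution_order(PROCESSES)
print(order)
```

Because the ordering is only partial, `mount_volume` and `stop_service` have no dependency on each other and could run in either order, or in parallel; the same holds for `start_service` and `email_report` once the backup completes.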
Unfortunately, current patterns of workflow and system-related activity are not self-sustainable; that is, these things can't take care of themselves. Hence, the majority (60%) of IT resources, including time, energy, personnel, and so forth, are expended to deal with mundane housekeeping tasks related to business computing environments. This leaves only 40% of available resources to handle all other aspects of IT governance and operation, which is clearly neither a model of efficiency nor a situation in which growth and innovation can thrive. Factor in the unwanted complications of administering dynamic, growing elements of the IT infrastructure in conjunction with existing ones, and the problem grows geometrically complex and invites a multitude of distinct possibilities for human error.
Reliable as the best support staff may be, the exigencies involved in properly maintaining a large and complex infrastructure invariably foster man-made problems. Absentmindedness, unverified changes, incalculable inaccuracies, and other problems introduced via human fallibility add to underlying hands-on management problems. A support staff that's already stretched cannot expect to cope with unforeseen complications that arise from even the simplest mistakes at the same time that it seeks to improve operational capacity or efficiency.
Repetitive human activity and action in an environment devoid of automated scheduling, tracking, and audit enables errors to occur and mount up. Human error also incurs all kinds of unnecessary costs—in time, money, and labor—for correcting the consequences, and leaves little or no resources available for planning growth or for future expansion.
Another inherent issue with scripted automation in the enterprise environment is the simple fact that many scripts lack agility and flexibility, and thereby hinder business process flow. Scripts require programming expertise, an IT skill set that is better used on strategic projects that add business value than on maintaining or automating infrastructure components. In general, IT professionals understand what must be done to automate workflow; alas, scripting also requires knowing how to capture that understanding in the proper form.
When pre-packaged scripts come furnished along with some management tool or framework, they tend to be task-oriented rather than process-oriented. This explains why such scripts are not generally usable outside their intended (and narrow) applications, or suitable for adaptation for other environments or purposes. Even more vexing is that site-specific scripted batch jobs draw on tribal knowledge from within the IT work environment. As the employee base changes over time, or individual possessors of such knowledge move into other roles, this can create gaps in understanding and often means that successive generations of throwaway scripts come and go on a regular basis. Likewise, scripts that scan critical events to issue trouble tickets also tend to be highly specialized. Such scripts may even employ proprietary code specific to some particular product or code base. By nature, these scripts work only for a particular vendor or event, and require updates as and when related monitoring tools or Help desk platforms are upgraded, modified, or replaced.
As the operational environment changes so too must all applicable and relevant scripts. New programs and platforms, upgrades to existing systems, and staff responsibilities remain in a constant state of flux, and directly affect interaction within broad and diverse infrastructures. Often, new scripts emerge as soon as old scripts can no longer be modified to accommodate new business logic, processes, or procedures. Ever-changing and sometimes broken scripts can provoke unscheduled and unwanted downtime, increase consultant costs, and burden those experts who create and modify them.
Reality dictates that even the simplest changes can cause negative cascading effects and create problems that may be entirely baffling, except perhaps to a script's original author or its maintainer. Furthermore, change can also introduce new errors, especially when those who pick up and patch legacy scripts are hampered by an incomplete understanding of a script's inner workings. Here again, this poses another perfect opportunity for human error to creep into the management cycle. In fact, the amount of effort required to fix an existing script may occasionally far outweigh whatever benefits it might confer. The sad result of this kind of tail-chasing exercise is sometimes to prevent management of much-needed change, or even the timely application of such change.
Because so few elements in enterprise IT infrastructures are consciously and deliberately designed and configured to operate in a transparent, fully coordinated manner, this poses further obstacles to achieving self-sustaining IT governance. Simply put, there are no established standards for product integration for IT operations management. Too many unrelated tools, tasks, techniques, and technologies are involved that do not directly correlate, interact, or coordinate in a cohesive, centralized fashion.
This creates discrete islands of technology that must be bridged by human efforts and crafty resourcefulness rather than standard tools, objects, and methods. This institutionalized chaos tends to inspire ad-hoc, situation-specific solutions in the context of particular operating environments, platforms, products, and so forth. In such situations, workflow is entirely dependent upon individual efforts and site-specific solutions and usually involves scripted interactions between pairs of specific machines, applications, or services.
The real problem with scripts is that they are static, task-oriented executables not well-suited for automating processes. Such isolated bits of code typically include no knowledge of—or ability to model or otherwise accommodate—process automation. Because scripted code is subject to inherent constraints in its semantic models and offers only limited ability to interact with other systems, script developers must supply the smarts and elbow grease necessary to bring dissimilar systems and services together. This is a sad misuse of IT time and talent that could be better spent working on value-oriented tasks or new capabilities.
Consequently, a constant cycle of change, itself also a largely manual process, is error prone and inconsistent, often creating other complications. Complex hand-offs between various groups and diverse systems within an IT infrastructure introduce thousands of interdependencies, not all of which may be readily apparent, well understood, properly documented, or previously encountered.
Constant, unremitting alerts, application changes, and platform updates affect various levels and tiers of the IT infrastructure from applications and services to networks and systems. These effects also ultimately impact end users and support staff. To put all this into different terms, a diverse enterprise landscape often creates an unacceptable and unnecessary waste of resources for what are essentially basic "lights-on" activities—those constant, monotonous cyclical rituals and routines associated with IT operations and management tasks and procedures.
Initially, frontline IT operators handle general alerts and incidents—they are the "first responders" for any network-based enterprise anomaly or emergency. Out of the entire incident pool, nearly half are escalated to higher level (tier 3) systems administrators and network management. From start to finish, a single escalated incident can incur multiple time and performance penalties, as support personnel field the initial call, then set up bridge lines and conference calls as escalation kicks in and more staff become involved.
For on-call support personnel, this might mean receiving an after-hours page and an unscheduled commute into the workplace (unless the situation can be resolved remotely, which is far too uncommon). Every step down this path consumes staff time, postpones resolution, and delays response time in any number of significant ways. RBA is designed to counter and mitigate these responses and to avoid unnecessary consumption of staff time and resources whenever possible.
Today's business environment alters the level of automation and integration required within IT operations. Traditional automation practices primarily utilize job schedulers to run and monitor batch jobs and scripted tasks to perform essential functions in production environments. However, without tight integration into the surrounding environment, job scheduling is not well-suited to automate operational processes or run book procedures.
Typically, a run book will contain procedures to start, stop, and monitor systems and networks. These procedures translate to processes such as mounting archival storage volumes at some predetermined scheduled backup period. This is the very essence of a workflow. Where scripts and schedulers address small, simple tasks, a run book can scale to handle complex environments. Run books also possess reporting facilities that enable operators to audit progress trails and verify process compliance. As requirements for tracking and verification grow along with increasing demand for new functionality, a script and scheduler-based approach produces a complicated mixture of scripts, programs, and utilities that few will understand. The script and scheduler-based approach itself becomes a full-time programming commitment and time-consuming management burden. By contrast, RBA takes a workflow-oriented approach to automation; embraces all kinds of systems, platforms, and protocols; and treats auditing, tracking, and reporting as core components in its operation.
RBA encompasses the ability to define, build, coordinate, and manage products, processes, and procedures implemented to improve and enhance operational IT efficiency. An RBA process can include all management disciplines and interact with various forms of infrastructure elements. Such infrastructure elements include endpoint applications, server services, and platform hardware. RBA bridges multiple application, data, and departmental silos through an IT-defined workflow that tightly integrates the people, processes, and technologies involved through well-defined and well-documented operational procedures.
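To make the contrast with standalone scripts concrete, here is a minimal sketch of a run book that treats auditing as a built-in concern rather than an afterthought. It is an assumption-laden illustration, not the design of any RBA product: the `RunBook` class, the procedure names, and the operator names are hypothetical, and the procedures are placeholders that do no real work.

```python
from datetime import datetime, timezone


class RunBook:
    """A named collection of operational procedures with an audit trail."""

    def __init__(self):
        self._procedures = {}
        self.audit_trail = []

    def procedure(self, name):
        """Decorator that registers a function as a run book procedure."""
        def register(func):
            self._procedures[name] = func
            return func
        return register

    def run(self, name, operator, **params):
        """Execute a procedure and record who ran it, when, and how it ended."""
        entry = {
            "procedure": name,
            "operator": operator,
            "started": datetime.now(timezone.utc).isoformat(),
        }
        try:
            entry["result"] = self._procedures[name](**params)
            entry["status"] = "ok"
        except Exception as exc:  # record the failure instead of losing it
            entry["result"] = None
            entry["status"] = f"failed: {exc}"
        self.audit_trail.append(entry)
        return entry


book = RunBook()


@book.procedure("mount_archive")
def mount_archive(volume):
    # Placeholder: a real procedure would call the storage platform here.
    return f"{volume} mounted"


@book.procedure("check_service")
def check_service(service):
    # Placeholder that always fails, to show how errors reach the audit trail.
    raise RuntimeError(f"{service} is not responding")


book.run("mount_archive", operator="scheduler", volume="backup01")
book.run("check_service", operator="alice", service="mail")
for entry in book.audit_trail:
    print(entry["procedure"], entry["operator"], entry["status"])
```

Every execution, successful or not, leaves a record of the procedure, the initiator, and the outcome, which is the kind of progress trail an operator would need in order to verify process compliance.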
Figure 1.3: RBA lets IT define workflows, then maps a coherent, global view of each workflow into all necessary local and specific instructions to platforms, services, and hardware.
With RBA, an enterprise can:
According to a recent Gartner study, the growth of RBA coincides with the need for IT executives to deliver higher efficiency to their workflows, typically by decreasing Mean Time to Repair (MTTR) and increasing Mean Time Between Failures (MTBF), two measures that describe the reliability and success of any IT operation. Incident response time is also a crucial factor for IT momentum; while many environments employ advanced monitoring and performance tracking solutions, they embed them within largely manual processes that can't help but produce compromised turnaround timeframes because of the human interaction (and delays) that are perforce involved in such systems. In strong contrast, RBA automates the diagnostic and resolution steps to accelerate the overall incident resolution process.
RBA addresses critical customer requirements by providing a visually guided step-based automation process. Workflows may begin using multiple modes of initiation, including by an operator, by automatic kick-off, or according to a specific schedule. RBA uses simple, intelligible building blocks to provide an intuitive authoring platform so that seasoned IT professionals can pick up its tools and put them to work immediately, even if they have never written a line of code or script. RBA building blocks include simple drag-and-drop workflow visualization tools, wizard dialogs to help capture workflow element details, a wealth of out-of-the-box content and templates, as well as an ability to incorporate existing scripts. Integrated filtering and flow logic enables quick, intelligent decision-making with facilities to document, report, and create audit trails for workflow implementation. Finally, the entire framework resides within a role-based security environment so that only authorized and authenticated IT users can access or manipulate workflow descriptions, data, or reports.
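For readers who want to see how these two measures are derived, the following sketch computes them from an incident log. It assumes one common convention (MTTR as total downtime divided by the number of failures, MTBF as total uptime divided by the number of failures) and uses a made-up log; the function name and data are illustrative and do not come from the Gartner study.

```python
from datetime import datetime, timedelta


def mttr_and_mtbf(incidents, period_start, period_end):
    """Return (MTTR, MTBF) as timedeltas for non-overlapping incidents.

    incidents: list of (failed_at, restored_at) datetime pairs inside the period.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    downtime = sum((restored - failed for failed, restored in incidents), timedelta())
    uptime = (period_end - period_start) - downtime
    count = len(incidents)
    return downtime / count, uptime / count


# Hypothetical incident log for a 30-day period.
start = datetime(2024, 1, 1)
end = start + timedelta(days=30)
log = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 13, 0)),    # 4 hours down
    (datetime(2024, 1, 20, 22, 0), datetime(2024, 1, 21, 0, 0)),  # 2 hours down
]

mttr, mtbf = mttr_and_mtbf(log, start, end)
availability = mtbf / (mtbf + mttr)  # fraction of the period the system was up
print(mttr, mtbf, round(availability, 4))
```

Shortening repairs lowers MTTR directly, which raises the availability ratio even if the number of failures stays the same; this is the lever that automated diagnosis and resolution is meant to pull.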
RBA solutions mirror the feature and capability sets required from mundane job schedulers. To that base level of functionality, however, RBA confers the added benefits of a flexible, scalable framework that delivers more and more advanced functionality. RBA software automates administrative, business, and maintenance processes that include data backups, log rotation, service restarts, file deletions, storage mounts, and email reports. RBA can also initiate numerous independent jobs in parallel across multiple machines, coordinate between concurrent tasks, create or update user account information, interact with database systems, and transfer archival data. And unlike scripted batch jobs, RBA meets typical enterprise network reliability and availability requirements such as load-balancing, failure and failover routines, error handling and reporting, and audit trail logging.
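As a rough illustration of what running independent jobs in parallel with error handling and an audit record involves, consider the sketch below. It is not an RBA implementation: the host names and the `rotate_logs` job are hypothetical placeholders, and a thread pool on one machine stands in for real remote execution.

```python
from concurrent.futures import ThreadPoolExecutor


def rotate_logs(host):
    """Placeholder job; a real one would connect to the host."""
    if host == "db02":
        raise ConnectionError("host unreachable")
    return "logs rotated"


def run_everywhere(job, hosts, max_workers=4):
    """Run `job` for every host in parallel and collect an audit record per host."""
    audit = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(job, host) for host in hosts}
        for host, future in futures.items():
            error = future.exception()  # waits for the job to finish
            audit[host] = {
                "ok": error is None,
                "detail": str(error) if error else future.result(),
            }
    return audit


audit = run_everywhere(rotate_logs, ["web01", "db01", "db02"])
for host, record in audit.items():
    print(host, record)
```

A failure on one host is captured in that host's record and does not stop the jobs on the others, which is the behavior a scripted batch job typically has to be hand-coded to provide.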
Furthermore, RBA provides real agility and flexibility. These characteristics are essential to support a constantly and dynamically changing business environment in a way that allows operations staff to design, implement, manage, and maintain their own processes and workflow without bringing programming expertise or consulting services into the mix.
Three other essential characteristics of RBA include integration, orchestration, and workflow:
RBA tools must also enable and provide multiple rapid resolution functions, such as:
These features of RBA tools enable fast automated response to alerts and incidents with full documentation, reducing reliance on tribal knowledge. They also capture real-time information related to diagnostics, remediation, and root-cause correction and determination. Over time, this data stream helps many enterprises improve upon existing incident handling workflows and create new ones as they're needed. In more conventional incident handling, as frontline IT operators dispatch and handle ongoing incidents, approximately half are routed to tier-3 systems administrators and network management for input and intervention. This often ropes support personnel into hours of meetings and conference calls and can extend to after-hours and weekends. The whole purpose of RBA is to transform such an ad-hoc, largely hands-on, tribal knowledge-oriented process into a collection of streamlined, self-maintaining workflow procedures that involve only minimal human interaction, and reduce the need for escalation and escalating staff time and resources.
There are several crucial aspects and elements typical of RBA operation. Realizing an adequate and complete technology solution that provides pervasive, comprehensive coverage—core architectural underpinnings to IT infrastructure management—is what makes RBA both possible and practical.
The need for greater IT efficiency and agility (freeing up that 60% of overstretched resources) fuels executive management's demands for IT optimization solutions. This is where RBA can help IT achieve better alignment with business objectives. RBA does this by leveraging existing infrastructure investments to create a more agile, dynamic infrastructure that automates, integrates, and orchestrates IT operations without requiring increased numbers or skill sets in IT staffing.
Cataloging each aspect of the information system infrastructure is usually handled through a Configuration Management Database (CMDB), a repository of such related components. CMDB, which stems from the Information Technology Infrastructure Library (ITIL) context, helps an organization define and understand the relationships between various components and track their configuration parameters—a fundamental component of the ITIL configuration management process.
ITIL is a set of concepts and techniques for managing IT infrastructure, development, and operations, published in a series of books that cover individual management topics. ITIL is a registered trademark of the United Kingdom's Office of Government Commerce (OGC), and gives a detailed description of a variety of important IT practices.
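The role of a CMDB, tracking components, their configuration parameters, and the relationships between them, can be pictured with a small sketch. The configuration items, their attributes, and the `impacted_by` helper below are hypothetical and greatly simplified; a real CMDB is a managed repository, not a pair of in-memory structures.

```python
from collections import deque

# Hypothetical configuration items (CIs) and their tracked parameters.
CONFIG_ITEMS = {
    "db01": {"type": "server", "os": "Linux"},
    "orders-db": {"type": "database", "engine": "PostgreSQL"},
    "orders-app": {"type": "application", "version": "2.3"},
    "web-portal": {"type": "application", "version": "5.1"},
}

# Relationships as (dependent, depends_on) pairs.
DEPENDS_ON = [
    ("orders-db", "db01"),
    ("orders-app", "orders-db"),
    ("web-portal", "orders-app"),
]


def impacted_by(item, relationships):
    """Return every CI that directly or indirectly depends on `item`."""
    dependents = {}
    for dependent, target in relationships:
        dependents.setdefault(target, []).append(dependent)
    seen, queue = set(), deque([item])
    while queue:
        for dependent in dependents.get(queue.popleft(), []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


print(sorted(impacted_by("db01", DEPENDS_ON)))
```

Walking the relationships in this way answers a question that matters to automated incident handling: if one component fails or changes, which other components are affected.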
Central to the RBA technology solution is the ability to gather, analyze, and draw conclusions or make decisions based upon defined IT processes that make businesses work. A crucial factor in the successful design and execution of RBA is the ability to automate processes across a diverse set of hardware and software platforms that share few common traits but many similar responsibilities. Unlike simple ITPA, RBA entails visual scripting to manipulate objects using simple, intelligible building blocks and wizards to ensure that each building block contains all the necessary parameters and information, all without any need for explicit programming or scripting.
RBA tools bridge the digital divide between the diverse platforms that populate enterprise networks, and the various parameters that configure and control them, by creating a common toolset for collecting and analyzing information pooled from any number of sources. Hence, RBA tools can also help enterprises to employ a complete perspective on all interactions between each aspect of their business environments—in many cases, for the first time ever.
ITPA solutions consist primarily of three components: orchestration, integration, and automation. These combined elements enable a more hands-free IT organization that can be largely automated, especially across a heterogeneous infrastructure.
The 80/80 rule states that an average of 80% of the IT budget is spent on maintaining existing systems—primarily, the same basic problems within and among those systems—that nearly all organizations encounter. There isn't much difference from one enterprise to the next when persistent problems are common to every organization. The biggest impact of this debacle is that it leaves such a small portion of the budget for new IT initiatives.
In the same vein, 80% of downtime results from human error, owing primarily to a lack of vendor integration and enterprise tie-ins among various platforms. The manual intervention this requires introduces human latencies in service delivery and creates far too much opportunity for errors of omission or commission to occur. The key to reducing this operational overhead and needless waste can come only from better task, service, and resource prioritization, and improved usage of resources, which are exactly what RBA provides.
As a working example, a global service provider—with a focus on large-scale mobile, high-speed Internet access—operates a distributed network of more than 35,000 devices for hundreds of thousands of customers. Growth rates of 100% per year are typical, and their diverse application environment includes many hundreds of servers across diverse platforms in hundreds of locations around the world.
The company seeks to use a network automation system that can scale with its explosive growth and provide remote automation capability for all devices in the network that would be easy to administer from a single location. Another key requirement calls for a Web-based UI that any browser can access from any location, because installing client applications on hundreds of administrator workstations is not feasible, nor can their client run-time environments accommodate a new Java client without creating conflicts with other critical applications.
Other challenges include plans to bring up several thousand mobile high-speed sites in the next planning year. Following the waterfall model, this company's new sites begin their lives in the hands of the Design team, which architects how the network is to be deployed, including specs for network design, hardware, and basic configuration recommendations. Next, the site is passed to the Deployment team, which handles configuration and rollout of physical devices and plant. After that, Deployment passes the site to an Operations team, where it joins a plethora of other sites already under their management umbrella. Operations staff includes network operators and engineers who handle all maintenance, management, and monitoring going forward.
The implementation of RBA helped this service provider increase the speed at which new site rollouts occur, while keeping operation costs relatively flat. By controlling device configurations carefully and automating wherever possible, sites are kept compliant with corporate standards as they move from Design to Deployment to Operations. Once operational, site maintenance and management are easier, faster, and less error-prone, primarily thanks to minimal opportunities for human error to occur. Bottom-line result: automation makes a single engineer as productive today as five engineers were under the old, manual hand-off regime.
Currently, the top three categories of systems management tools include:
An organization's run book is defined by a written set of standard operating procedures relative to the operation of systems or networks by the administrator and supportive operator roles. It contains procedures for starting, stopping, and monitoring systems and networks, handling special requests and dispatching exceptions to better meet or exceed IT goals and reflect changing business needs.
As modern organizations now possess up to thousands of applications and application servers, rely on larger globalized networks, and retain higher amounts of data than ever before, it's vital to have a comprehensive and efficient manner of managing and maintaining the environment. Analysts at ADC and Gartner predict an unwavering upward growth trend as business adapts and expands to a growing global economy that increasingly relies on global resources. As such, both the IT infrastructure and its support personnel are increasingly asked to do more with less, add value to the existing business space, run IT like a business, and minimize headcount additions. This perpetual growth trend gives rise to the emergent RBA market space.
Modern organizations possess more applications and servers than ever before. Operating footprints are becoming increasingly dense, with more businesses turning to repurpose and better utilize existing infrastructure resources—particularly in the realm of virtualization technologies. More hardware platforms are achieving more goals through strategic and logical partitioning of resources in an ever-growing need to expand operational capacity without impact to operational footprints. Essentially, the IT environment is striving to do more with less.
Modern organizations also rely upon larger global networks—disparate, disjointed regional and territorial networks are merging into much bigger national and international entities. And with business-to-business implications, these ever-expanding network environments are also joining forces—be they temporary or permanent—with companion networks and partner companies. Comprehensive management over a single network landscape with all its intricate resources is a delicate issue alone, never mind the often conflicting and incompatible issues that arise between several different network environments leveraging all kinds of interoperable and exclusive technology solutions.
With this dynamically changing landscape comes an increasing demand on data retention and processing needs. Companies quickly outgrow the vast digital storehouses kept on-site and turn to larger, denser storage volumes and technological advances. In any given enterprise, it's common to find network-attached storage (NAS) devices of several terabytes each that together hold petabytes' worth of business-related information. How, then, does one manage this overwhelming data in a centralized, comprehensive manner?
What does all this mean in practical terms? Like virtualized infrastructure, IT staffers are expected to do more with less: to better prioritize responsibilities and ration time, reduce wasted effort on the front line, increase their presence, and manage the IT infrastructure like its own business entity, all without hiring more personnel to offset the current burden.