The 4 Roadblocks of Data Preparation

Why does data preparation take so long?

When they're moving at the speed of business, data analysts have limited time to spend analyzing data before it goes stale. If analysts are forced to move slowly, they risk arriving too late to capitalize on potentially profitable transactions, investments, customer marketing opportunities or social media events.

Why should business analysts have to sacrifice precious time for data preparation tasks like accessing, cleansing, normalizing and blending disparate data sets? Ventana Research reports that companies using predictive analytics spend 40 percent of their time preparing data for analysis and 22 percent accessing the data — the least gratifying parts of the analytic process. Considering that those tasks are not as essential to decision making as building and deploying models, it's no wonder that companies see them as bottlenecks. Similarly, Blue Hill Research reports that most analysts spend 40 to 60 percent of their time preparing data, and whatever time is left over on analysis.

This e-book examines the four most common roadblocks posed by data preparation, with potential solutions for overcoming them so that business analysts can spend more time on data analysis. Readers will take away insights into overcoming the roadblocks in their own organization.


Wide variety of data sources

Many analysts find themselves with more data sources than they can easily manage. The diversity of data sources per se is not a roadblock, but the process of deciding which ones to use and then finding out how to access each one becomes a roadblock.

Three trends contribute to data preparation bottlenecks:

  1. Analysts possess training in particular systems but are expected to work in many others. Writing a SQL statement against an Oracle or SQL Server database may be second nature to analysts trained in those environments. But when asked to start working in a different type of data environment, such as Hadoop or a NoSQL database, and deliver useful content, they may be completely lost.
  2. New, unstructured data has come into the picture. Analysts may be able to find their data easily in traditional sources: enterprise applications, relational databases and other structured sources with the use of APIs or interchanges. However, new sources such as Hadoop and social media streams are areas where many analysts are still acquiring skills.
  3. A hybrid IT role has emerged: shadow IT. Increasingly, business analysts are being thrown into an IT role. While competent and knowledgeable in disciplines like finance, marketing and manufacturing, they are not trained in IT, so they struggle with systems and IT infrastructure to get the information they need. With little or no technical training, they have trouble determining which data sources are most helpful and how to access them for data prep.
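The skills gap in the first two trends can be made concrete with a minimal sketch. This is a hypothetical example with invented table and field names: the same "revenue by region" question is answered against a relational source, where one SQL statement does the aggregation, and against a semi-structured JSON source, where the analyst must flatten and aggregate the records in code.

```python
import json
import sqlite3
from collections import defaultdict

# --- Relational source: one SQL statement does the aggregation ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("West", 250.0), ("East", 50.0)])
sql_result = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# --- Semi-structured source: no query engine, so the analyst
#     aggregates the raw records by hand ---
raw = '''[{"region": "East", "amount": 100.0},
          {"region": "West", "amount": 250.0},
          {"region": "East", "amount": 50.0}]'''
json_result = defaultdict(float)
for record in json.loads(raw):
    json_result[record["region"]] += record["amount"]

print(sql_result)         # {'East': 150.0, 'West': 250.0}
print(dict(json_result))  # {'East': 150.0, 'West': 250.0}
```

Both paths arrive at the same numbers, but the second requires programming skills that a SQL-trained analyst may not have, which is exactly where the bottleneck forms.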


How to Overcome This Barrier?

While complex data ecosystems are here to stay, most organizations can deploy tools to deal with them.

When evaluating a data preparation solution, look for a tool that delivers it through a graphical user interface (GUI). A GUI gives the business analyst a packaged way to interact with the data environment, including capabilities like drag-and-drop to display relationships among data points. Rendering the data graphically helps users understand how different platforms deliver their output; seeing those differences lets the data preparer reconcile them and combine the data effectively.


The GUI allows these experts and novices a canvas on which to scale and share insights across the enterprise. The data preparation diagrams they create are more readily explained, modified and shared.

Too many tools

Currently, working with a variety of data sources requires multiple tools, forming the next roadblock.

The analyst may access data from an enterprise application using a vendor-supplied, specially built tool. If the data resides in a relational data mart or warehouse, the analyst turns to scripting or SQL coding to access and query it. And if the data must be pulled from an unstructured source such as a Hadoop data lake or a NoSQL store, the analyst needs yet another set of access tools designed specifically to interact with those newer sources.

Jumping among the dozen or so tools needed just to access and prepare data complicates the process and hinders productivity. In most cases, analysts use multiple tools because they were trained by other analysts who have learned to fish needed data out of several different platforms.

Then, different data platforms output data differently. Whether the differences are in the formatting or the values, the user must reconcile them just to see all the data in the same place.

Finally, once the data is pulled from disparate sources, there are limits to the analysis that can be performed on it. Most of the time, the data ends up in a spreadsheet, and the tedious, error-prone task of building combined tables begins. That process decouples the data from its source, with no way to recouple it after the analyst has done the work of reconciling and combining.
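The reconciliation work described above can be sketched in a few lines. This is a hypothetical example, assuming a CRM export that uses ISO dates and uppercase IDs and an ERP export that uses US-style dates and lowercase IDs; all field names are invented for illustration. Each format difference becomes manual normalization code the analyst must write and maintain.

```python
from datetime import datetime

# Two sources describe the same orders, but with different key casing
# and date formats (hypothetical schemas).
crm_rows = [{"id": "A-100", "closed": "2023-04-01"}]
erp_rows = [{"order_id": "a-100", "close_date": "04/01/2023", "amount": 99.5}]

def normalize_crm(row):
    # Lowercase the ID and parse the ISO-format date
    return {"id": row["id"].lower(),
            "closed": datetime.strptime(row["closed"], "%Y-%m-%d").date()}

def normalize_erp(row):
    # Rename the key field and parse the US-format date
    return {"id": row["order_id"].lower(),
            "closed": datetime.strptime(row["close_date"], "%m/%d/%Y").date(),
            "amount": row["amount"]}

# Join on the normalized key to get one combined record per order
erp_by_id = {r["id"]: r for r in map(normalize_erp, erp_rows)}
combined = [{**normalize_crm(c), "amount": erp_by_id[c["id"].lower()]["amount"]}
            for c in crm_rows if c["id"].lower() in erp_by_id]
print(combined)
```

Once this hand-built join runs, the result is cut off from both sources: a refresh upstream means rerunning the whole process, which is the decoupling problem the text describes.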

How to Overcome This Barrier?

The problem of too many tools seems insurmountable in a world of multiple data sources. To insulate the business from the diverse data landscape, look for a tool that allows access to a variety of data sources.

Getting beyond individual platforms means that upkeep of the data connectors becomes the task of the vendor and not the customer.

When those vendors quickly support new and emerging data sources, the platform-agnostic solution will treat all sources equally, effectively and at a lower total cost of ownership (TCO).


Data governance and compliance

The analyst's quest for data from as many sources as possible challenges security, governance and compliance requirements imposed by corporate IT and sometimes the law.

Analysts need to access and keep copies of data they've combined in their own personal sandbox so they can massage it and perform their analytic magic. That puts them at odds with corporate IT, whose job is to know where data is and ensure that it is not being misused or misplaced in a way that increases the risk of it falling into the wrong hands. But that same corporate governance can hamper the free-range access to needed data or impede analysts from executing their projects in time to capitalize on business opportunities.

IT's perspective is not always grounded in governance alone. Sometimes, IT's workload and timeline simply differ from those of the analyst. For example, prudent data governance sometimes dictates that the data warehouse be updated on the weekend rather than in real time.

That conflict in priorities has led to the rise of shadow IT. As predictive analytics moves downstream from data scientists to line-of-business managers, more people ingeniously find alternative methods of getting the data they need, and fewer rely on IT for it. That free-range access leads to free-range data and tends to create more data complexity, sometimes resulting in greater levels of inaccuracy.

Data governance affects three different business issues:

  • Accessibility — Data access needs to be role-based as well as needs-based, so that the right person can access and interact with the appropriate data sources.
  • Completeness — Data access based on roles can be limiting, but corporate IT can grant cross-source data access to the appropriate audience, based on proper authorization, security, monitoring and, most important, business use case.
  • Accuracy — The more sources are intermingled, the more complex their outputs become. That affects accuracy, because sources are refreshed at different rates. Again, the communication between corporate IT and analysts needs to be open, and the process needs to be both agile and governed. While that may seem like a paradox, it is in fact the key to successful data access and preparation.

It becomes an issue of both governance and compliance when analysts combine knowledge of their specific domain (Human Resources, for example) with shadow IT practices to patch up incomplete data on their own.


How to Overcome This Barrier?

The issues of governance, security and compliance can be resolved when corporate IT and analysts adopt collaboration as their primary objective. If the two teams work toward a common goal, the aforementioned issues resolve themselves.

IT and the business analysts can rally behind a business use case as a common goal. Ideally, this business use case delivers the changes that benefit and optimize the organization while using all the IT and data assets that have been curated by corporate IT.


Too many manual processes

Many analysts spend their time gathering and documenting requirements, accessing data from multiple sources, combining the data into business-focused sets and subsets, and delivering cleansed data and accurate analysis.

Usually, this work serves as the foundation of enterprise reports, dashboards and metrics. Too often, however, business analysts spend their time accessing and combining data, and not enough of it analyzing and understanding what the data is saying.

The manual tasks add up to a great deal of work, because analysts do not feel that corporate IT is equipped to deliver the needed data. As the world of analytics has evolved toward greater insight, data analysts are much more interested in the questions they can ask than the answers they receive. Data exploration has become a more important part of the job.

However, data analysts are well served if at least part of the process of mashing up data is automated. IT can automate many of the processes for creating enterprise and analytic reports for the business analyst. For example, month-end reports always pull from the same data sources and answer the same questions. IT can use a powerful data preparation platform to automate the data access and combination for those reports.

The queries are straightforward and never deviate from month to month. Even special queries can be automated. If analysts understand that they are combining the same sets of data repeatedly, IT can design systems accordingly. This special set of "super" data can be loaded into a memory store and refreshed frequently, then analysts can access it in a controlled, accurate and easy-to-use environment.
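A minimal sketch of what that automation might look like, assuming a fixed month-end revenue query against an illustrative sales table (the schema and function name are invented for this example). In practice the function would be triggered by a scheduler or by the data preparation platform's own job facility rather than run by hand.

```python
import sqlite3
from datetime import date

def month_end_report(conn, year, month):
    """Run the fixed month-end revenue query for the given month.
    The query text never changes; only the date window is parameterized."""
    start = date(year, month, 1).isoformat()
    end = (date(year + 1, 1, 1) if month == 12
           else date(year, month + 1, 1)).isoformat()
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales "
        "WHERE sale_date >= ? AND sale_date < ? GROUP BY region",
        (start, end)).fetchall()

# Illustrative data for the sketch
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL, sale_date TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("East", 100.0, "2023-03-05"),
                  ("East", 40.0, "2023-03-28"),
                  ("West", 75.0, "2023-04-02")])
print(month_end_report(conn, 2023, 3))  # [('East', 140.0)]
```

Because the query and its parameters are fixed, IT can run it unattended and publish the result to the shared, refreshed store described above, rather than have each analyst repeat the pull.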

How to Overcome This Barrier?

The most efficient method for overcoming these particular challenges focuses on building agility into data preparation. Consider these three approaches:

  • Organizations can deploy software to support the heterogeneous data sources that spring up every day.
  • They can then build in user support and collaboration between business analysts and corporate IT. This change in process facilitates communication between internal and external stakeholders.
  • Management should drive the business to be more autonomous with data preparation activities, demanding both transparency and understanding of the metrics, reports and dashboards that analysts and IT deliver. Management should ask to have the content explained so that decisions are made on content that is relevant, up to date and meaningful.


Conclusion

Data analysts add the most value when they focus on deriving insight from data, not when they spend their time figuring out how to access data from multiple sources.

The roadblocks of data preparation nudge analysts into manual, error-prone tasks better suited to IT.

To get to the heart of the roadblock, analysts should try to reduce the number of tools they use and standardize on ones that automate data preparation. The emphasis is on working quickly with a variety of data sets and integrating multiple data sources. The right tool can help remove the main roadblocks — too many data sources, too many one-off tools, data governance and compliance, and too many manual processes — and free up analysts to spend more time working with data instead of preparing it.

About Quest

Quest helps our customers reduce tedious administration tasks so they can focus on the innovation necessary for their businesses to grow. Quest® solutions are scalable, affordable and simple-to-use, and they deliver unmatched efficiency and productivity. Combined with Quest's invitation to the global community to be a part of its innovation, as well as our firm commitment to ensuring customer satisfaction, Quest will continue to accelerate the delivery of the most comprehensive solutions for Azure cloud management, SaaS, security, workforce mobility and data-driven insight.