As a business intelligence practitioner, I’ve spent countless hours trying to deliver the ever-elusive ‘single version of the truth.’ In practice, that means sourcing data, addressing quality issues, integrating disparate sources, enforcing key business rules, curating master data and trying to get it all to happen consistently within a batch update window so the business has the information it needs to operate.
While the tools and practices have improved and adapted to become more agile over the years, let’s face it: adding a new data set to the warehouse doesn’t happen fast if you put it through even the basics of these practices. Never mind the fact that the request is likely queued up behind other requests the data warehouse team is handling.
So what do you do? If you’re a business analyst, you just bypass the data warehouse team and bring your own data (BYOD). Not a problem, right? Sometimes you pull from the warehouse, sometimes the source systems, sometimes from a spreadsheet. Data issues? You fix them yourself. Need data enrichment? You enrich the data yourself. What’s the worry?
There’s validity to allowing analysts to BYOD. Some business decisions can’t wait. For many questions, data that is ‘close enough’ is good enough. But the downside is twofold. First, analysts often don’t have the tools they need to deal with the data efficiently. Second, you very often get duplicated effort among analysts and inconsistencies between data sets, so you can never be sure whether one analysis is really comparable to another. Did the number really change? Or just the way the data was prepared?
So how do you balance the need for dependable data against the need for decision speed? You need to define what correct means, manage the Data Maturity Lifecycle, and enable analysts within the process.
What is correct?
First things first. Getting the data ‘correct’ isn’t a one-size-fits-all affair. The meaning of ‘correct’ depends on the requirements. Different aspects of ‘correct’ are trade-offs against each other, and very few data sets require hitting all of these aspects at once. Getting the data correct can imply:
Accuracy: The number reported accurately reflects what happened. Accuracy tends to be a major concern for regulated reporting.
Timeliness: The amount of time that can elapse between an event and the associated data being reported. Timeliness tends to be a major concern when a lag in decision making has a significant effect.
Consistency: The number reported is consistent with source systems, related business systems and with numbers previously reported. This is both in terms of reporting a consistent value and having a consistent definition. Any comparative analysis (year-over-year) demands consistency.
Quality: Quality can be thought of as a subset of accuracy, but it involves some specialised concepts. Quality encapsulates things like getting addresses correct and ensuring that fields follow business rules, data is not duplicated and records aren’t orphaned. If actions are directly affected by data quality, this becomes a higher concern.
Performance: Can I get a response to my question before I forget what my question was? As more people use the system, will performance scale? This is one of the largest efforts in warehousing: modelling and tuning the data to ensure performance.
Security: After working so hard to make data available, you also have to make sure it isn’t ‘too available.’ This ranges from helping users cut through the noise to preventing insider trading and other legal issues.
For any data set, everyone needs to be clear on what ‘correct’ means. And if the business doesn’t require one of these attributes, don’t impose it unnecessarily. It will slow you down in meeting requirements.
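One way to make that agreement explicit is to record, per data set, which of the attributes above the business actually requires. The following is a minimal Python sketch of that idea. The `CorrectnessProfile` class, its field names and the two example data sets are hypothetical illustrations, not part of any particular BI tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class CorrectnessProfile:
    """Hypothetical record of which aspects of 'correct' a data set must meet."""
    accuracy: bool = False
    timeliness: bool = False
    consistency: bool = False
    quality: bool = False
    performance: bool = False
    security: bool = False

    def required(self) -> List[str]:
        # Names of the attributes the business has actually asked for.
        return [f.name for f in fields(self) if getattr(self, f.name)]


# Assumed examples: a regulated report and a fast-moving dashboard
# need different things, so neither is held to all six attributes.
regulatory_report = CorrectnessProfile(accuracy=True, consistency=True, security=True)
campaign_dashboard = CorrectnessProfile(timeliness=True, performance=True)

print(regulatory_report.required())
print(campaign_dashboard.required())
```

Anything left as `False` is a deliberate decision not to impose that attribute yet, which keeps the trade-offs visible to both the warehouse team and the analysts.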
Agility and maturity
The first mistake is trying to apply ‘too much correctness’ to a data set when not required. The second mistake is trying to get there all at once. One of the key lessons I’ve taken from my agile programmer colleagues is the idea of a ‘minimum viable product.’ Basically, the idea is that you don’t have to get to the final state all at once. You just need to deliver enough value now to meet the immediate requirement, and then mature the implementation in response to user feedback. What does this mean with data?
When working with new data sets, requirements are volatile. The analysts are still learning what they can do with the data. I take the requirements they have, and I try to implement them with the least amount of overall ‘processing’ in the initial pass. I will do as little transformation, quality, modelling, tuning, metadata building and other data work as possible. I draw the shortest path from data source to initial view of the data for my end users and deliver it. Direct access to source system data? If it works for the initial requirement, yes.
It will take some sophistication to help end users understand which data sets are ‘fully fledged’ and which ones are ‘still baking,’ but the agility it brings is worth it. As the requirements mature, I mature the implementation of the data set. In such an approach, you initially sacrifice aspects like scalability or fully conformed dimensions, but gain agility in meeting emerging business needs. Yes, you’ll need to go back in future iterations to address scalability, conforming, etc., but you’ll be more informed by actual usage when you do. You’ll also feel better when the analysts see something for the first time and say ‘Yeah … that wasn’t it.’
So what do you do about business analysts bringing their own data? I mean, we’ve been fighting off spreadmarts for decades now, right? Well, actually …
Business analysts become more data savvy every day. The reality is that the warehouse team should focus on delivering the high-value, hard-to-deliver data sets to support analysts. Then, we should get out of their way so they can use the tools and techniques they prefer to conduct their analyses. And, when needed, they should be able to bring their own data. In fact, they serve as a kind of R&D department for the warehouse team to identify emerging requirements.
So how do you avoid ‘bad data’ from ‘untrusted sources’? With BYOD, the focus should shift away from the data to who is bringing the data. Business leaders, with some ‘advice and consent’ from the BI team, should identify those analysts who have the skills, business knowledge and past performance that indicate they can be trusted to combine standard warehouse data with new data sources, advise business leaders on how the data can be used given concerns like accuracy, and generally be trusted to deal with data at a sophisticated level. They become ‘certified’ and are given the authority (and responsibility) for authoring new data sets. And as their data sets mature, we bring them into the warehouse.
And what about the ‘uncertified’ analysts? Can they create data? Sure they can. But the organisation should be mature enough to recognise that the data set being presented should be treated with a higher-than-normal level of skepticism. It could still be valuable and lead to some great insight, but may need one of the certified analysts to follow up before making a big decision based on that information.
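The distinction between warehouse data, data sets from certified analysts and data sets from everyone else can be surfaced to readers as a simple label. The sketch below is one hypothetical way to do that in Python; the `Trust` enum, the `trust_level` function and the label wording are assumptions made for illustration only.

```python
from enum import Enum


class Trust(Enum):
    """Hypothetical labels shown alongside a data set."""
    WAREHOUSE = "fully fledged: matured into the warehouse"
    CERTIFIED = "still baking: authored by a certified analyst"
    UNCERTIFIED = "unreviewed: have a certified analyst follow up before big decisions"


def trust_level(in_warehouse: bool, author_certified: bool) -> Trust:
    # The label depends on who brought the data and how far it has matured,
    # not on an inspection of the data itself.
    if in_warehouse:
        return Trust.WAREHOUSE
    if author_certified:
        return Trust.CERTIFIED
    return Trust.UNCERTIFIED


print(trust_level(in_warehouse=False, author_certified=True).value)
```

An uncertified data set is still published under this scheme; the label only tells the reader how much follow-up it deserves.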
Many organisations get stuck in either single-version-of-the-truth fervour or wild-west spreadmart hell, neither of which ultimately serves their needs. The trick is to balance high-quality data services with highly agile BYOD practices.
Sourced from Charles Caldwell, director of solutions engineering and principal solutions architect for Logi Analytics