There is no doubt that self-service BI tools have well and truly taken root in many business areas, with business analysts now in control of building their own reports and dashboards rather than waiting on IT to develop everything for them. Using data discovery and visualisation tools like Tableau, QlikView, TIBCO Spotfire, MicroStrategy and others, business analysts can produce insights and publish them for information consumers to access via any device and subsequently act on. As functionality deepens in these tools, most vendors have added the ability to access multiple data sources so that business analysts can ‘blend’ data from multiple sources to answer specific business questions. Data blending is effectively lightweight data integration. However, this is just the start of it in my opinion. More and more functionality is being added to self-service BI tools to strengthen this data blending capability, and I can understand why. Even though most business users I speak to would prefer not to integrate data (much in the same way they would prefer not to write SQL), the fact of the matter is that the growing number of data stores in most organisations is forcing them to integrate data from multiple sources. They may need data from multiple data warehouses, from personal and corporate data stores, from big data stores, from a data mart, or some other combination. Given this is the case, what does it mean in terms of impact on enterprise data governance? Let’s take a look.
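To make the idea of data blending as lightweight integration concrete, here is a minimal sketch in plain Python. The sources, field names and figures are entirely hypothetical; a real BI tool would do this interactively, but the underlying operation is just a join followed by an aggregation:

```python
# Hypothetical source 1: sales transactions from a corporate data warehouse
sales = [
    {"cust_id": "C001", "amount": 250.0},
    {"cust_id": "C002", "amount": 120.0},
    {"cust_id": "C001", "amount": 75.0},
]

# Hypothetical source 2: customer attributes from a departmental data store
customers = {
    "C001": {"name": "Acme Ltd", "region": "EMEA"},
    "C002": {"name": "Globex", "region": "APAC"},
}

def blend(sales, customers):
    """Join sales to customer attributes and total revenue per region."""
    totals = {}
    for row in sales:
        cust = customers.get(row["cust_id"], {})
        region = cust.get("region", "UNKNOWN")
        totals[region] = totals.get(region, 0.0) + row["amount"]
    return totals

print(blend(sales, customers))  # {'EMEA': 325.0, 'APAC': 120.0}
```

The point is not the code itself but that every such blend quietly embeds integration decisions (join keys, default values, aggregation rules) that governance would normally want to capture.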
When you look at most enterprise data governance initiatives, they are run by centralised IT organisations with the help of part-time data stewards scattered around the business. Most start with a determined effort to standardise data, typically by establishing common data definitions for core master data, reference data (e.g. code sets), transaction data and metrics. Master data and reference data in particular are regularly the starting point for enterprise data governance, because this data is so widely shared across many applications. Once a set of common data definitions has been defined (e.g. for customer data), the next step is to discover where the disparate data for each entity (e.g. customer) lives across the data landscape. This requires data and data relationship discovery to find all instances of the same data. Once the disparate data has been found, it becomes possible to map it back to the common data definitions defined earlier, profile it to determine its quality, and then define any rules needed to clean, transform and integrate it into a state that is fit for business use. Central IT organisations typically use a suite of data management tools to do this. Not only that, but all the business metadata associated with common data definitions, and the technical metadata that defines data cleansing, transformation and integration, is typically recorded in the data management platform’s metadata repository. People who are unsure where data came from can then view that metadata lineage to see where a data item originated and how it was transformed.
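The mapping and profiling steps described above can be sketched in a few lines. Everything here is a hypothetical illustration (a toy glossary, invented source field names, a crude quality rule), not how any particular data management platform works:

```python
# A tiny 'business glossary': common data definitions agreed centrally
GLOSSARY = ("customer_name", "country_code")

# Mapping from discovered source fields to the common definitions
FIELD_MAP = {
    "cust_nm": "customer_name",    # field name found in application A
    "CUSTOMER": "customer_name",   # field name found in application B
    "ctry": "country_code",
}

def standardise(record):
    """Rename source fields to their common data definitions."""
    return {FIELD_MAP.get(k, k): v for k, v in record.items()}

def profile(records):
    """Crude quality profile: count missing values per common field."""
    missing = {field: 0 for field in GLOSSARY}
    for rec in records:
        for field in GLOSSARY:
            if not rec.get(field):
                missing[field] += 1
    return missing

raw = [{"cust_nm": "Acme Ltd", "ctry": "NL"},
       {"CUSTOMER": "", "ctry": None}]
clean = [standardise(r) for r in raw]
print(profile(clean))  # {'customer_name': 1, 'country_code': 1}
```

In a real platform the glossary, the field mappings and the profiling rules would all live in the metadata repository, which is precisely what gives central IT its lineage view.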
Given this is the case, what then is the impact of self-service BI, given that business users are now in a position to define their own data names and integrate data on their own using a completely different type of tool from those provided in a data management platform? Well, it is pretty clear that even if central IT does a great job of enterprise data governance, self-service BI is moving the goal posts. If self-service BI is ungoverned it could easily lead to data chaos, with every user creating reports and dashboards with their own personal data names and every user doing their own personal data cleansing and integration. Inconsistency could reign and destroy everything an enterprise data governance initiative has worked for. So what can be done? Are we about to descend into data chaos? Well, first of all, self-service BI tools are now starting to record a user’s actions on data during data blending, to capture exactly how he or she has manipulated it. That is, of course, a good thing. In other words, self-service BI tools are starting to record metadata lineage – but in their own repository, not that of a data management platform. Some self-service BI tools already offer reports on lineage. A good example is QlikView, which supports metadata lineage and can report on what has happened to data. However, there appear to be no standards for importing metadata from existing data management platform repositories to re-use data definitions and transformations (other than XMI for basic interchange, and even then there is no guarantee that interchange occurs). Users of the same self-service BI tool may be able to re-use data transformations defined by another user but, as far as I can see, this is only possible when all users are on the same tool.
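Recording a user’s actions on data amounts to keeping an ordered log of transformations alongside the data itself. A minimal sketch of that idea, with an invented log structure and invented field names purely for illustration:

```python
lineage = []  # ordered log of transformations, the raw material of lineage

def apply_step(record, field, operation, fn):
    """Apply a transformation to one field and log it for lineage."""
    before = record.get(field)
    record[field] = fn(before)
    lineage.append({"field": field, "operation": operation,
                    "before": before, "after": record[field]})
    return record

row = {"customer_name": "  acme ltd "}
apply_step(row, "customer_name", "trim", lambda v: v.strip())
apply_step(row, "customer_name", "titlecase", lambda v: v.title())

print(row["customer_name"])  # 'Acme Ltd'
for entry in lineage:
    print(entry["operation"], entry["before"], "->", entry["after"])
```

The catch discussed in this post is exactly where such a log lives: if each BI tool keeps it in its own proprietary repository, with no standard for exchanging it, other tools and the central data management platform cannot re-use it.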
The problem is that there appears to be no way to plug self-service BI into enterprise data governance initiatives, and certainly no way to resolve conflicts if the same data is transformed and integrated in different ways – by central IT on the one hand using a data management toolset, and by business users on the other using a self-service BI tool. If the self-service BI tool and the data management platform are from the same vendor you would like to think they would share the same repository, but I would strongly recommend you check this as there is no guarantee.
The other issue is how to create common data names. I see no way to drive consistency across both self-service BI tools (where reports and dashboards are produced) and centralised data governance initiatives that use a business glossary, especially if the two technologies are not from the same vendor. Again, even if they are, I would strongly recommend you check that integration between a vendor’s self-service BI tool and the same vendor’s data management tool suite is actually in place.
The third point to note is that BI development is now happening ‘outside in’, i.e. in the business first and then escalated up into the enterprise for enterprise-wide deployment. I have no issue with this approach, but it means enterprise data governance initiatives starting at the centre and driving re-use out to the business units are diametrically opposed to what is happening in self-service BI. Ideally we need data governance from both ends: the ability to share common data definitions, so that data names and definitions are re-used, and common data transformation and integration rules, so that re-use happens across both environments. In reality this is not yet happening, because stand-alone self-service BI tool vendors have not implemented metadata integration across BI and heterogeneous data management tool suites. This lack of integration spells only one thing: re-invention rather than re-use. Self-service BI tool vendors are determined to give all power to business users, so, like it or not, self-service data integration is here to stay. And while these vendors are rightly recording what every user does to data to provide lineage, there are no metadata sharing standards between heterogeneous data management platforms (e.g. Actian (formerly Pervasive), Global IDs, IBM InfoSphere, Informatica, Oracle, SAP BusinessObjects, SAS (formerly DataFlux)) and heterogeneous self-service BI tools. That being so, it is pretty obvious that re-use of common data definitions and transformations is not going to happen across both environments. The only chance is if both the data management platform and the self-service BI tools come out of the same stable, and even then, if you are looking for heterogeneous BI tool integration, it is not guaranteed as far as I can see.
All I can recommend right now is that if you are tackling enterprise data governance, go and see your business users and educate them on the importance of data governance to prevent chaos, at least until we get metadata sharing across heterogeneous self-service BI tools and data management platforms. If you want to learn more about this, please join me for my Enterprise Information Management masterclass in London on 26-28 February 2014.