Exploratory analytics is at the heart of most Big Data projects. It involves loading data from multi-structured sources into ‘sandboxes’ for exploration and investigative analysis, often by skilled data scientists, with the intent of producing new insights.
Data being loaded into these sandboxes may include structured, modeled data from existing OLTP systems, data warehouses and master data management systems as well as un-modeled data from internal and external data sources. This could include customer interaction data, web log data, social network interactions, sensor data, documents, rich media content and more.
Because Big Data is often un-modeled, the schema of this data is not known – it is schema-less. The argument is that data scientists need the freedom to acquire, prepare, analyse and visualize the data in any way they like.
Yet Big Data needs data governance to protect sensitive information made available in these exploratory environments. Given that governance imposes control and accountability, how can the two seemingly opposing forces of freedom and control co-exist in a Big Data analytical environment? Is this not a tug of war? Is Big Data Governance a straitjacket for data scientists? How can exploratory analysis proceed freely if the data being analysed is subject to governance policies and processes?
The answer is obvious. Confidence. Big Data Governance is not about restricting data scientists from doing exploratory analysis. It is about extending the reach of data management and data governance technologies from the traditional structured data world into a Big Data environment to:
- Protect sensitive data brought into this environment
- Control who has access to Big Data files, tables and sandboxes
- Monitor data scientist and application activities in conformance with regulatory and legislative obligations
- Provide audit trails
- Provide data relationship discovery capability
- Improve the quality of Big Data before analysis occurs
- Integrate data from traditional and Big Data sources before analysis occurs
- Provide business metadata in a Big Data environment to data scientists and business analysts
- Provide the ability to assign new business data definitions and descriptions to newly discovered insights produced by data scientists before moving it into traditional data warehouses and data marts
- Provide metadata lineage in a Big Data environment to data scientists and business analysts
- Handle Big Data lifecycle management
All of this is about raising the bar in quality and confidence. Having high-quality data before exploratory analysis takes place improves confidence in the data and also in the results. Big Data Governance is therefore not an opposing force to free-form exploratory analytics. On the contrary, it fuels confidence in Big Data analytical environments.
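To make three of the points in the list concrete (protecting sensitive data, controlling access to sandboxes, and providing audit trails), here is a minimal Python sketch. Everything in it is a hypothetical illustration rather than any particular product's API: the field names in `SENSITIVE_FIELDS`, the `SANDBOX_ACL` mapping, the in-memory `audit_log` and the functions `mask`, `protect` and `read_sandbox` are all assumptions made for the example. A real deployment would rely on dedicated masking, access control and auditing tools.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical policy: which fields are sensitive, and who may read each sandbox.
SENSITIVE_FIELDS = {"email", "card_number"}
SANDBOX_ACL = {"churn_sandbox": {"alice"}}

audit_log = []  # in-memory audit trail, for illustration only


def mask(value: str) -> str:
    """Replace a sensitive value with a truncated SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def protect(record: dict) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    return {
        key: mask(str(val)) if key in SENSITIVE_FIELDS else val
        for key, val in record.items()
    }


def read_sandbox(user: str, sandbox: str, records: list) -> list:
    """Check access, record the attempt, and return masked records."""
    allowed = user in SANDBOX_ACL.get(sandbox, set())
    # Every attempt is logged, whether or not it is allowed.
    audit_log.append({
        "user": user,
        "sandbox": sandbox,
        "allowed": allowed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    if not allowed:
        raise PermissionError(f"{user} may not read {sandbox}")
    return [protect(r) for r in records]
```

Calling `read_sandbox("alice", "churn_sandbox", rows)` returns the rows with the sensitive fields masked and everything else untouched, so the data scientist can still explore freely. A call by a user who is not in the access list raises `PermissionError`, and both attempts end up in `audit_log`. Note that an unsalted, truncated hash is only a stand-in for proper masking or tokenisation.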
I shall be discussing Big Data governance in more detail at my upcoming Big Data Multi-Platform Analytics class in London on October 17-18. Please register here if you want to attend.