Data Warehouse Automation & Real-time Data – Reducing Time to Value in a Distributed Analytical Environment
In my last blog I looked at Big Data governance and how it builds confidence in the structured and multi-structured data that data scientists want to analyse. I would like to continue that theme in this blog by looking at what is happening in Big Data governance in a little more detail. Over the last two years several data management software vendors have extended their products to support Big Data platforms such as Hadoop. Initially this meant supporting Hadoop both as a target for provisioning data for exploratory analysis and as a source for moving insights derived on Hadoop into data warehouses. More recently, several vendors have evolved their data cleansing and integration tools to exploit Hadoop by implementing ELT processing on that platform, much as they did on data warehouse systems; scalability and cost are the major reasons. This has prompted some organizations to consider loading all data into a Hadoop cluster for ELT processing via generated 3GL, Hive or Pig jobs running natively on a low-cost Hadoop cluster.
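To make the idea of generated ELT jobs concrete, here is a minimal sketch of the kind of HiveQL statement such a tool might emit, built by a small Python generator. The table and column names are invented for illustration and do not come from any particular vendor's product.

```python
# Hypothetical sketch: the sort of Hive ELT statement a data integration tool
# might generate to push transformation work down onto a Hadoop cluster.
# Table and column names (customer_id, country_code, order_total) are assumptions.

def generate_hive_elt(source_table: str, target_table: str) -> str:
    """Build a simple ELT statement that cleanses and loads data inside Hive."""
    return (
        f"INSERT OVERWRITE TABLE {target_table}\n"
        f"SELECT TRIM(customer_id),\n"
        f"       UPPER(country_code),\n"
        f"       CAST(order_total AS DECIMAL(10,2))\n"
        f"FROM {source_table}\n"
        f"WHERE customer_id IS NOT NULL;"
    )

print(generate_hive_elt("staging.raw_orders", "warehouse.clean_orders"))
```

The point of the pushdown is that the SELECT with its cleansing expressions executes as MapReduce work on the cluster itself, so no data leaves Hadoop during the transform.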
Several vendors have now also added new tools to parse multi-structured data, together with MapReduce transforms that run data cleansing and data integration ELT processing on large volumes of that data in Hadoop.
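A cleansing transform of this kind can be sketched as a Hadoop Streaming mapper: it reads multi-structured (here, JSON-lines) input on stdin, parses each record, and emits cleansed tab-separated rows. The field names are assumptions chosen for illustration, not taken from any specific tool.

```python
# Minimal sketch of a Hadoop Streaming mapper that parses multi-structured
# (JSON-lines) input and emits cleansed tab-separated records.
# Field names "customer_id" and "email" are illustrative assumptions.
import json
import sys

def cleanse(line: str):
    """Parse one JSON record; return a cleansed TSV row, or None if malformed."""
    try:
        record = json.loads(line)
    except (json.JSONDecodeError, TypeError):
        return None  # drop records that fail to parse
    cid = str(record.get("customer_id", "")).strip()
    if not cid:
        return None  # drop records missing the key field
    email = str(record.get("email", "")).strip().lower()
    return f"{cid}\t{email}"

if __name__ == "__main__":
    # Hadoop Streaming feeds input splits line by line on stdin.
    for line in sys.stdin:
        row = cleanse(line)
        if row is not None:
            print(row)
```

Run under Hadoop Streaming, a script like this scales out across the cluster, which is exactly the appeal of doing the cleansing where the data already lives.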
Extending data governance and data management platforms to exploit scalable Big Data platforms not only lets customers get more out of existing investments but also improves confidence in Big Data among the data scientists and business analysts who want to analyse that data to produce valuable new insights. Developers could, of course, build such processing themselves using the Hadoop HDFS APIs. However, because many data cleansing and integration tools also support metadata lineage, data scientists and business analysts working in a Big Data environment can see how Big Data has been cleaned and transformed en route to being made available for exploratory analysis. That kind of visibility builds confidence in the use of the data. In addition, there is nothing to stop data scientists using these tools themselves by exploiting pre-built components and templates. Having workflow-based data capture, preparation and even analytical tools available in a Big Data environment also improves productivity when programming skills are lacking.
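The lineage idea can be illustrated with a small sketch: each time a transform runs, a lineage entry is recorded so an analyst can later trace how a dataset was derived. This is not any vendor's API, just a hypothetical illustration of what a lineage record might carry.

```python
# Illustrative sketch (not a real product API): record a lineage entry each
# time a transform runs, so analysts can later see how data was derived.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEntry:
    source: str      # where the data came from
    target: str      # where the transformed data was written
    transform: str   # human-readable description of what was done
    run_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

lineage_log: list = []

def record_lineage(source: str, target: str, transform: str) -> LineageEntry:
    """Append a lineage entry for one transform run and return it."""
    entry = LineageEntry(source, target, transform)
    lineage_log.append(entry)
    return entry

entry = record_lineage(
    "hdfs:///raw/orders",
    "hdfs:///clean/orders",
    "trim customer ids, standardise country codes",
)
print(entry.source, "->", entry.target)
```

In a real tool the lineage metadata is captured automatically by the platform rather than by hand, but the content is the same: source, target, transform, and when it ran.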
I shall be discussing Big Data governance in more detail at my upcoming Big Data Multi-Platform Analytics class in London on October 17-18. Please register here if you want to attend.