Back in September last year, I presented at Big Data London (the largest Big Data interest group in Europe), looking at multi-platform Big Data analytics. In that session I looked at stand-alone platforms for analytical workloads and asked the question “Is it going to stay that way?” Would Hadoop analytical workloads remain separate from graph analytics, from complex analysis of structured data and from the traditional data warehouse? The answer in my eyes was of course a resounding no. What I observed was that integration was occurring across those platforms to create a single analytical ecosystem, with enterprise data management moving data into and between platforms. In addition, we had to hide the complexity from the users. One way of doing that was Data Virtualisation through connectivity to different types of relational and NoSQL data sources. However, for me Data Virtualisation is not enough if it doesn’t also come with optimisation and, to be fair to vendors like Cirro, Composite, Denodo and others, they have been adding optimisation to their products for some time. The point is that if you want to connect to a mix of NoSQL DBMSs, Hadoop and Analytical RDBMSs as well as Data Warehouses, On-Line Transaction Processing Systems and other data, then you very quickly need to know where the data is in the underlying systems. A global catalog is needed so that the software knows whether it has to invoke underlying MapReduce jobs to get at data in Hadoop HDFS or whether it can access that data directly, bypassing MapReduce, via Impala for example. The user is still shielded from the multiple underlying data sources and just issues SQL – a relational interface.
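To make the global catalog idea a little more concrete, here is a minimal sketch in Python. It is purely illustrative and is not how any of the vendors named above implement their catalogs: the table names, the `CatalogEntry` type, the access path labels and the `route` function are all assumptions made up for this example. It shows only the basic lookup: given the tables a SQL query touches, find out which underlying system holds each one and how that data should be reached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    system: str       # underlying platform that holds the data (hypothetical labels)
    access_path: str  # how the engine should reach the data on that platform


# Hypothetical global catalog: logical table name -> where the data lives.
CATALOG = {
    "web_logs": CatalogEntry("hadoop", "mapreduce"),       # needs a MapReduce job
    "clickstream": CatalogEntry("hadoop", "direct_hdfs"),  # read directly, bypassing MapReduce
    "customers": CatalogEntry("warehouse", "sql"),         # ordinary relational access
}


def route(tables):
    """Group the tables referenced by a query by (system, access path)."""
    plan = {}
    for name in tables:
        entry = CATALOG.get(name)
        if entry is None:
            raise KeyError(f"table {name!r} is not registered in the catalog")
        plan.setdefault((entry.system, entry.access_path), []).append(name)
    return plan


if __name__ == "__main__":
    # The user only names tables in SQL; the catalog decides where to go.
    for target, names in route(["customers", "web_logs", "clickstream"]).items():
        print(target, names)
```

The user-facing side stays relational. Only the software behind the SQL interface consults the catalog, which is what keeps the multiple data sources hidden.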
However, the next option I looked at was the Relational DBMS itself. Step by step over the years, Relational DBMSs have added functionality to fend off XML DBMSs, e.g. IBM DB2 can store native XML in the database, with no need to shred it and stitch it back together. Oracle and all other RDBMSs added user defined functions to fend off Object DBMSs and so put paid to them also. Graph databases are an emerging type of NoSQL DBMS and now IBM DB2 and SAP HANA have added graph stores built into the Relational DBMS. I asked the question then: “Is Relational going to Consume Hadoop?” It got one heck of a reaction, opening up discussion in the break. Well, let’s look further.
Teradata acquired Aster Data and now integrates with Hadoop via SQL-H to run analytics there or to bring that data into the Teradata Aster Big Analytics Appliance and analyse it there using SQL MapReduce functions. Hadoop vendors are adding SQL functionality. Hive was the first initiative. Since then we have had Hadapt, then Cloudera announced Impala and just recently Hortonworks announced Stinger to dramatically speed up Hive.
Then came a new surge from the RDBMS vendors with the rise and rise of external table functions. Microsoft announced Polybase last November at its PASS conference. A really good article by Andrew Brust covers how Microsoft SQL Server 2012 Parallel Data Warehouse (PDW) will use Polybase to get directly at HDFS data in Microsoft’s HDInsight Hadoop distribution (its port of Hortonworks), bypassing MapReduce. So SQL queries come into PDW, which accesses Hadoop data in HDInsight using Polybase (which will be released in stages).
And now today, EMC GreenPlum announced Pivotal HD, which goes the whole hog and pushes the GreenPlum Relational DBMS engine right into Hadoop, directly on top of the HDFS file system, so its new Pivotal HD Hadoop distribution has a relational engine bolted right into it. MapReduce development can still occur, and external table functions in the GreenPlum database can invoke MapReduce jobs in Hadoop, all inside the Pivotal HD cluster, ultimately with a GreenPlum MPP node instance lining up on every Hadoop data node. In short, the GreenPlum DBMS will use Hadoop HDFS as a data store. That of course presents the challenge of catering for all the file formats that can be stored in HDFS, but it is clear that it is not only EMC GreenPlum that is taking on the challenge.
In this case, as in the case of Microsoft (and the other major RDBMS vendors), the trend is obvious. We are seeing a new generation of optimizer: a cross-platform optimizer that can figure out the best place to run the analytical query workload, or part of an analytical query, so that it exploits the RDBMS engine, the Hadoop cluster, or both to the max. That optimizer is going inside the RDBMS as far as I can see, and the question will be what part of the execution plan runs in the relational engine accessing tabular data and what part of the execution plan pushes analytics right down into HDFS. It is already evident that MapReduce is getting bypassed. Impala does it, Polybase is going to do it and clearly EMC GreenPlum, IBM and Oracle will too, while leaving the option to still run MapReduce jobs. We are in transition as the Big Data world collides with the traditional world and RDBMSs push right down onto every Hadoop node to get parallel data movement between every Hadoop node and every MPP RDBMS node as queries execute. RDBMSs are going to pull HDFS data in parallel off the data nodes in a Hadoop cluster and/or push query operators down into Hadoop to exploit the full power of the Hadoop cluster and then push data back into the relational engine. It seems that Relational is going as close to Hadoop data as possible. Meanwhile everybody is beating a trail to the doors of self-service BI vendors like Tableau to simplify access to the whole platform. In this new set-up, data movement is going to have to happen at lightning speed! So where does ETL go? Well, into the cluster with everything else, right? Move it in parallel and exploit the power to the max.
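The placement decision such a cross-platform optimizer has to make can be sketched with a toy cost model in Python. This is only an illustration of the trade-off under stated assumptions, not any vendor's optimizer: the `Operator` type, the three per-row cost constants and the `place` function are all invented for this example, and a real optimizer would work from statistics rather than fixed numbers. The idea is to compare processing the rows inside Hadoop and shipping only the result against shipping all the rows and processing them in the relational engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    input_rows: int   # rows the operator reads from HDFS
    output_rows: int  # rows the operator produces


# Hypothetical per-row costs, in arbitrary units, chosen only for illustration.
HADOOP_COST_PER_ROW = 3.0    # processing a row inside the Hadoop cluster
RDBMS_COST_PER_ROW = 1.0     # processing a row inside the relational engine
TRANSFER_COST_PER_ROW = 5.0  # moving a row from Hadoop to the RDBMS


def place(op):
    """Return 'hadoop' or 'rdbms', whichever has the lower estimated cost."""
    # Push down: process every input row in Hadoop, ship only the output.
    push_down = (op.input_rows * HADOOP_COST_PER_ROW
                 + op.output_rows * TRANSFER_COST_PER_ROW)
    # Pull up: ship every input row, then process it in the relational engine.
    pull_up = op.input_rows * (TRANSFER_COST_PER_ROW + RDBMS_COST_PER_ROW)
    return "hadoop" if push_down < pull_up else "rdbms"


if __name__ == "__main__":
    selective_filter = Operator("filter", input_rows=1_000_000, output_rows=1_000)
    pass_through = Operator("project", input_rows=1_000, output_rows=1_000)
    print(selective_filter.name, "->", place(selective_filter))
    print(pass_through.name, "->", place(pass_through))
```

With these made-up numbers, an operator that discards most of its input is cheaper to push down into Hadoop, while one that returns every row it reads is cheaper to run in the relational engine. That is the kind of per-operator choice described above.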
Whether it be data virtualization with a SQL or web service interface or an MPP RDBMS, one thing is clear. Just running connectors between an RDBMS and Hadoop (or any other NoSQL DBMS for that matter) is not where this ends. These technologies are going nose to nose, with tight integration in a whole new massively parallel engine and a new, next-generation cross-platform optimizer to go with it. It’s not all about relational. Big Data has brought a whole new generation of technology and spawned a major phase of transition. It’s all going to be hidden from the user as relational opens its mouth and samples the best this new generation of technology can feed it. Who said relational was dead?