I read James Kobielus’s blog post Innovation Transforms Data Warehousing today, which talks about EDW in the Cloud becoming mainstream by the middle of the decade.
My take on this is that configuration management and workload management are critical to large scale EDWs making it to the Cloud. I am thinking more about PRIVATE cloud than public cloud at this point, as there is already an abundance of DW/BI PaaS and SaaS offerings on public clouds outside the firewall. However, on-premise private cloud deployment is still very young. There is no doubt that small data marts are already moving, but large scale EDWs are not. Why not? I believe the reason is simple: no one has any experience of how to configure virtual resources so that a production EDW (data integration, DBMS and BI platform) maximises its use of the underlying hardware.
The point I would make here is that I have not yet seen any detailed best practices from any source on porting bare metal EDWs to a virtualised private cloud: how many virtual servers are needed, how much virtual memory is needed, how to divide up virtual network channels or how to optimise virtual I/O. In addition, there is little experience of what software to install on which virtual servers, e.g. whether data integration, DBMS and BI platform should all go on the same virtual servers or on different ones, and in what combination. Also, what happens in ELT processing versus ETL processing? Instead, what I am seeing is pre-configured private cloud DW/BI appliances emerging, where the large vendors come to market with the complex configuration work already done, i.e. the configuration is already optimised for an analytic workload to get maximum value out of the hardware.
If you look at guidance documentation for virtualising RDBMSs, it is clear that, to date at least, intricate knowledge of the underlying hardware is needed in order to set this up correctly. However, there is another point here, which is that the advice is to create virtual storage (all physical disks available as a huge storage pool visible to potentially all VMs), switch on self-tuning memory management and let multiple VM instances (each with their own DBMS instance) access the virtual storage. That is a virtualised shared disk architecture. Yet for years we have seen scalable bare metal EDWs built on shared nothing MPP architectures.
There is no doubt that you could set up virtual resources on a private cloud to function as a virtual MPP architecture, but who are you going to get to do that? Data centre cloud developers who need to second guess the MPP DBMS? Or should we leave it to MPP based DBMS vendors who know how to make that work? What happens if a cloud administrator changes the configuration underneath an EDW? Does the DBA know? Will it impact the service? What experience has a cloud developer got here? I would argue that it is very early days. Only Teradata has any long term expertise in virtual MPP architectures for EDW, with virtual resources such as Virtual AMPs having been around for well over a decade on their platform. Even then, they ship their own hardware with it, balanced out of the box, and provide tooling to monitor every aspect of virtual resource in detail. Also, no one has issued any guidance on balancing the right amounts of virtual resources or on the best virtual configurations needed to scale data integration or BI platforms.
So I agree that EDW on Private Cloud, at least, is by no means a done deal. Certainly not this year. We still have a lot to learn. As one CIO put it to me recently, “why would I port a 100TB platform on dedicated hardware to a virtual server environment?” Indeed, why would he?
The bottom line is that, for now at least, he should not. There is no hard and fast rule that says everything HAS to be virtualised. I am not saying don’t do it. What I am saying, however, is that people will want proof before they start moving these kinds of systems off scalable bare metal hardware and relying on virtual scalability instead.
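To make the shared nothing versus shared disk distinction concrete, here is a minimal Python sketch of how a shared nothing MPP layout spreads rows across nodes so that each node only ever touches its own slice of the data. It is purely illustrative: the node count, the CRC32 hash and the function names are my own assumptions and do not describe how any particular MPP DBMS distributes data.

```python
import zlib
from collections import defaultdict

NODES = 4  # hypothetical number of shared nothing nodes


def owner(key, nodes=NODES):
    """Pick the node that owns a row by hashing its distribution key."""
    return zlib.crc32(key.encode("utf-8")) % nodes


def distribute(rows, nodes=NODES):
    """Place each (key, value) row on exactly one node's private slice."""
    slices = [[] for _ in range(nodes)]
    for key, value in rows:
        slices[owner(key, nodes)].append((key, value))
    return slices


def parallel_sum(slices):
    """Each node aggregates only its own slice; results are then combined."""
    merged = {}
    for local_rows in slices:
        local = defaultdict(float)
        for key, value in local_rows:
            local[key] += value
        # A key always hashes to the same node, so the per-node results
        # never overlap and can simply be unioned.
        merged.update(local)
    return merged


rows = [("north", 10.0), ("south", 5.0), ("north", 2.5)]
print(parallel_sum(distribute(rows)))
```

In a shared disk setup, by contrast, every DBMS instance would read from one common storage pool, so the question of which node "owns" a row does not arise in the same way. That is exactly the mapping problem someone has to solve when an MPP design is laid over virtualised storage.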
We have also learned over the years that data integration often scales better using ELT rather than ETL. Pushdown optimisation is the secret here, such that the data integration server leverages the underlying massively parallel DBMS to do the heavy lifting. So, what happens if ELT pushdown optimisation runs on an MPP architecture mapped on to a shared disk virtualised environment? Would that not impact performance? I would think so. Again, there is very little experience here, compounded by the fact that cloud developers are not in the same teams as DBAs, data integration developers and BI developers. So we are entering an era where the ground is being changed under our feet by developers not necessarily skilled in DW/BI system deployments. Is it any wonder that vendors shipping private cloud analytical appliances are getting attention?
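As a rough illustration of the ETL versus ELT pushdown difference described above, the sketch below computes the same totals in two ways: once by pulling every row out to the "integration server" and aggregating there, and once by shipping the SQL to the database and letting it do the work. It uses Python's built-in sqlite3 module purely as a stand-in for a DBMS; the table and function names are hypothetical, and SQLite is of course not an MPP engine.

```python
import sqlite3


def load_source(conn):
    """Create and populate a small, made-up source table."""
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10.0), ("south", 5.0), ("north", 2.5)],
    )


def etl_totals(conn):
    """ETL style: extract every row, then transform outside the DBMS."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def elt_totals(conn):
    """ELT pushdown: the transformation runs as SQL inside the DBMS."""
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))


conn = sqlite3.connect(":memory:")
load_source(conn)
print(etl_totals(conn))
print(elt_totals(conn))
```

Both functions return the same answer; the difference is where the rows travel. In the ELT version only the SQL statement and the small result cross the wire, which is why its performance depends so heavily on how the DBMS underneath is laid out.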
With a data centre team configuring VMs and virtual resources, and BI developers, data integration developers and DBAs now NOT responsible for installation, I think we have a way to go before large enterprises will just ‘up sticks’ and move every part of a BI/DW system to the cloud. ‘Steady as you go’ is my advice.