This is a ComputerWeekly guest blogpost by analyst Mike Ferguson on the 10th anniversary of Hadoop’s becoming a separate Apache subproject.
In the 10 years since Hadoop became an Apache project, the momentum behind it as a key platform for big data analytics has been nothing short of enormous.
In that time we have seen huge strides in the technology, with new ‘component’ Hadoop applications being contributed by vendors and organisations alike. Hive, Pig, Flume and Sqoop, to name a few, have all become part of the Hadoop landscape, accessing data in the Hadoop Distributed File System (HDFS).
However, it was perhaps the emergence of Apache Hadoop YARN in 2013 that opened the floodgates by breaking the dependency on MapReduce. Today we have Hadoop distributions from Cloudera, Hortonworks, MapR, IBM and Microsoft, as well as cloud vendors like Amazon with EMR, Altiscale and Qubole.
Yet the darling technology today is Spark, with its scalable, massively parallel in-memory processing. It can run on Hadoop or on its own cluster and access HDFS, cloud storage, RDBMSs and NoSQL DBMSs. It is a key technology, combining the ability to do streaming analytics, machine learning, graph analytics and SQL data access all in the same execution environment and even in the same application.
AMPLab and then Databricks have progressed Spark functionality to the point where even vendors the size of IBM have strategically committed to its future.
From a developer perspective, we have progressed way beyond just Java, with languages like R, Scala and Python now in regular use. Interactive workbenches like Apache Zeppelin have also taken hold in the development and data science communities, speeding up analysis.
Today we are entering a new era: the era of automation, in which the skills barrier is lowered so that self-service business analysts can become so-called ‘citizen data scientists’. Data mining tools like KNIME, IBM SPSS and RapidMiner are already supporting in-memory analytics by leveraging the analytic algorithms in Spark. SAS is also running at scale in the cluster but with its own in-memory LASR server.
There is also a flood of analytic libraries emerging, like ADAM and GeoTrellis, with IBM also open sourcing SystemML.
The ETL vendors have all moved over to run data cleansing and integration jobs either natively on Hadoop (e.g. Informatica Blaze, IBM BigIntegrate and BigQuality) or on top of Spark (e.g. Talend).
Also, Spark-based self-service data preparation startups have emerged, such as Paxata, Trifacta, Tamr and ClearStory Data.
On the analytical tools front, there too we have seen enormous strides. Search-based vendors like Attivio, Lucidworks, Splunk and Connexica all crawl and index big data in Hadoop and relational data warehouses. New analytical tools like Datameer and Platfora were born on Hadoop, while the mainstream BI vendors (e.g. Tableau, Qlik, MicroStrategy, IBM, Oracle, SAP, Microsoft, Information Builders and many more) have all built connectors to Hive and other SQL on Hadoop engines.
If that is not enough, check out the cloud. Amazon, Microsoft, IBM, Oracle and Google all offer Hadoop as a Service. Spark is available as a service and there are analytics clouds everywhere.
If you think we are done you must be kidding. Apache Flink is on the scene, and security is still being built out with Apache Sentry, Hortonworks Ranger, Zettaset, IBM Guardium and more. Oh, and data governance is finally getting done, but it is still a work in progress, with the emergence of the information catalogue (Alation, Waterline Data, Semanta, IBM) together with reservoir management, data refineries… Exhausting, isn’t it?
Without a doubt, Hadoop along with Spark has transformed, and is still transforming, the analytical landscape. It has pushed analytics into the boardroom. It has extended the analytical environment way beyond the data warehouse but is not replacing it. ETL offload is a common use case: staging areas are taken off data warehouses so that CIOs can avoid data warehouse upgrades. And yet more and more data continues to pour into the enterprise to be processed and analysed. There is an explosion of data sources, with a tsunami of them coming over the horizon from the Internet of Things.
SQL on Hadoop
But strangely, here we are with increasingly fractured, distributed data, and yet the business demands more agility. Thank goodness for the fact that SQL prevails. Like it or loathe it, the most popular API on the planet is coming over the top of all of it. I’m tracking 23 SQL on Hadoop engines right now, and that excludes the data virtualisation vendors! Thank goodness for data virtualisation and external tables in relational DBMSs. If you want to create the logical data warehouse, this is where it is going to happen. Who said relational is dead? Federated SQL queries and optimisers are here to stay.
So… are you ready for all this? Do you have a big data and analytics strategy, a business case, a maturity model and a reference architecture? Are you organised for success? If you want to be disruptive in business you’ll need all of this. If you are still trying to figure it all out you can get help here.
Happy Birthday Hadoop!