Data Warehouse Automation & Real-time Data – Reducing Time to Value in a Distributed Analytical Environment
Creating Data Products in a Data Mesh, Data Lake or Lakehouse for Use in Analytics (9-10 June 2022, Stockholm)
Smart Infrastructure & Smart Applications for the Smart Business – Infrastructure & Application Performance Monitoring
Last week in London I spoke at the IRM Data Warehousing and Business Intelligence conference on a variety of topics. One of these was Big Data which I looked at in the context of analytical processing. There is no question the hype around this topic is reaching fever pitch so I thought I would try to put some order on it.
First, I am sure like many other authors in this space I need to define Big Data in the context of analytical processing to make it clear what we are talking about. Big Data is a marketing term and not the best of terms at that. A new reader in this market may well assume that this is purely about data volumes. Actually this is about being able to solve business problems that we could no solve before. Big data can and more often than not include a variety of ‘weird’ data types. In that sense big data can be structured or poly-structured (where poly in this context means many). The former would include high volume transaction data such as call data records in telcos, retail transaction data and pharmaceutical drug test data. Poly-structured data is more difficult to process and includes semi-structured data like XML and HTML and unstructured data like text, image, rich media etc. Graph data is also a candidate.
From the experiences I have had in working in this area to date, I would say that web data, social network data and sensor data are emerging as very popular types of data in big data analytical projects. Web data includes web logs and e-commerce logs such as those generated by on-line gaming and on-line advertising data. Social network data would include twitter data, blogs etc. These are examples of interaction data which is something that has grown significantly over recent years. Sensor data is machine generated data from ’An Internet of Things’. It is something we have only seen the beginning of in my opinion as much of it remains un-captured. RFIDs are probably the most written about of sensors. However these days we have sensors to measure temperature, light, movement, vibration, location, airflow, liquid flow, pressure and much more. There is no doubt that sensor data is on the increase and in my opinion it is something that will dwarf pretty well everything in terms of volume. Telcos, utilities, manufacturing, insurance, airlines, oil and gas, pharmaceuticals, cities, logistics, facilities management and retail…..they are all jumping on the opportunity to use of sensor data to ‘switch on the lights’ in parts of the business where they have had no visibility before. Sensor data is massive but we don’t want it all – it is the variance we are interested in. Many Big Data analytical applications are/will emerge on the back of sensor data. These include analytical applications for use in:
- Supply chain optimisation
- Energy optimisation via sustainability analytics
- Asset management
- Location based advertising
- Grid health monitoring
- Smart metering
- Traffic optimisation
- Etc., etc.
Text as I already mentioned is also a prime candidate for big data analytical processing. Sentiment analysis, case management, competitor analysis are just a few examples of a popular types of analysis on textual data. Data sources like Twitter are obvious candidates but tweet stream data suffers from data quality problems that still have to be handled even in a big data environment. How many times do you see spelling mistakes in tweets for example.
There is a lot going on that is of interest to business in big data but while all of it offers potential return on investment, it is also increasing complexity. New types of data are being captured from internal and external data sources, there is an increasing requirement for faster data capture, more complex types of analysis are now in demand and new algorithms and tools are appearing to help us do this.
There are several reasons why big data is attractive to business. Perhaps for the first time, entire data sets can now be analysed and not just subsets. This is now a feasible option whereas it was not before. So it is making enterprise think can we go down a level of detail? Is it worth it? Well to many it most certainly is. Even a 1% improvement brought about by analysing much more detailed data is significant for many large enterprises and well worth doing. Also schema variant data can now be analysed for the first time which could add a lot of valuable insight to that offered up by traditional BI systems. Think of an insurance company for example. Any insurer whose business primarily comes from a broker network will receive much of its data in non-standard document format. Only a small percentage of that data finds its way into underwriting transaction processing systems while much of the valuable insight is left in the documents. Being able to analyse all of the data in these documents could offer up far more business value that could improve risk management and loss ratios.
At the same time there are inhibitors to big data analysis. These include finding skilled people and a real lack of understanding around when to use Hadoop versus when to use Analytical RDBMS versus NoSQL DBMS. On the skills front there is no question that the developers involved in Big Data projects are absolutely NOT your traditional DW/BI developers. Big Data developers are primarily programmers – not a skill often seen in a BI team. Java programmers are aften seen at big data meet ups. In addition, the analysis is primarily batch oriented with map / reduce programs being run and chained together using scripting languages like Pig Latin and JAQL (if you use the Hadoop stack that is)
Challenges with Big Data
There is no question that big data offers up challenges. These include challenges in the areas of:
- Big data capture
- Big data transformation and integration
- Big data storage – where do you put it and what are the options?
- Loading big data
- Analysing big data
Over this and my next few blogs we will look at these challenges. Looking at the first one on big data capture, the issues are latency and scalability. Latency needs change data capture, micro batches etc. However I think it is fair to say that if Hadoop is chosen as the analytical platform, it is not geared up for very low latency. Very low latency would lean towards stream processing as a big data technology which I will address in another blog. Scaling data integration to handle Big Data can be tackled in a number of ways You can use DI software that implements ELT processing i.e. exploits the parallel processing power of an underlying MPP based analytical database. You can make use of data integration software that has been rewritten to exploit multi-core parallelism (e.g. Pervasive DataRush). Alternatively you can use data integration accelerators like Syncsort DMExpress or exploit Hadoop Map/Reduce from within data integration jobs e.g. Pentaho Data Integrator. Or you could use specialist data integration software like Scribe log aggregation software (originally written by Facebook). Also vendors like Informatica have also announced a new HParser to help with data in a Hadoop environment.
With respect to storing data, there are a number of storage options for analysing Big Data. They range from:
- Classic relational RDBMS (e.g. IBM DB2, Oracle, MySQL, Microsoft SQL Server)
- Analytical RDBMS (e.g. ExaSol, HP Vertica, IBM Netezza, ParAccel, Oracle Exadata, Teradata)
- Hadoop solutions (e.g. HDFS, HBase and Hive)
- Analytical RDBMS with Hadoop Map/Reduce integration (e.g. Teradata AsterData, EMC GreenPlum HD)
- NoSQL DBMSs.
Let’s dispel a myth right away. The idea that relational database technology cannot be used as a DBMS option for big data analytical processing is plain nonsense. Any analyst opinion claiming that should be ignored. Teradata, ExaSol, ParAccel, HP Vertica, IBM Netezza are all classic examples of analytical RDBMSs that can scale to handle big data applications with some of these vendors having customers in the Petabyte club. Improvements such as solid state disk, columnar data, in-database analytics and in-memory processing have all helped Analytical RDBMSs scale to higher heights. So it is an option for a big data analytical project perhaps more so with structured data.
Hadoop is an analytical big data storage option that has often been associated more with poly-structured data. Text is a common candidate. NoSQL databases like Neo4J or InfiniteGraph graph databases are candidates particularly in the area of Social Network influencer analysis. So it depends on what you are analysing.
Going back to Hadoop, the stack includes HDFS – a distributed file system that partitions large files across multiple machines for high-throughput access to application data. It allows us to exploit thousands of servers for massively parallel processing which can be rented on a public cloud if needs be. To exploit the power of Hadoop, developers code programs using a programming framework known as Map/Reduce. These programs run in batch to perform analysis and exploit the power of thousands of servers in a shared nothing architecture. Execution is done in two stages. Map and Reduce. Mapping refers to the process of breaking a large file into manageable chunks that can be processed in parallel. Reduce then processes the data to produce results. Hadoop Map/Reduce is therefore NOT a good match where:
- Low latency is critical for accessing data
- Processing a small subset of the data within a large data set
- Real-time processing of data that must be immediately processed
Also Hadoop is not normally a RDBMS competitor either. On the contrary it expands the opportunity to work with a broader range of content and so Big Data analytical processing conducted on Hadoop distributions is often upstream from traditional DW/BI systems. The insight derived from that processing then often finds its way into a DW/BI system. There are a number of Hadoop distributions out there including Cloudera, EMC GreenPlum HD (a resell of MapR), Hortonworks, IBM InfoSphere BigInsights, MapR and Oracle Big Data Appliance. Hadoop is still an immature space with vendors like ZettaSet bolstering the management of this kind of environment. To appeal to the SQL developer community Hive was created with a SQL like query language. In addition Mahout supports a lot of analytics than can be used in Map/Reduce programs. It is an exciting space but by no means a panacea. Vendors such as IBM, Informatica, Radoop, Pervasive (TurboRush for Hive and DataRush for Map/Reduce, Hadapt, Syncsort (DMExpress for Hadoop Acceleration), Oracle, and many others are all trying to gain competitive advantage by adding value to it. Some enhancements appeal more to Map/Reduce developers (e.g. Teradata, IBM Netezza, HP Vertica connectors to Cloudera) and some to SQL developers (e.g. Teradata AsterData SQL Map/Reduce, Hive). One thing is sure – both need to be accommodated.
Next time around I’ll discuss analysing big data in more detail. Look out for that and if you need help on a Big Data strategy feel free to contact me