The arrival of Big Data has seen a more complex analytical landscape emerge, with multiple data stores beyond the traditional data warehouse. At the centre of this is Hadoop. However, other platforms like data warehouse appliances, NoSQL, graph and stream processing platforms have also pushed their way into the analytical landscape. Now that these new platforms together form the analytical ecosystem, it is clear that managing data in a big data environment has become more challenging. It is not just because of the arrival of new kinds of data store. The characteristics of Big Data, such as volume, variety and velocity, also make big data more challenging to manage. The arrival of data at very high rates, combined with the sheer volume of big data, means that certain data management activities need to be automated rather than performed manually. This is especially the case in data lifecycle management.
In any data life cycle, data is created, shared, maintained, archived, retained and deleted. A big data environment is no different. But when you add characteristics like volume, variety and velocity into the mix, the reality of managing big data throughout its lifecycle really challenges the capability of existing technology.
In the context of big data lifecycle management, a key question is: what is the role of Hadoop? Of course, Hadoop can play several roles. Popular ones include:
- Hadoop as a landing zone
- Hadoop as a data refinery
- Hadoop as a data hub
- Hadoop as a data warehouse archive
Hadoop as a landing zone and Hadoop as a data refinery fit well into the data lifecycle between the CREATE and SHARE stages. With big data, what happens between those two stages is a challenge. Before data is shared it needs to be trusted. That means data needs to be captured, profiled, cleaned, integrated and refined, with sensitive data identified and masked so that it is not visible to unauthorised users once shared.
In the context of Hadoop as a landing zone, big data may arrive rapidly. Therefore technologies like stream processing are needed to filter data in motion so that only the data of interest is captured. Once data is captured, volume and velocity could easily make it impossible to profile it manually, as would be typical in a data warehouse staging area. Automated data profiling and relationship discovery are therefore needed so that attention is drawn quickly to data that needs to be cleaned. Data cleansing and integration also need to exploit the power of Hadoop MapReduce for performance and scalability in ETL processing in a big data environment.
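The idea of filtering data in motion so that only records of interest land in the landing zone can be sketched as follows. This is a minimal illustration in plain Python; the event source, field names and the "of interest" predicate are hypothetical, and in practice this logic would run inside a stream processing engine rather than a simple generator.

```python
# Minimal sketch: filter a stream of events before landing them.
# The events and threshold below are illustrative assumptions only.

def filter_stream(events, is_of_interest):
    """Yield only the events worth landing, discarding the rest in motion."""
    for event in events:
        if is_of_interest(event):
            yield event

# Hypothetical sensor readings arriving at high rate.
readings = [
    {"sensor": "s1", "value": 10},
    {"sensor": "s2", "value": 95},
    {"sensor": "s3", "value": 87},
]

# Keep only readings above an alert threshold.
landed = list(filter_stream(readings, lambda e: e["value"] > 80))
```

The point is that the filtering decision is made while the data is still in motion, so low-value records never consume landing-zone storage at all.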
Once big data is clean we can enter the data refinery, which is where we see Hadoop used as an analytical sandbox. Several analytical sandboxes may be created and exploratory analysis performed to identify high value data. At this point, auditing and security of access to big data need to be managed, so that exactly what activities are being performed on the data is monitored and unauthorised access to analytical sandboxes and the data within them is prevented. If master data is brought into Hadoop during the refining process, to analyse big data in context, then it must be protected: sensitive master data attributes, like personal information, need to be masked before the data is brought into a data refinery to add context to exploratory analysis tasks.
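Masking sensitive master data attributes before they enter a refinery could look something like the sketch below. The field names and the hashing approach are assumptions for illustration; real data masking products offer richer techniques (format-preserving masking, tokenisation and so on).

```python
import hashlib

# Hypothetical masking routine: sensitive attributes are replaced by an
# irreversible token before the record enters an analytical sandbox.
# The field names here are illustrative assumptions, not a product API.

SENSITIVE_FIELDS = {"name", "email"}

def mask_record(record):
    """Return a copy of the record with sensitive attributes hashed."""
    masked = dict(record)
    for field in SENSITIVE_FIELDS:
        if field in masked:
            digest = hashlib.sha256(str(masked[field]).encode()).hexdigest()
            masked[field] = digest[:12]  # short one-way token
    return masked

customer = {"id": 42, "name": "Alice Smith",
            "email": "alice@example.com", "segment": "gold"}
safe = mask_record(customer)
```

Because the hash is one-way, analysts can still join and count on the masked values without ever seeing the underlying personal information.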
Once data refining has been done, new data can be published for authorised consumption. That is when Hadoop takes on the role of a data hub. Data can be moved from the data hub into Hadoop HDFS, Hive, data warehouses, data warehouse analytical appliances and NoSQL databases (e.g. graph databases) to be combined with other data for further analysis and reporting. At this point we need built-in data governance, whereby business users can enter the data hub and subscribe to receive data sets and data subsets in a format they require. In this way even self-service access to big data is managed and governed.
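The subscription idea, where a consumer requests a published data set in the format they require, can be illustrated with a toy example. The registry, data set name and supported formats below are assumptions made for illustration; a real data hub would also enforce entitlements before serving anything.

```python
import csv
import io
import json

# Toy "data hub": published data sets keyed by name.
# The data set and its contents are illustrative assumptions.
PUBLISHED = {
    "refined_sales": [
        {"region": "EMEA", "revenue": 120},
        {"region": "APAC", "revenue": 95},
    ],
}

def subscribe(dataset, fmt="json"):
    """Serve a published data set serialised in the requested format."""
    rows = PUBLISHED[dataset]  # a real hub would check entitlements here
    if fmt == "json":
        return json.dumps(rows)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError("unsupported format: " + fmt)

as_csv = subscribe("refined_sales", fmt="csv")
```

The consumer never touches the raw landing zone: they only see data that has already been refined, masked and published, in the shape they asked for.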
Going further into the lifecycle, when data is no longer needed or used in data warehouses, we can archive it. Rather than archiving to tape and taking it offline, Hadoop gives us the option to keep it online by archiving it to Hadoop. However, if sensitive data in a data warehouse is archived, we must make sure masking is applied before it is archived, to protect it from unauthorised use. Similarly, if we archive data to Hadoop, we may need to archive the data from several databases (e.g. data warehouses, data marts and graph databases) to preserve integrity and make a clean archive.
Finally, there will be a time when we delete data. Even with storage being cheap, the idea that all data will be kept forever is just not going to happen. We need policies to govern that, of course. Furthermore, as with archiving, these policies need to be applied across the whole analytical ecosystem, with the added complexity that big data means these tasks need to be executed at scale. These are just some of the things to think about in governing and managing the big data life cycle. Join me on IBM’s Big Data Management Tweet chat on Wednesday April 16th at 12:00 noon EST to discuss this in more detail.